[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T0000). Please do the needful. [00:00:05] yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:09] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:01:39] Dereckson: new wiki? was the addWiki.php script from WikimediaMaintenance run? What was the output? [00:02:02] * MarcoA wonders if a patch of his could be added [00:03:27] ebernhardson: addWiki.php run was incomplete for Cirrus, so to destroy and recreate index would probably a good idea: it timed out after some minutes, I did a ctrl + c, resumed without the done step, and it finished like a charm [00:04:01] ebernhardson: so basically the Cirrus part ran twice, first with a timeout, then without error [00:04:11] Dereckson: it always runs twice, there are two clusters [00:05:05] (03PS1) 10Filippo Giunchedi: add instances for restbase101[678] [dns] - 10https://gerrit.wikimedia.org/r/326855 (https://phabricator.wikimedia.org/T150964) [00:07:30] MarcoA, yep you can add a patch [00:08:25] (03CR) 10Yurik: tilerator: deploy config with scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [00:08:36] MaxSem: done :) [00:09:07] (03PS1) 10Kaldari: Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) [00:10:11] (03CR) 10jenkins-bot: [V: 04-1] Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [00:11:28] (03PS2) 10Kaldari: 
Add cron job for PageAssessments maintenance script to puppet [puppet] - https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026)
[00:11:29] MarcoA, hmm - no community consensus?
[00:11:35] Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (fgiunchedi) I've ran into the same thing while installing restbase1016, assembling the array manually the first time worked and subsequent reboots were fine. Oddly enough the same didn't happen...
[00:11:39] !log created search indices and re-indexed existing pages for ecwikimedia
[00:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:08] no, it's a security issue. Never used, never really requested, with typos, can't be assigned locally...
[00:12:56] split from https://phabricator.wikimedia.org/T144638
[00:14:44] Dereckson: Oh shit.
[00:15:07] mmm no, I don't feel confident about deploying this - sorry
[00:15:16] MaxSem: it's fine
[00:15:31] I'll schedule it for another window
[00:15:36] Dereckson: Looks like https://gerrit.wikimedia.org/r/#/c/323236/ got scheduled for European mid-day which I can't attend.
[00:15:39] but that right is going
[00:15:59] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[00:17:20] odder: MaxSem is swatting
[00:17:39] ebernhardson: thanks, I'm asking someone from ec to retry
[00:17:42] an edit.
[00:17:54] swat is empty now Dereckson
[00:21:14] MaxSem: odder would like to add https://gerrit.wikimedia.org/r/#/c/323236/ to SWAT
[00:21:49] odder, looking
[00:22:07] Yeah, I meant to do it straight away, added it to the wrong window
[00:22:59] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[00:23:57] Operations, Cassandra, hardware-requests, Services (blocked), Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2866850 (GWicke) @RobH, could you update this task with a summary of the progress so far & ideally an estimate of the ETA fo...
[00:25:18] (CR) MaxSem: [C: 2] Add localized logo for Gujarati Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/323236 (https://phabricator.wikimedia.org/T121853) (owner: Odder)
[00:25:43] MarcoA: if it can't be assigned locally, how is it a security issue?
[00:26:09] p858snake|L2: editinterface permissions are sensitive
[00:26:23] more than any other
[00:26:30] (PS1) Filippo Giunchedi: hieradata: add restbase101[678] [puppet] - https://gerrit.wikimedia.org/r/326862 (https://phabricator.wikimedia.org/T150964)
[00:26:33] i'm more than aware
[00:26:50] and we're (stewards) certainly not going to assign that to anyone
[00:27:10] so there are two options: keep it there for show or get rid of them
[00:27:21] and I prefer being hygienic
[00:27:25] yeah - still why would it hurt to ask the community?
[00:27:41] (CR) Filippo Giunchedi: [C: 2] add instances for restbase101[678] [dns] - https://gerrit.wikimedia.org/r/326855 (https://phabricator.wikimedia.org/T150964) (owner: Filippo Giunchedi)
[00:27:47] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[00:27:58] (PS3) MaxSem: Add localized logo for Gujarati Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/323236 (https://phabricator.wikimedia.org/T121853) (owner: Odder)
[00:28:22] stabstabstab gerrit
[00:28:26] It doesn't hurt certainly
[00:29:00] Operations, Cassandra, hardware-requests, Services (blocked), Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2866874 (RobH) @gwicke: Task T151075 tracks the setup and installation of these hosts. It shows that while the hosts have b...
[00:30:09] odder, pulled on mwdebug1002
[00:30:26] MarcoA: I opened a discussion on the embassy to reconfigure the interface editor for ur.wikipedia, and get support there for the request, try perhaps the same on tr.wikiquote, with notifications to active users + original group requester?
[00:31:59] Dereckson: I'll try that tomorrow.
[00:32:28] * MarcoA is curious how 20after4 can edit Hiera: pages at wikitech when he's just "shell" there...
[00:32:29] https://tr.wikiquote.org/w/index.php?title=Vikis%C3%B6z:K%C3%B6y_%C3%A7e%C5%9Fmesi&oldid=135584
[00:33:01] wiki community seems active
[00:33:54] ebernhardson: users can now edit on this wiki, thanks
[00:34:25] I'm pretty sure that those votes were canvased from trwiki when they requested both groups. Looking at https://tr.wikiquote.org/wiki/%C3%96zel:SonDe%C4%9Fi%C5%9Fiklikler it seems a very different situation
[00:34:38] in any case I'll notify them
[00:36:32] odder, does it look good?
[00:36:47] MaxSem: Can't see it...
[00:40:36] https://gu.wikiquote.org/static/images/project-logos/guwikiquote-1.5x.png works with debug headers
[00:41:32] okay, wfm
[00:42:16] odder, I'm using https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[00:43:27] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:43:47] MaxSem: I meant that it doesn't appear on the actual wiki yet.
[00:44:33] (PS5) Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426)
[00:44:45] Krenair: so, for ec.wikimedia, main page is currently set to "Página principal", but in InitialiseSettings.php 'wgWhitelistRead' => [ 'private' => [ 'Main Page', 'Special:UserLogin', 'Special:UserLogout' ], ...],
[00:45:09] yeah, you'll have to override wgWhitelistRead for that wiki
[00:45:15] it's assumed that private wikis = english
[00:45:15] Krenair: private wikis must use "Main Page", or communicate the main page they want so we can override
[00:45:18] indeed
[00:45:42] !log maxsem@tin Synchronized static/images/project-logos: https://gerrit.wikimedia.org/r/#/c/323236/ (duration: 00m 57s)
[00:45:43] (CR) Filippo Giunchedi: [C: 2] prometheus: export gdnsd stats via node_exporter [puppet] - https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: Filippo Giunchedi)
[00:45:48] while we have something like 5 non-english private wikis
[00:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:11] * MarcoA is curious how 20after4 can edit Hiera: pages at wikitech when he's just "shell" there...
[00:46:30] OpenStackManager allows it without extra rights if you're a projectadmin of the appropriate project
[00:46:54] https://il.wikimedia.org/wiki/Main_Page
[00:47:00] created in English too
[00:47:01] weird wiki :P
[00:47:52] ilwikimedia or wikitech?
[00:47:59] both
[00:48:04] yeah
[00:48:07] but MarcoA probably said wikitech
[00:48:25] MaxSem: I believe you still need to synchronise InitialiseSettings.php on tin?
[00:48:33] yep
[00:48:43] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/323236/ (duration: 00m 44s)
[00:48:47] so perhaps in January a goal could be to check every private wiki to be sure we've a nice public homepage at each of them
[00:48:47] I waited a bit to avoid race conditions
[00:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:52] MaxSem: Sweet beans, thank you.
[00:50:02] Looks good.
[00:51:01] now the fun part: caching
[00:52:27] Filled https://phabricator.wikimedia.org/T153031 to check the private dblist
[00:55:33] Operations, Operations-Software-Development: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#2866959 (Volans) Regarding the first issue (not removing all the nodes that were removed from the yaml): - the problem is that in [[ https://phabricator.wikimedia.org/diffusion/OSCT/b...
[00:55:34] is it possible to oversight moodbar entries?
[00:55:58] I need to
[00:56:33] Is it a long entry, MarcoA?
[00:56:52] A log*
[00:56:55] wasn't that one of the extensions that historically had poor deletion support?
[00:56:57] ha :-)
[00:57:03] and it's content, yep
[00:57:21] maybe Krinkle knows
[00:57:46] there's a moodbar-delete right which can't be assigned via globalgroupmembership
[00:57:58] afaics on the ext:MoodBar at mediawiki
[00:58:06] probably missing an AvailableRights entry. is there a task for that?
[00:58:34] but the question is if that "deletion" means OS or not
[00:58:39] nope afaik
[00:58:49] there's a Task to get rid of the extension though
[00:59:20] what kind of thing are you trying to suppress?
[00:59:31] They'll send me the whole data in need to be hidden later
[00:59:59] I'm not sure about the word. Does "doxing" exist?
[01:00:07] Yes.
[01:00:09] yes [01:00:20] * Krenair sighs [01:00:23] so you can imagine [01:01:43] okay, if you can't use any form of deletion on it, talk to James A [01:05:05] I'll have to I'm afraid [01:08:57] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:11:27] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [01:18:39] (03CR) 10Eevans: [C: 031] "One question (in-lined), but not a blocker." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326862 (https://phabricator.wikimedia.org/T150964) (owner: 10Filippo Giunchedi) [01:34:25] !log redacted nlWiki moodbar comment via DB [01:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:57] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:49:09] (03PS1) 10Reedy: Disable MoodBar on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326875 (https://phabricator.wikimedia.org/T131340) [01:50:38] (03PS4) 10Reedy: De-deploy the MoodBar extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [01:51:09] (03CR) 10Reedy: [C: 032] Disable MoodBar on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326875 (https://phabricator.wikimedia.org/T131340) (owner: 10Reedy) [01:51:43] (03Merged) 10jenkins-bot: Disable MoodBar on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326875 (https://phabricator.wikimedia.org/T131340) (owner: 10Reedy) [01:52:58] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:54:01] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Disable moodbar on nlwiki T131340 (duration: 00m 45s) [01:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:14] T131340: De-deploy MoodBar from WMF wikis - https://phabricator.wikimedia.org/T131340 [01:55:03] (03PS5) 10Reedy: Undeploy the MoodBar extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [01:55:08] (03CR) 10Reedy: [C: 032] Undeploy the MoodBar extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [01:55:55] (03Merged) 10jenkins-bot: Undeploy the MoodBar extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [01:56:47] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:57:48] !log reedy@tin Synchronized wmf-config: Remove all MoodBar config T131340 (duration: 00m 47s) [01:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:57] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [02:24:58] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:32:46] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 10 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2867141 (10GWicke) Considering the upcoming deployment freezes and relatively low priority, the start of the roll-out is looking like January at this point. [02:32:57] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:34:57] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:01:47] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:01:57] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:02:57] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:07:07] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:08:47] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:20:41] (03PS1) 10Andrew Bogott: Keystone: Incorrect role assignment is critical [puppet] - 10https://gerrit.wikimedia.org/r/326887 (https://phabricator.wikimedia.org/T152708) [03:20:43] (03PS1) 10Andrew Bogott: Keystone: add existence checks for critical projects [puppet] - 10https://gerrit.wikimedia.org/r/326888 (https://phabricator.wikimedia.org/T152708) [03:25:53] (03PS2) 10Andrew Bogott: Keystone: add existence checks for critical projects [puppet] - 10https://gerrit.wikimedia.org/r/326888 (https://phabricator.wikimedia.org/T152708) [03:25:55] (03CR) 10Andrew Bogott: [C: 032] Keystone: Incorrect role assignment is critical [puppet] - 10https://gerrit.wikimedia.org/r/326887 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [03:28:07] (03PS3) 10Andrew Bogott: Keystone: add existence checks for critical projects [puppet] - 10https://gerrit.wikimedia.org/r/326888 (https://phabricator.wikimedia.org/T152708) [03:30:50] RECOVERY - puppet last run on mw1244 
is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [03:32:57] (03CR) 10Andrew Bogott: [C: 032] Keystone: add existence checks for critical projects [puppet] - 10https://gerrit.wikimedia.org/r/326888 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [03:35:09] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [03:36:44] (03PS1) 10Andrew Bogott: Keystone: Fix misnamed icinga check [puppet] - 10https://gerrit.wikimedia.org/r/326889 [03:36:48] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [03:37:06] (03PS2) 10Andrew Bogott: Keystone: Fix misnamed icinga check [puppet] - 10https://gerrit.wikimedia.org/r/326889 (https://phabricator.wikimedia.org/T152708) [03:39:45] (03PS3) 10Andrew Bogott: Keystone: Fix misnamed icinga check [puppet] - 10https://gerrit.wikimedia.org/r/326889 (https://phabricator.wikimedia.org/T152708) [03:41:41] (03CR) 10Andrew Bogott: [C: 032] Keystone: Fix misnamed icinga check [puppet] - 10https://gerrit.wikimedia.org/r/326889 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [03:42:58] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:53:56] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:09:33] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=588.50 Read Requests/Sec=245.70 Write Requests/Sec=2.10 KBytes Read/Sec=31412.80 KBytes_Written/Sec=736.40 [04:09:54] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:21:33] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=62.10 Read Requests/Sec=0.00 Write Requests/Sec=23.80 KBytes Read/Sec=0.00 KBytes_Written/Sec=782.00 [04:22:53] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [04:29:43] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [04:32:43] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [04:44:35] (03PS1) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [05:01:43] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [05:04:43] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [05:12:33] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:33:54] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:41:33] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [05:55:43] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [05:56:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4548153 keys, up 42 days 21 hours - replication_delay is 0 [06:01:53] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:03:33] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 686.32 seconds [06:32:46] 06Operations, 06Operations-Software-Development: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#2867244 (10Joe) Please note there are a couple of patches that we should merge that would change how the syncer (which is admittedly very hacky at the moment) works, I'd rather work on tho... [06:33:43] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [06:53:13] PROBLEM - Disk space on elastic1032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [07:00:34] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:06:23] !log Deploy alter table db1049 (master) dewiki.revision - https://phabricator.wikimedia.org/T148967 [07:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:43] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 251.24 seconds [07:47:49] !log Stop MySQL db2048 for maintenance - T149553 [07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:01] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [07:48:47] 06Operations, 06Operations-Software-Development, 15User-Joe: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#2867307 (10Joe) [07:56:22] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2867333 (10Joe) [08:02:34] 06Operations, 10Traffic, 07HHVM, 15User-Joe, 15User-mobrovac: Enable TLS termination on the MediaWiki clusters - https://phabricator.wikimedia.org/T153042#2867337 (10Joe) [08:06:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326898 (https://phabricator.wikimedia.org/T151552) [08:08:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326898 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [08:08:17] (03PS2) 10Jcrespo: mariadb: Depool db1051 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326493 (https://phabricator.wikimedia.org/T69223) [08:08:41] (03Merged) 10jenkins-bot: 
db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326898 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [08:11:30] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2067 - T151552 (duration: 02m 19s) [08:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:43] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [08:12:08] !log Stop replication on db2067 for maintenance - T151552 [08:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:11] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1051 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326493 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [08:15:47] (03Merged) 10jenkins-bot: mariadb: Depool db1051 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326493 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [08:18:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 (duration: 00m 54s) [08:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:20] !log elasticsearch1032 deleting production-search-eqiad.log.1 [08:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:21] !log elastic1032 truncating production-search-eqiad.log [08:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:13] RECOVERY - Disk space on elastic1032 is OK: DISK OK [08:30:06] !log alter table on db1051, db1057 T69223 [08:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:18] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [08:33:20] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - 
https://phabricator.wikimedia.org/T152761#2867409 (10jcrespo) @kaldari What is the status. Has the script finished? Is it running still? This is to make the maintenance win... [08:33:55] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1051 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 [08:36:35] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2867427 (10kaldari) @jcrespo: The script is still running, but I expect it to finish by the end of the window (10 hours from now). [08:39:07] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2867428 (10jcrespo) Good. [08:39:19] (03CR) 10Gehel: tilerator: deploy config with scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [08:40:22] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: elastic2020.codfw.wmnet (tags: ['dc=codfw', 'cluster=elasticsearch', 'service=elasticsearch']) [08:40:29] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: elastic2020.codfw.wmnet (tags: ['dc=codfw', 'cluster=elasticsearch', 'service=elasticsearch-ssl']) [08:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:44] !log depool elastic2020, T149006 [08:40:46] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1051 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 (owner: 10Jcrespo) [08:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:54] T149006: elastic2020 is powered off and does not want to restart - 
https://phabricator.wikimedia.org/T149006 [08:41:36] (03CR) 10Jcrespo: [C: 04-2] "Maintenance in progress." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 (owner: 10Jcrespo) [08:43:30] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2867449 (10akosiaris) Depooled and powered off. @Papaul server is ready for maintenance. [08:44:53] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [08:46:06] <_joe_> down again [08:46:18] <_joe_> heh [08:46:46] _joe_: it's planned I think this time [08:46:49] I powered it off [08:47:03] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:47:13] papaul has maintenance for today 9:30 am local time [08:48:14] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2867464 (10elukey) Adding an info about the varnishkafka instance: we could filter 5xx Response statues from all the varnish frontends and send them directly to Kafka, and the consume them as we wi... [08:52:15] <_joe_> ok [08:52:44] like yesterday we will probably have some icinga alerts flapping (regarding latencies) on elastic@codfw, we're a bit short with 23 nodes [08:54:02] or maybe just during rebalancing, idk... we'll see [09:03:53] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] [09:04:46] we also have the added latency of cross DC, but the thresholds we use are the same for eqiad and codfw... 
I might want to change that
[09:04:53] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0]
[09:05:24] !log restarting and upgrading db1051
[09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:53] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [2000.0]
[09:12:21] gehel: would it make sense to disable this check? ^, at least the time needed for the cluster to rebalance?
[09:13:14] dcausse: done
[09:13:20] thanks!
[09:14:15] dcausse: we actually already have a different check for codfw, but those latency checks are not very stable, especially if we want them to work both when there is traffic on the cluster and when there isnt...
[09:15:13] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:15:47] gehel: at some point I wonder if could not use both all the time (split by wiki or some other criteria)
[09:17:08] dcausse: if I understood correctly, the general idea for multi-DC is to keep codfw as a failover solution, or for apps that can be active-active, to server the traffic from codfw to codfw
[09:17:32] dcausse: but switching the traffic from eqiad to codfw as we do is an abuse of the system
[09:17:43] oh ok
[09:17:47] * gehel would be more comfortable to have traffic in both DC all the time
[09:18:25] I find it extremely convenient to have codfw where we can serve traffic from time to time
[09:18:27] dcausse: once we'll have mediawiki active-active, then we'll have traffic on both elasticsearch clusters (but that's not for tomorrow)
[09:18:40] ok
[09:19:03] dcausse: yes it is! But the idea is that each DC should be self sufficient, including for maintenance operations.
[09:19:15] sure
[09:19:52] note that we do use codfw as a convenience, but most (if not all) operations we do could be done in place in eqiad only
[09:22:08] gehel: yes in theory, but reindexing often induces bad latencies and we have no easy way to activate specific mw features based on the index state. multi-DC was just extremely in this case :)
[09:22:26] *extremely convenient
[09:22:34] (CR) Alexandros Kosiaris: [C: -1] "Mostly looks fine, inline comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: Filippo Giunchedi)
[09:23:53] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:27:03] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:29:03] (PS1) Elukey: Set systemd dependency correctly for vk statsv/el instances [puppet] - https://gerrit.wikimedia.org/r/326904
[09:35:22] (Abandoned) Nikerabbit: Set valid content language for Norwegian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: Nikerabbit)
[09:39:17] (CR) Elukey: "https://puppet-compiler.wmflabs.org/4857/cp4018.ulsfo.wmnet/ looks good, the only thing that I don't get is why the varnish_name set to $:" [puppet] - https://gerrit.wikimedia.org/r/326904 (owner: Elukey)
[09:39:53] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:41:58] !log Updating MediaWiki Jenkins jobs to support injecting skin dependencies T151593 [09:42:10] poor morebot [09:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:12] T151593: Add support for skin and extension dependencies in new skin unit test - https://phabricator.wikimedia.org/T151593 [09:47:53] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0] [09:51:53] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:53:16] (03PS2) 10Elukey: Set systemd dependency correctly for vk statsv/el instances [puppet] - 10https://gerrit.wikimedia.org/r/326904 [09:56:01] (03PS2) 10Jcrespo: Repool db1051 with low load after maintenance and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 [09:56:03] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:56:58] (03CR) 10Elukey: "Ok solved the mistery, the value was overridden in cache::text. I think that sane defaults are better and less misleading (override only i" [puppet] - 10https://gerrit.wikimedia.org/r/326904 (owner: 10Elukey) [09:58:00] (03Abandoned) 10Amire80: Split sql to sql and sqlhost [puppet] - 10https://gerrit.wikimedia.org/r/300862 (https://phabricator.wikimedia.org/T141255) (owner: 10Amire80) [09:58:20] (03PS3) 10Elukey: Set systemd dependency correctly for vk statsv/el instances [puppet] - 10https://gerrit.wikimedia.org/r/326904 [09:58:44] (03CR) 10Marostegui: [C: 031] "looks good for when the maintenance is done!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 (owner: 10Jcrespo) [10:00:33] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add puppet-wildcardsign, small fixes to puppet-ecdsacert [puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) [10:02:11] (03PS1) 10Jcrespo: mariadb: Depool db1083 for maintenance and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326912 (https://phabricator.wikimedia.org/T69223) [10:06:53] (03PS1) 10Jcrespo: Point analytics slaves to the right hosts [dns] - 10https://gerrit.wikimedia.org/r/326913 [10:08:42] (03PS2) 10Jcrespo: Point analytics slaves to the right hosts [dns] - 10https://gerrit.wikimedia.org/r/326913 [10:08:53] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:10:04] jynus: is --^ due to some work that you are doing or a pre-existing issue that you are fixing? I am wondering what kind of impact it could have to people already using those domains [10:10:22] (if any, I am really ignorant about this part) [10:11:01] pre-existing issue that you are fixing [10:11:20] x1 analytics slave was pointing to the master [10:11:40] the other should point to db1047, to offload dbstore1002 when possible [10:12:26] (03CR) 10Marostegui: [C: 031] "I don't have context on why the change is needed (but I can guess why :-) ), the change itself looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/326913 (owner: 10Jcrespo) [10:15:20] ^ i guessed right after reading what jaime said :) [10:15:38] 06Operations, 10Wikimedia-Site-requests, 07Bengali-Sites: Create a new wiki for Wikimedia Bangladesh - https://phabricator.wikimedia.org/T33096#2867710 (10MarcoAurelio) [10:17:59] (03CR) 10Jcrespo: [C: 032] Repool db1051 with low load after maintenance and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 (owner: 10Jcrespo) [10:18:36] (03Merged) 10jenkins-bot: Repool db1051 with low load after maintenance and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326900 (owner: 10Jcrespo) [10:20:06] (03CR) 10Volans: [C: 031] "LGTM. The change is hacky, but is one of the cleanest way to allow this in Puppet code and it's used on a separate script that has to be e" [puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [10:20:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 with low load (duration: 00m 47s) [10:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:09] (03PS2) 10Giuseppe Lavagetto: puppetmaster: add puppet-wildcardsign, small fixes to puppet-ecdsacert [puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) [10:35:11] (03CR) 10Ema: [C: 031] "Great catch, the services were binding to varnish.service instead of varnish-frontend.service and thus were restarted by the weekly backen" [puppet] - 10https://gerrit.wikimedia.org/r/326904 (owner: 10Elukey) [10:36:03] (03CR) 10Elukey: [C: 032] Set systemd dependency correctly for vk statsv/el instances [puppet] - 10https://gerrit.wikimedia.org/r/326904 (owner: 10Elukey) [10:37:01] (03CR) 10Volans: [C: 031] "LGTM, a bit cleaner." 
[puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [10:38:08] (03PS3) 10Giuseppe Lavagetto: puppetmaster: add puppet-wildcardsign, small fixes to puppet-ecdsacert [puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) [10:38:30] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1083 for maintenance and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326912 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [10:39:11] (03Merged) 10jenkins-bot: mariadb: Depool db1083 for maintenance and reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326912 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [10:40:18] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add puppet-wildcardsign, small fixes to puppet-ecdsacert [puppet] - 10https://gerrit.wikimedia.org/r/326910 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [10:41:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 (duration: 00m 45s) [10:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:56] (03PS5) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [10:46:57] (03PS1) 10Jcrespo: Pool db1051 with 100% load after warmup; remove db1052 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326920 [10:47:25] zhuyifei1999_ is tweaking storage for its video converstion service, using nfs. 
I wonder if it would be possible/useful to mount labstore1003.eqiad.wmnet:/scratch on terbium, to process server side uploads, especially as currently we have requests for 2 to 30 Gb of files [10:48:11] * zhuyifei1999_ confirms ^ [10:48:41] !log gehel@tin Starting deploy [tilerator/deploy@2d62722]: (no message) [10:48:50] !log gehel@tin Finished deploy [tilerator/deploy@2d62722]: (no message) (duration: 00m 09s) [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:03] !log gehel@tin Starting deploy [tilerator/deploy@2d62722]: (no message) [10:54:08] !log gehel@tin Finished deploy [tilerator/deploy@2d62722]: (no message) (duration: 00m 05s) [10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:21] (03CR) 10Alexandros Kosiaris: [C: 032] Trending Edits: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/326527 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [10:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:26] (03PS2) 10Alexandros Kosiaris: Trending Edits: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/326527 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [10:54:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Trending Edits: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/326527 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [10:55:00] !log gehel@tin Starting deploy [tilerator/deploy@2d62722]: (no message) [10:55:05] !log gehel@tin Finished deploy [tilerator/deploy@2d62722]: (no message) (duration: 00m 05s) [10:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:49] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [10:59:09] PROBLEM - trendingedits endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=6699): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:00:09] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:01:09] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=6699): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:01:20] (03CR) 10Elukey: "Looks fine to me, but Andrew have surely more context to +1 :)" [dns] - 10https://gerrit.wikimedia.org/r/326913 (owner: 10Jcrespo) [11:01:59] (03PS2) 10Elukey: Remove no longer needed statistics::migration role [puppet] - 10https://gerrit.wikimedia.org/r/326489 (owner: 10Ottomata) [11:02:11] !log gehel@mira Starting deploy [tilerator/deploy@2d62722]: (no message) [11:02:18] !log gehel@mira Finished deploy [tilerator/deploy@2d62722]: (no message) (duration: 00m 08s) [11:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:50] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:02:51] !log alter table on db1083 T69223 [11:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:02] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [11:06:00] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:06:17] * akosiaris is already aware of scb trending-edits failures [11:06:20] scap issues [11:07:10] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:07:59] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:07:59] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:08:44] (03CR) 10Elukey: [C: 032] Remove no longer needed statistics::migration role [puppet] - 10https://gerrit.wikimedia.org/r/326489 (owner: 10Ottomata) [11:10:16] 06Operations, 10DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1737574 (10TTO) Do these tables remain on any wikis? From T132837 it seems like at least the `hitcounter` ones have been delet... 
[11:10:21] !log akosiaris@tin Starting deploy [trending-edits/deploy@758357d]: (no message) [11:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:46] !log akosiaris@tin Finished deploy [trending-edits/deploy@758357d]: (no message) (duration: 00m 25s) [11:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:49] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:17:09] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [11:18:50] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:18:50] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 45 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:18:59] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:18:59] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:19:09] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [11:19:09] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:19:09] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:19:39] 06Operations, 10DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1737574 (10Marostegui) `hitcounter` tables were deleted indeed. I will check the other ones. 
[11:19:49] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:21:29] !log akosiaris@tin Starting deploy [trending-edits/deploy@758357d]: (no message) [11:21:39] !log akosiaris@tin Finished deploy [trending-edits/deploy@758357d]: (no message) (duration: 00m 10s) [11:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:56] and trending edits fixed [11:25:10] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [11:28:43] (03CR) 10Elukey: [V: 032 C: 032] Initial debianization [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/325568 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [11:30:52] !log restart zotero on sca1003, sca1004, OOM issues [11:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:14] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [11:39:05] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[trending-edits/deploy] [11:43:04] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:54:14] RECOVERY - trendingedits endpoints health on scb1002 is OK: All endpoints are healthy [11:55:14] (03PS1) 10Urbanecm: Fixup for T152490, allow nowikimedia crats to manipulate with translationadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) [11:57:39] (03PS2) 10Urbanecm: Fixup for T152490, allow nowikimedia sysops to manipulate with translationadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) [12:00:09] (03PS3) 10Dereckson: Allow bureaucrats to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [12:00:47] (03PS4) 10Dereckson: Allow bureaucrats to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [12:02:27] (03PS5) 10Urbanecm: Allow bureaucrats to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) [12:03:11] (03PS6) 10Urbanecm: Allow sysops to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) [12:07:02] grafana was updated [12:07:12] and I think it messes up with 0 values [12:07:26] they show as NaN instead of 0 [12:08:22] (03CR) 10Dereckson: [C: 031] "PS2: follow-up changes are better browed by commit, especially as Phabricator Diffusion will cross reference them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [12:09:04] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:04] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:37] !log restart and upgrade of db1083 T69223 [12:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [12:21:41] (03CR) 10Jcrespo: [C: 032] Pool db1051 with 100% load after warmup; remove db1052 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326920 (owner: 10Jcrespo) [12:22:37] (03PS1) 10Marostegui: db-eqiad: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326926 (https://phabricator.wikimedia.org/T148967) [12:23:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1051 with 100% load after warmup (duration: 00m 47s) [12:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:06] (03PS2) 10Marostegui: db-eqiad: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326926 (https://phabricator.wikimedia.org/T148967) [12:24:26] can you hand me db1087 when you are finished, before repooling? 
[12:24:33] sure thing [12:24:37] I will ping you [12:24:41] thanks [12:24:59] (03CR) 10Marostegui: [C: 032] db-eqiad: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326926 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [12:25:40] (03Merged) 10jenkins-bot: db-eqiad: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326926 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [12:26:44] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.23 seconds [12:26:55] that is ok [12:27:10] that is the ongoing maintenance script from community tech [12:27:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T148967 (duration: 00m 46s) [12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:29] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [12:27:42] will downtime it for some hours [12:27:52] !log Deploy alter table db1087 dewiki.revision - T148967 [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:15] (03PS7) 10Urbanecm: Allow sysops to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) [12:28:53] (03CR) 10Urbanecm: "Thanks for your help and for the previous patch!"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [12:31:15] jouncebot: next [12:31:15] In 1 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1400) [12:32:12] (03PS1) 10Marostegui: db-codfw.php: Repool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326927 (https://phabricator.wikimedia.org/T151552) [12:32:35] (03CR) 10Marostegui: [C: 04-2] "Still needs to catch up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326927 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [12:33:24] PROBLEM - Host labsdb1007 is DOWN: PING CRITICAL - Packet loss = 100% [12:33:26] (03CR) 10Jcrespo: [C: 031] "Will handle later." [puppet] - 10https://gerrit.wikimedia.org/r/325509 (owner: 10Tim Landscheidt) [12:37:30] 06Operations, 06Labs, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2868444 (10zhuyifei1999) [12:37:49] 06Operations, 10DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#2868461 (10Marostegui) What Jaime posted on T115982#1807646 is still the situation we have with the exception of the `hitcount... [12:38:04] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:40:03] (03PS1) 10Jcrespo: Repool db1083 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326928 [12:40:43] (03CR) 10Jcrespo: [C: 04-2] "Waiting for warmup." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326928 (owner: 10Jcrespo) [12:41:04] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:41:50] !log alter table on db1087 T69223 [12:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:03] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [12:42:51] 06Operations, 10Ops-Access-Requests, 10Analytics, 06Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868470 (10mobrovac) [12:42:53] jynus: once db1087 is done I will take db1082, do you need it too? [12:43:49] let me see [12:44:00] yes [12:44:16] you are doing schem changes on wikidata.revision, right? [12:44:22] ok, once you pool db1087 I will depool db1082 and let you know once I am done [12:44:35] no, dewiki (one index missing in a few servers) [12:44:38] ah [12:44:38] dewiki.revision [12:44:41] ok ok [12:45:07] 06Operations, 06Labs, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2868471 (10Dereckson) Current upload volume is 2 to 30 Gb per week. A dedicated volume for v2c and a read only mount would decrease labs/pro... 
[12:45:17] I am going to have lunch now, marostegui [12:45:24] if you do not want to wait for me [12:46:12] you will be able to see here when the alter finishes: https://grafana-admin.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1087 [12:46:58] ok :) [12:47:03] will keep an eye [12:47:37] (03CR) 10Marostegui: [C: 031] "Server caught up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326927 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [12:47:50] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326927 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [12:48:31] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326927 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [12:50:14] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2064 - T151552 (duration: 00m 45s) [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:27] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [12:52:05] (03PS1) 10Marostegui: Revert "db-eqiad: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326929 [12:52:59] jynus: your alter is done `page_lang` varbinary(35) DEFAULT NULL, [12:53:46] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326929 (owner: 10Marostegui) [12:54:17] (03Merged) 10jenkins-bot: Revert "db-eqiad: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326929 (owner: 10Marostegui) [12:56:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T148967 and T69223 (duration: 00m 47s) [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:14] T148967: Fix PK on S5 
dewiki.revision - https://phabricator.wikimedia.org/T148967 [12:56:14] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [12:58:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326931 (https://phabricator.wikimedia.org/T148967) [13:00:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326931 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [13:01:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326931 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [13:02:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T148967 and T69223 (duration: 00m 45s) [13:02:31] 06Operations, 10Ops-Access-Requests, 10Analytics, 06Services (watching): Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868534 (10mobrovac) a:05Nuria>03None @Pchelolo reading the logs works just fine for me: ``` mobrovac@kafka1001:~$ tail /srv/log/eventl... [13:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:43] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:02:43] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [13:03:00] !log Deploy alter table db1082 dewiki.revision - T148967 [13:03:04] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:33] 06Operations, 10Analytics, 10EventBus, 06Services (watching), 15User-mobrovac: Services team should have access to EventBus logs - https://phabricator.wikimedia.org/T153028#2868690 (10mobrovac) a:03mobrovac Oh, you may mean syslog logs. 
We need to output them just as we do for SCB services. [13:12:23] lag on db1087 [13:12:51] false alarm [13:13:09] I hadn't refreshed grafana [13:14:55] !log alter table on db1082 - wikidatawiki T69223 [13:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [13:19:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "looks fine, apart from the IP already being allocated." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [13:21:16] (03PS1) 10Alexandros Kosiaris: Allocate LVS IPs for trendingedits service [dns] - 10https://gerrit.wikimedia.org/r/326933 (https://phabricator.wikimedia.org/T150043) [13:21:47] (03CR) 10Jcrespo: [C: 032] Repool db1083 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326928 (owner: 10Jcrespo) [13:21:54] (03PS2) 10Jcrespo: Repool db1083 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326928 [13:23:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326934 [13:24:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326934 (owner: 10Marostegui) [13:25:09] (03Draft1) 10Paladox: Up the size for storage.mysql-engine.max-size to 20mb in bytes [puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) [13:25:11] (03Draft2) 10Paladox: Up the size for storage.mysql-engine.max-size to 20mb in bytes [puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) [13:25:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 with low load after restart (duration: 00m 44s) [13:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:26:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326934 [13:27:02] 06Operations, 06Labs, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2868728 (10zhuyifei1999) [13:27:33] (03PS2) 10Alexandros Kosiaris: Allocate LVS IPs for trendingedits service [dns] - 10https://gerrit.wikimedia.org/r/326933 (https://phabricator.wikimedia.org/T150043) [13:28:06] (03PS2) 10Mobrovac: Trending Edits: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) [13:28:21] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) (owner: 10Alexandros Kosiaris) [13:28:48] did labsdb1007 go down while I was out? [13:28:51] 06Operations, 06Labs, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2868444 (10zhuyifei1999) Security issue found; don't do this yet.
[13:29:02] (03CR) 10Mobrovac: Trending Edits: LVS configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [13:29:13] (03CR) 10Alexandros Kosiaris: [C: 032] Allocate LVS IPs for trendingedits service [dns] - 10https://gerrit.wikimedia.org/r/326933 (https://phabricator.wikimedia.org/T150043) (owner: 10Alexandros Kosiaris) [13:29:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 - T148967 and T69223 (duration: 00m 45s) [13:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:29:42] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [13:31:07] (03PS2) 10Alexandros Kosiaris: Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) [13:31:29] strange, it looks up on serial [13:31:54] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.97 seconds [13:31:54] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:32:04] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:32:19] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) (owner: 10Alexandros Kosiaris) [13:32:23] (03PS3) 10Alexandros Kosiaris: Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) [13:32:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) (owner: 10Alexandros Kosiaris) [13:33:09] (03PS1) 10Marostegui: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326935 (https://phabricator.wikimedia.org/T148967) [13:35:36] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me, not sure how this slipped through the cracks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [13:36:46] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326935 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [13:36:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326935 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [13:37:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326935 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [13:38:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 - T148967 (duration: 00m 45s) [13:39:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:07] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:41:09] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2868758 (10mark) [13:41:37] !log Deploy alter table db1045 dewiki.revision - T148967 [13:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:19] (03CR) 10Jcrespo: [C: 04-1] "Do not store files on MySQL." [puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: 10Paladox) [13:43:11] (03CR) 10Alexandros Kosiaris: [C: 032] "Unless my memory deceives me, this was discussed during the ops meeting yesterday and while it was mentioned that it slipped through the c" [puppet] - 10https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [13:45:51] (03PS2) 10Alexandros Kosiaris: PDF Render: Create the service's admin group [puppet] - 10https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [13:50:53] !log gehel@mira Starting deploy [tilerator/deploy@2d62722]: (no message) [13:50:58] !log gehel@mira Finished deploy [tilerator/deploy@2d62722]: (no message) (duration: 00m 05s) [13:51:02] jouncebot: next [13:51:02] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1400) [13:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:41] hashar: there are a couple of patches for eu swat today, what's the plan? you? me? somebody else? 
;) [13:54:02] (03PS8) 10Zfilipin: Allow sysops to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [13:56:34] (03CR) 10Alexandros Kosiaris: [C: 032] PDF Render: Create the service's admin group [puppet] - 10https://gerrit.wikimedia.org/r/323195 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [13:58:55] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1400). [14:00:04] dcausse and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:21] o/ [14:01:04] Present [14:02:50] o/ [14:03:12] dcausse: wanna try deploying the change yourself ? [14:03:29] hashar: if you guide me sure [14:03:33] (03CR) 10Hashar: [C: 032] Allow sysops to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [14:03:36] lets do that :) [14:03:41] thanks :) [14:03:41] first step: CR+2 da patch :} [14:03:55] I don't have that perms :( [14:03:58] oh man [14:04:04] RECOVERY - Host labsdb1007 is UP: PING WARNING - Packet loss = 80%, RTA = 0.24 ms [14:04:12] like you can't even ssh to tin ? 
:( [14:04:18] (03Merged) 10jenkins-bot: Allow sysops to manage translationadmin group on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326924 (https://phabricator.wikimedia.org/T152490) (owner: 10Urbanecm) [14:04:31] hashar: I have all the rights needed to deploy +2 on mw-config and wmf branches [14:04:51] err: I have all the rights needed to deploy *except* +2 on mw-config and wmf branches [14:04:57] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2868835 (10akosiaris) p:05Triage>03Normal [14:05:00] that is rediculous [14:05:06] c'est ridicule [14:05:08] I know :/ [14:05:12] je sais :\ [14:06:06] !log Added dcausse to Gerrit group "mediawiki" effectively granting Code-Review +2 on every mediawiki/* repositories [14:06:07] fixed [14:06:12] \o/ [14:06:14] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:06:14] thanks :) [14:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:33] that gives you well [14:06:42] CR+2 everywhere under mediawiki/ on all branches etc [14:06:45] 06Operations, 10Analytics, 10EventBus, 06Services (watching), 15User-mobrovac: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868847 (10mobrovac) p:05Triage>03Normal [14:07:04] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:07:06] re [14:07:15] (03PS1) 10Alexandros Kosiaris: Fix chlorine, argon IP addresses [dns] - 10https://gerrit.wikimedia.org/r/326940 [14:07:19] 06Operations, 10Analytics, 10EventBus, 06Services (doing), 15User-mobrovac: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2866893 (10mobrovac) [14:07:22] zeljkof: are you doing the mwconfig change for Urbanecm or should I ? [14:08:01] (03CR) 10Alexandros Kosiaris: [C: 032] Fix chlorine, argon IP addresses [dns] - 10https://gerrit.wikimedia.org/r/326940 (owner: 10Alexandros Kosiaris) [14:08:24] PROBLEM - Host labsdb1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:28] dcausse: so the deploy step is eventually: [14:08:30] hashar: go ahead [14:08:33] a) wait for change to merge [14:08:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326941 [14:08:54] hashar: I'm in node/npm/grunt hell [14:09:04] 🔥 [14:09:04] RECOVERY - Host labsdb1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:09:06] b) head to deployment.eqiad.wmnet and in /srv/mediawiki-staging/php-1.29.0-wmf.5 grab the latest change (git fetch / verify what is going to be added / git rebase ) [14:09:19] c) git submodule update extensions/CirrusSearch [14:09:40] d) ssh to mwdebug1002.eqiad.wmnet | cd /srv/mediawiki-staging | scap pull [14:09:48] e) test with the X-Wikimedia-Debug header [14:10:19] if all good deploy from tin :} ( scap sync-dir php-1.29.0-wmf.5/extensions/ElasticSsearch "some message" ) [14:10:36] zeljkof: gotta redo my whole apache/php5 config :/ [14:10:55] hashar: install mw-vagrant ;) [14:10:57] (03PS1) 10Mobrovac: EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) [14:10:57] hashar: scap pull on mwdebug will pull everything right? 
not only cirrus [14:11:56] Urbanecm: deploying [14:12:05] hashar, I'm ready [14:12:14] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:12:16] dcausse: yeah that will rsync what ever is on the deployment server under /srv/mediawiki-staging [14:12:17] (03CR) 10Alexandros Kosiaris: [C: 032] Trending Edits: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [14:12:22] (03PS3) 10Alexandros Kosiaris: Trending Edits: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [14:12:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Trending Edits: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [14:12:34] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Allow sysops to manage translationadmin group on no.wikimedia - T152490 (duration: 00m 45s) [14:12:37] dcausse: usually takes only a few seconds with rsync [14:12:42] ok [14:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:45] T152490: Enable the Translate extension on Wikimedia Norge's wiki - https://phabricator.wikimedia.org/T152490 [14:13:07] the whole idea is to use the deployment server as a preparation area where you put the code you will want to eventually push (hence 'staging') [14:13:28] running scap pull on mwdebug1002 effectively deploy that staged code on that server so we can test [14:13:41] hashar: so back to what happened yesterday, you sent multiple patches to mwdebug at the same time right? 
[14:13:42] and if all fine, scap sync on the deployment server will promote the whole staging area on the whole cluster [14:13:52] yeah that one confused people [14:13:59] !log gehel@tin Starting deploy [tilerator/deploy@661f7ef]: (no message) [14:14:00] should refrain from doing it unless Iam the sole deployer [14:14:00] ok makes sense [14:14:04] the idea is to mass CR+2 everything [14:14:10] then on tin rebase patch by patch [14:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] ah I see [14:14:26] !log gehel@tin Finished deploy [tilerator/deploy@661f7ef]: (no message) (duration: 00m 27s) [14:14:32] so you don't wait on jenkins [14:14:33] so I get something like: (master) -> Patch A --> patch B --> patch C (origin/master) [14:14:35] I can: [14:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:47] git rebase patch A ; pull on debug server; test; scap sync [14:14:48] then [14:14:53] git rebase patch B .... etc [14:15:18] and since Jenkins / Zuul process changes in parallel they all get merged roughly in the same amount of time it takes to merge a single change. So less waiting [14:15:27] makes sense [14:15:50] but most probably I should not do that when there is another one doing a deploy simultanously [14:16:00] it makes it hard to sync who is doing what and what is going to be pushed [14:16:08] indeed [14:16:24] (03PS1) 10Jcrespo: mariadb: Pool db1083 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326945 [14:17:34] (03CR) 10Jcrespo: [C: 04-2] "Wait for full buffer pool warmup." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326945 (owner: 10Jcrespo) [14:17:35] grmblbl Zend PHP5 job :( [14:17:48] dcausse: so your patch got merged [14:17:49] merged [14:17:56] will want to fetch it on tin [14:18:00] review what is going to change [14:18:09] (03PS2) 10Mobrovac: EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) [14:18:11] then rebase the extensions directory / submodule update [14:18:14] !log gehel@tin Starting deploy [tilerator/deploy@661f7ef]: (no message) [14:18:18] !log gehel@tin Finished deploy [tilerator/deploy@661f7ef]: (no message) (duration: 00m 03s) [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:37] the repo is on tin.eqiad.wmnet in cd /srv/mediawiki-staging/php-1.29.0-wmf.5/extensions [14:18:59] sorry /srv/mediawiki-staging/php-1.29.0-wmf.5/ [14:19:12] which is a local checkout of mediawiki/core.git @ branch wmf/1.29.0-wmf.5 [14:19:24] that branch has the deployed extensions registered as submodules [14:19:37] so you want to fetch the latest changes from mediawiki/core with git fetch [14:19:55] hashar, am I required to do something? [14:20:03] Urbanecm: I have pushed you change :} [14:20:11] Urbanecm: and closed the task \O/ [14:20:15] hashar, thanks! [14:21:33] (03PS6) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [14:22:10] dcausse: any progress? :} [14:22:33] hashar: I'm confused: I'm on /srv/mediawiki-staging/php-1.29.0-wmf.5 [14:22:51] git status reports: Your branch is ahead of 'origin/wmf/1.29.0-wmf.5' by 4 commits. 
[14:23:00] yeah that is the local patches/hacks we have [14:23:02] they are not in gerrit [14:23:06] ah [14:23:15] git log --oneline -n4 [14:23:17] will show them [14:23:22] (03CR) 10Gehel: [C: 032] tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [14:23:35] and a "git status" will show that some of the submodules have "new commits" [14:23:45] that is local patches/hacks made on the extensions [14:24:02] (03PS3) 10Mobrovac: EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) [14:24:02] so you want to git fetch [14:24:03] hashar: I did git fetch but git log HEAD..origin/wmf/1.29.0-wmf.5 reports nothing [14:24:14] then you can compare the status of the current checkout with the remote branch [14:24:14] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:24:34] grmblbl [14:25:03] yeah [14:25:10] so Gerrit synchronization is still broken [14:26:16] in theory [14:26:30] when you merge a change for mediawiki/extensions/Foobar @ wmf/1.29.0-wmf.5 [14:26:43] !log restart pybal on lvs1006, lvs1009, lvs1012 [14:26:44] gerrit is supposed to automatically update the submodule in mediawiki/core @ wmf/1.29.0-wmf.5 [14:26:47] but that got broken [14:26:52] oh I see [14:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:12] so we need to manually create a commit to update the submodule? [14:28:27] yeah [14:28:36] https://gerrit.wikimedia.org/r/326947 will fix it [14:28:45] PROBLEM - LVS HTTP IPv4 on trendingedits.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.9 and port 6699: Connection refused [14:29:15] akosiaris: is it you? [14:29:29] yes [14:29:31] sorry :-( [14:29:35] np :) [14:29:44] dcausse: can you do the submodule bump? 
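Note on the submodule bump requested above: the log does not show how the bump commit was actually produced. The sketch below is only a guess at one way to do it with ordinary git commands, run from the mediawiki/core checkout. The function name, the commit message wording and the `origin` remote name are assumptions for illustration; nothing is executed, the commands are only assembled for review.

```python
import shlex


def submodule_bump_commands(extension, branch, remote="origin"):
    """Return the git commands for a manual submodule bump (hypothetical helper).

    Assumes the current directory is the mediawiki/core checkout and that
    the extension lives under extensions/<name> as a submodule.
    """
    path = f"extensions/{extension}"
    message = f"Bump {path} to {branch} head"  # assumed wording
    return [
        # Update the submodule's own view of the remote.
        f"git -C {path} fetch {remote}",
        # Move the submodule checkout to the tip of the deployed branch.
        f"git -C {path} checkout {remote}/{branch}",
        # Record the new submodule commit in the parent repository.
        f"git add {path}",
        f"git commit -m {shlex.quote(message)}",
    ]


if __name__ == "__main__":
    for cmd in submodule_bump_commands("CirrusSearch", "wmf/1.29.0-wmf.5"):
        print(cmd)
```

The resulting commit would still have to go through Gerrit review like any other change to the wmf branch.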
[14:30:38] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [14:30:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2002.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [14:30:44] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [14:30:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [14:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=trendingedits']) [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=trendingedits']) [14:31:02] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=trendingedits']) [14:31:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=trendingedits']) [14:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] hashar: git submodule has no effect [14:32:21] beeh [14:32:25] PROBLEM - LVS HTTP IPv4 on trendingedits.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.9 and port 6699: Connection refused [14:32:36] CirrusSearch is still on an old commit [14:32:47] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/4861/" [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) (owner: 10Mobrovac) [14:33:02] https://gerrit.wikimedia.org/r/326948 should do it [14:33:48] force merged it [14:33:58] 06Operations, 10Analytics, 10EventBus, 13Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868944 (10mobrovac) The above patch fixes the permissions, but further work should be conducted here to bring the logs in a s... [14:34:11] dcausse: so now if you fetch on tin you should see the change :) [14:34:15] hashar: it's better now, git log reports your commit [14:34:36] so once git fetch [14:34:37] I do git log --oneline HEAD..HEAD@{u} [14:34:46] (compare current HEAD with the upstream branch ) [14:34:49] if happy: git rebase [14:35:01] then bump the submodule [14:35:10] git submodule update extensions/CirrusSearch [14:35:13] ok all good, I'm going pull from mwdebug [14:35:26] if there are any live hack / local patches in the extension, the submodule will be rebased [14:35:46] \O/ [14:35:56] so the process is really straightforward [14:36:01] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:36:02] until you hit random unrelated issues [14:36:11] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - trendingedits_6699 - Could not depool server scb1004.eqiad.wmnet because of too many down! 
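Recap of the deploy walkthrough above, as a rough sketch: the snippet only assembles the commands mentioned in the channel (fetch, review, rebase, submodule update, `scap pull` on the debug host, `scap sync-dir` from the deployment host) into an ordered checklist. It runs nothing. The `Step` type and the `swat_steps` function are hypothetical names, and the sync path is assumed to be the extension directory that was updated.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    host: str      # where the command is meant to be run
    command: str   # the command line, for a human to review and run


def swat_steps(branch_dir, extension, message,
               staging="/srv/mediawiki-staging",
               deploy_host="tin.eqiad.wmnet",
               debug_host="mwdebug1002.eqiad.wmnet"):
    """Build the ordered checklist for one extension change (hypothetical helper)."""
    checkout = f"{staging}/{branch_dir}"
    ext_path = f"extensions/{extension}"
    return [
        Step(deploy_host, f"cd {checkout} && git fetch"),
        # Review what is about to be pulled in before touching the checkout.
        Step(deploy_host, f"cd {checkout} && git log --oneline HEAD..HEAD@{{u}}"),
        Step(deploy_host, f"cd {checkout} && git rebase"),
        Step(deploy_host, f"cd {checkout} && git submodule update {ext_path}"),
        # Stage the code on the debug server; testing with the
        # X-Wikimedia-Debug header is a manual step between this and the sync.
        Step(debug_host, f"cd {staging} && scap pull"),
        Step(deploy_host,
             f"scap sync-dir {branch_dir}/{ext_path} {shlex.quote(message)}"),
    ]


if __name__ == "__main__":
    for step in swat_steps("php-1.29.0-wmf.5", "CirrusSearch", "some message"):
        print(f"{step.host}: {step.command}")
```

Keeping the steps as data rather than running them keeps a person in the loop between the debug-host test and the cluster-wide sync, which is the point of the staging area described above.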
[14:36:11] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - trendingedits_6699 - Could not depool server scb2004.codfw.wmnet because of too many down! [14:36:19] like deployment tooling / CI / Gerrit issue. Cluster exploding for an unrelated reason etc [14:36:36] :) [14:36:55] PROBLEM - LVS HTTP IPv4 on trendingedits.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:01] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:38:11] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - trendingedits_6699 - Could not depool server scb2003.codfw.wmnet because of too many down! [14:40:12] dcausse: we also look at the HHVM error log usually [14:40:18] !log gehel@tin Starting deploy [tilerator/deploy@2ae591a]: (no message) [14:40:23] dcausse: ssh fluorine.eqiad.wmnet then "fatalmonitor" [14:40:27] ok [14:40:29] it tails the hhvm log bucket [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:34] that catch exceptions/fatals [14:40:34] !log gehel@tin Finished deploy [tilerator/deploy@2ae591a]: (no message) (duration: 00m 16s) [14:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:24] scap pull done, testing now [14:42:48] (03PS1) 10Alexandros Kosiaris: LVS: Specify a better pybal monitoring URL [puppet] - 10https://gerrit.wikimedia.org/r/326954 [14:43:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] LVS: Specify a better pybal monitoring URL [puppet] - 10https://gerrit.wikimedia.org/r/326954 (owner: 10Alexandros Kosiaris) [14:46:21] PROBLEM - tilerator on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6534: Connection refused [14:46:41] PROBLEM - tileratorui on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6535: Connection refused [14:46:42] PROBLEM - tileratorui on maps1004 is CRITICAL: connect 
to address 10.64.48.154 and port 6535: Connection refused [14:46:51] PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused [14:46:52] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused [14:46:56] ^probably me, checking... [14:47:01] PROBLEM - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused [14:47:02] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [14:47:03] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused [14:47:04] PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused [14:47:04] PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused [14:47:04] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [14:47:05] PROBLEM - tileratorui on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6535: Connection refused [14:47:06] !log testing prometheus-apache-exporter on mw2198 [14:47:06] PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused [14:47:11] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused [14:47:11] PROBLEM - tileratorui on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6535: Connection refused [14:47:12] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused [14:47:12] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:47:12] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:47:16] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:41] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [14:47:51] PROBLEM - tileratorui on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6535: Connection refused [14:49:11] maps issue is definitely me, nothing user facing, I'm on it... [14:49:47] hashar: am I behind nginx when I test on mwdebug1002? [14:49:56] ACKNOWLEDGEMENT - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:56] ACKNOWLEDGEMENT - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:57] ACKNOWLEDGEMENT - tilerator on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:57] ACKNOWLEDGEMENT - tileratorui on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:57] ACKNOWLEDGEMENT - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:58] ACKNOWLEDGEMENT - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:49:59] ACKNOWLEDGEMENT - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:00] ACKNOWLEDGEMENT - tileratorui on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6535: Connection refused Gehel Issue migrating config to scap3 
investigating - gehel [14:50:01] ACKNOWLEDGEMENT - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:02] ACKNOWLEDGEMENT - tileratorui on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:03] ACKNOWLEDGEMENT - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:04] ACKNOWLEDGEMENT - tileratorui on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:05] ACKNOWLEDGEMENT - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:06] ACKNOWLEDGEMENT - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:07] ACKNOWLEDGEMENT - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:08] ACKNOWLEDGEMENT - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:09] ACKNOWLEDGEMENT - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:10] ACKNOWLEDGEMENT - tileratorui on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6535: Connection refused Gehel Issue migrating config to scap3 investigating - gehel [14:50:35] oops, acknowledge is just as noisy... 
sorry for that [14:51:28] hashar: sorry it takes, I'm testing timeouts :/ [14:51:33] it takes time [14:52:21] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:52:41] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [14:53:39] dcausse: it is all good :} [14:53:49] dcausse: I am idling around until you poke me :D [14:53:49] (03PS1) 10Alexandros Kosiaris: LVS: Also amend the icinga URL for trendingedits [puppet] - 10https://gerrit.wikimedia.org/r/326959 [14:53:56] hashar: I have an issue but not sure what's the cause [14:54:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] LVS: Also amend the icinga URL for trendingedits [puppet] - 10https://gerrit.wikimedia.org/r/326959 (owner: 10Alexandros Kosiaris) [14:54:42] I'm seeing an nginx 504 Gateway Time-out from time to time on mwdebug1002 [14:54:57] it's unclear to me if it will happen without X-Debug header [14:55:13] if it'll happen it means our timeout are way too high [14:55:21] I mean our backend timeouts [14:55:50] !log restbase deploy start of 0c06fb7 [14:55:58] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: add TLS local proxy [puppet] - 10https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) [14:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:23] RECOVERY - LVS HTTP IPv4 on trendingedits.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 957 bytes in 0.075 second response time [14:56:29] (03Draft2) 10Giuseppe Lavagetto: ssl: add public TLS certs for mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/326921 (https://phabricator.wikimedia.org/T153042) [14:56:43] RECOVERY - LVS HTTP IPv4 on trendingedits.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 957 bytes in 0.002 second response time [14:56:54] and we 're done [14:57:16] (03PS3) 10Giuseppe Lavagetto: ssl: add public TLS certs for mw clusters [puppet] - 
10https://gerrit.wikimedia.org/r/326921 (https://phabricator.wikimedia.org/T153042) [14:57:23] dcausse: maybe you can dig in logstash with host:mwdebug1002 ? [14:58:17] hashar: do you know why I see some nginx 504s? is nginx involved somewhere in the stack ? [14:58:24] yeah [14:58:28] to terminate the SSL connection [14:58:33] 06Operations, 10Analytics, 10EventBus, 13Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2868979 (10Ottomata) Hmmm, is `/srv/log` is the standard location for process logging? I put service output event content (er... [14:58:54] ah... so maybe our timeouts are not sane in respect to the ones used by nginx? [14:59:07] End user ----[ https ] ---> nginx (terminate ssl) ---- [ http with header X-Forwarded-For-Proto: https ] ---> LVS --> Varnish frontend --> Varnish Backend --> LVS --> MediaWiki app [14:59:10] (something like that) [14:59:20] (03PS4) 10Ottomata: EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) (owner: 10Mobrovac) [14:59:23] iirc that is because Varnish doesn't support SSL/TLS [14:59:28] ok [14:59:30] so nginx does it [14:59:39] do you know if there are some timeouts configured here? [14:59:44] most probably [15:00:01] the folks that supports 99.99% of all the engineering effort would know [15:00:12] namely #wikimedia-traffic eg bblack || ema [15:00:41] ok I'm going to revert then, I think the timeouts we use are not sane compared to the ones used by nginx [15:00:58] the idea was to display partial results and a warning to the user [15:01:01] is the timeout configurable ? 
[15:01:09] in cirrus yes [15:01:31] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2868987 (10Ottomata) > The best thing ever would be to consume the webrequest topic, filter 5xx and push them back to kafka in another topic, but we are probably not ready for this :) I think kafk... [15:02:26] (03CR) 10Giuseppe Lavagetto: [C: 032] ssl: add public TLS certs for mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/326921 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [15:03:27] !log mobrovac@tin Starting deploy [trending-edits/deploy@5d1eb88]: (no message) [15:03:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4525198 keys, up 43 days 6 hours - replication_delay is 602 [15:03:29] dcausse: what timeout values do you have ? [15:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:43] $wgCirrusSearchClientSideSearchTimeout[ 'regex' ] = 80; ? [15:03:44] elukey: lemme check [15:03:51] hashar: yes exactly [15:03:56] !log mobrovac@tin Finished deploy [trending-edits/deploy@5d1eb88]: (no message) (duration: 00m 29s) [15:04:05] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::webserver: add TLS local proxy [puppet] - 10https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:17] though there is in CirrusSearch-production.php some other value for that variable [15:04:24] 'eqiad' => $wmfDatacenter === 'eqiad' ? 5 : 10, [15:04:24] 'codfw' => $wmfDatacenter === 'codfw' ? 5 : 10, [15:04:54] (03PS1) 10Gehel: tilerator: additional configuration for scap3 based configuration [puppet] - 10https://gerrit.wikimedia.org/r/326963 (https://phabricator.wikimedia.org/T150021) [15:05:03] hashar: hm... this should be *Update* no? 
[15:05:08] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:05:23] > var_dump( $wgCirrusSearchClientSideSearchTimeout ); [15:05:23] array(2) { ["default"]=> int(40), ["regex"]=> int(80) } [15:05:26] (on enwiki) [15:06:12] yes so 80sec for insource:// queries [15:06:14] (03CR) 10Gehel: [C: 032] tilerator: additional configuration for scap3 based configuration [puppet] - 10https://gerrit.wikimedia.org/r/326963 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [15:06:24] it's probably higher than nginx defaults [15:06:32] and maybe nginx has a 60 secs timeout [15:06:43] yes I think it's somethign like that [15:07:00] maps cluster right? [15:07:13] I would refers to the god of Hermes, god of travelers and #traffic [15:07:39] hashar: ok, I'll revert, find proper timeouts, fix the config, and redeploy [15:08:00] elukey: ? [15:08:23] (03PS1) 10Giuseppe Lavagetto: Snakeoil secrets for all pools (needed for PCC) [labs/private] - 10https://gerrit.wikimedia.org/r/326964 [15:08:24] gehel: I was wondering what is the cluster that dcausse was talking about [15:08:31] text, maps, etc.. [15:08:33] elukey: search (cirrus) [15:08:36] elukey: it's elasticsearch [15:08:40] hashar: can you +2 https://gerrit.wikimedia.org/r/#/c/326962/ ? 
:) [15:08:46] dunno why I can't :( [15:08:58] elukey: I'm the one breaking maps at the moment :) [15:09:19] (03PS5) 10Ottomata: EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) (owner: 10Mobrovac) [15:09:24] (03CR) 10Ottomata: [V: 032 C: 032] EventBus Proxy: Ensure the syslog output file is readable [puppet] - 10https://gerrit.wikimedia.org/r/326942 (https://phabricator.wikimedia.org/T153028) (owner: 10Mobrovac) [15:10:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4520408 keys, up 43 days 6 hours - replication_delay is 0 [15:10:39] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Snakeoil secrets for all pools (needed for PCC) [labs/private] - 10https://gerrit.wikimedia.org/r/326964 (owner: 10Giuseppe Lavagetto) [15:10:57] (03PS1) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [15:11:37] dcausse: ok so you are talking about a cache:text request that eventually hit elastic search? [15:11:39] dcausse: ah you need another group bah [15:11:54] dcausse: try again? [15:12:13] hashar: perfect! thanks [15:12:22] I am trying to figure out what nginx config you guys are talking about [15:12:26] (03CR) 10jenkins-bot: [V: 04-1] varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:12:27] elukey: I don't really know, it hits Special:Search [15:12:28] that's all :D [15:12:30] !log Added dcausse to Gerrit group "wmf-deployment" so he can Code-Review +2 on the wmf/* branches! 
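Note on the timeout discussion above: the suspicion was that the 80 second client-side timeout for regex searches is longer than whatever nginx allows, which would explain the intermittent 504s. A small sketch of that comparison follows. The 40 and 80 second values come from the `var_dump` output quoted in the channel; the 60 second proxy timeout is only the guess made in the channel, not a confirmed setting, and the function name is hypothetical.

```python
def timeouts_exceeding_proxy(backend_timeouts, proxy_timeout):
    """Return the backend timeouts that the proxy would cut off first.

    A backend timeout at or above the proxy timeout can never fire in
    time: the proxy gives up and answers 504 before the backend can
    return partial results.
    """
    return {name: seconds
            for name, seconds in backend_timeouts.items()
            if seconds >= proxy_timeout}


# Values reported for $wgCirrusSearchClientSideSearchTimeout on enwiki.
cirrus_timeouts = {"default": 40, "regex": 80}

# Assumption: the nginx timeout guessed at in the channel, not verified.
ASSUMED_PROXY_TIMEOUT = 60

if __name__ == "__main__":
    print(timeouts_exceeding_proxy(cirrus_timeouts, ASSUMED_PROXY_TIMEOUT))
```

Under that assumed 60 second limit only the `regex` entry is flagged, which matches the symptom of 504s showing up on slow `insource:` queries.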
[15:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:56] dcausse: and you will need a submodule bump :/ [15:13:05] !log gehel@tin Starting deploy [tilerator/deploy@0fe5a1d]: (no message) [15:13:13] yes I'll do exactly the same :/ [15:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:20] !log gehel@tin Finished deploy [tilerator/deploy@0fe5a1d]: (no message) (duration: 00m 15s) [15:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:36] I need to fix these timeouts before the train btw :/ [15:13:38] RECOVERY - tileratorui on maps-test2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.092 second response time [15:13:39] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.094 second response time [15:13:48] RECOVERY - tileratorui on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.005 second response time [15:13:49] RECOVERY - tileratorui on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.007 second response time [15:13:49] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.030 second response time [15:13:58] RECOVERY - tileratorui on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.005 second response time [15:13:59] RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.019 second response time [15:13:59] RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.078 second response time [15:14:00] RECOVERY - tileratorui on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.078 second response time [15:14:08] RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.089 second response time [15:14:08] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.092 second response time [15:14:09] RECOVERY - tilerator on maps2003 is OK: HTTP OK: 
HTTP/1.1 200 OK - 305 bytes in 0.099 second response time [15:14:18] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.023 second response time [15:14:19] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [15:14:19] RECOVERY - tileratorui on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.092 second response time [15:14:19] RECOVERY - tilerator on maps-test2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.099 second response time [15:14:28] RECOVERY - tilerator on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.095 second response time [15:14:29] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.096 second response time [15:14:34] (03PS2) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [15:19:25] !log restbase deploy end of 0c06fb7 [15:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] hashar dcausse the swat isn't over yet, right? [15:21:10] marostegui: no not yet, I need to do some cleanups [15:21:20] Sure :-) [15:22:53] hashar: how do you create these commits : https://gerrit.wikimedia.org/r/#/c/326948/ ? [15:24:59] 06Operations, 10Analytics, 10EventBus, 13Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869040 (10mobrovac) >>! In T153028#2868979, @Ottomata wrote: > Hmmm, is `/srv/log` is the standard location for process loggi... [15:26:22] 06Operations, 10Analytics, 10EventBus, 13Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869044 (10mobrovac) Permissions are now ok: ``` mobrovac@kafka1001:~$ ls -alhF /var/log/eventlogging/ total 2.4G drwxr-xr-x... 
[15:26:30] can't find the branches with ls-remote, not sure I'm on the right remote tho
[15:26:36] Operations, Analytics, EventBus, Patch-For-Review, and 2 others: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2869045 (Ottomata) Lower chance of filling up `/`, but higher chance of filling up the same partition on which Kafka stores...
[15:26:49] (CR) Alexandros Kosiaris: [C: 2] RESTBase: Add trending edits service config portion [puppet] - https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) (owner: Mobrovac)
[15:27:26] (CR) Alexandros Kosiaris: [C: 2] "Removing mobrovac's -1 per IRC discussion, proceeding with merge and deployment" [puppet] - https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) (owner: Mobrovac)
[15:27:31] (PS2) Alexandros Kosiaris: RESTBase: Add trending edits service config portion [puppet] - https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) (owner: Mobrovac)
[15:27:35] (CR) Alexandros Kosiaris: [V: 2 C: 2] RESTBase: Add trending edits service config portion [puppet] - https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) (owner: Mobrovac)
[15:29:50] dcausse: hoo sorry was having a break
[15:29:56] dcausse: locally checkout mediawiki/core
[15:30:03] and branch wmf/1.29.0-wmf.5
[15:30:14] then: git submodule update extensions/CirrusSearch
[15:30:34] hashar: ok thanks, trying
[15:30:34] then head to the extension and fetch / checkout the latest origin/wmf/1.29.0-wmf.5
[15:30:43] then git status
[15:30:50] will show that extensions/CirrusSearch has new commits
[15:30:56] gotta git add
[15:30:58] and commit / send
[15:31:03] ok
[15:32:05] that is supposed to be bumped automatically
[15:33:14] (PS5) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949
[15:33:50] 
(03PS2) 10Alexandros Kosiaris: k8s::apiserver: Remove unused master_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/326452 [15:33:52] (03PS3) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [15:33:54] (03PS3) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [15:33:56] (03PS3) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [15:33:59] (03PS12) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:34:00] (03PS11) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [15:34:02] (03PS1) 10Alexandros Kosiaris: Introduce argon, chlorine as kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/326973 [15:36:09] (03PS2) 10Alexandros Kosiaris: Introduce argon, chlorine as kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/326973 (https://phabricator.wikimedia.org/T152966) [15:36:11] (03PS3) 10Alexandros Kosiaris: k8s::apiserver: Remove unused master_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/326452 [15:36:13] (03PS4) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [15:36:15] (03PS4) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [15:36:17] (03PS4) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [15:36:19] (03PS13) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:36:21] (03PS12) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker 
[puppet] - https://gerrit.wikimedia.org/r/324213
[15:37:38] (PS6) Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - https://gerrit.wikimedia.org/r/325949
[15:39:39] (PS3) Alexandros Kosiaris: Introduce argon, chlorine as kubernetes masters [puppet] - https://gerrit.wikimedia.org/r/326973 (https://phabricator.wikimedia.org/T152966)
[15:39:45] (CR) Alexandros Kosiaris: [V: 2 C: 2] Introduce argon, chlorine as kubernetes masters [puppet] - https://gerrit.wikimedia.org/r/326973 (https://phabricator.wikimedia.org/T152966) (owner: Alexandros Kosiaris)
[15:40:09] hashar: (sorry can't +2 again): https://gerrit.wikimedia.org/r/#/c/326976/
[15:43:37] so I checked on mwdebug1002 and I can see 200s for the URL that returns a 504, but total time taken 80428114 µs (so more or less the 80s that dcausse was talking about)
[15:44:22] on text, proxy_read_timeout is 180s but it is clearly not the one that we are hitting
[15:44:25] I'm getting 200s after a little longer than 40 secs
[15:44:28] some of the default ones are 60s
[15:47:32] hashar: sorry I'm stupid, I can +2 in fact
[15:48:07] !log restbase restarting RB in prod for https://gerrit.wikimedia.org/r/#/c/326529/
[15:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:35] Operations, Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2869122 (Gehel)
[15:48:45] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::webserver: add TLS local proxy [puppet] - https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) (owner: Giuseppe Lavagetto)
[15:48:49] Operations, Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2869135 (Gehel) p:Triage>High
[15:48:52] (PS4) Giuseppe Lavagetto: role::mediawiki::webserver: add TLS local proxy [puppet] - 
10https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) [15:50:42] marostegui: should be clean now, sorry for the delay [15:50:59] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mediawiki::webserver: add TLS local proxy [puppet] - 10https://gerrit.wikimedia.org/r/325591 (https://phabricator.wikimedia.org/T152074) (owner: 10Giuseppe Lavagetto) [15:53:16] elukey: I've captured one such backend request with varnishlog: [15:53:19] - Timestamp Fetch: 1481644304.894215 40.123569 40.123569 [15:53:55] (03PS7) 10Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 [15:54:02] (03CR) 10Rush: [V: 032 C: 032] labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 (owner: 10Rush) [15:54:10] ema: with X-Wikimedia-Debug set to mwdebug1002 ? [15:54:59] elukey: nope [15:55:01] - RespHeader Server: mw1240.eqiad.wmnet [15:56:28] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:34] <_joe_> that's me ^^ [15:56:41] I always get a 504 trying to it mwdebug1002.. but on it, I see a 200 access log entry in apache access logs with 80s as time taken to complete [15:58:52] (03PS1) 10Andrew Bogott: Keystone: Publish credentials for novaobserver account [labs/private] - 10https://gerrit.wikimedia.org/r/326979 (https://phabricator.wikimedia.org/T150092) [15:59:39] elukey, ema I've reverted the change on mwdebug1002 [15:59:50] (03CR) 10Volans: [C: 04-1] "Nice! Mostly minor comments inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:00:44] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1083 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326945 (owner: 10Jcrespo) [16:00:57] MW -> elastic will always timeout at 40s now instead of 80s, so you may not be able to reproduce [16:01:19] (03CR) 10Andrew Bogott: [V: 032 C: 032] Keystone: Publish credentials for novaobserver account [labs/private] - 10https://gerrit.wikimedia.org/r/326979 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [16:01:29] (03Merged) 10jenkins-bot: mariadb: Pool db1083 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326945 (owner: 10Jcrespo) [16:03:20] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 with 100% load (duration: 00m 45s) [16:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:29] (03PS1) 10Jcrespo: mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) [16:06:17] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2869185 (10Papaul) [16:07:10] (03CR) 10Volans: [C: 04-1] "Probably worth a try to move it to Python 3 for better performances if you don't plan to use any external library not yet py3 compatible." [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:07:20] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326941 [16:07:40] jynus: i will wait for you to deploy and then I will deploy myself [16:07:52] I am not deploying that yet [16:07:58] too soon [16:08:08] ok so I can go ahead then? 
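A rough way to reason about the 504s from the search timeout thread above: the client gets a gateway timeout whenever some proxy in front of MediaWiki waits less time than MediaWiki takes to answer, even though Apache still logs a 200. The sketch below is only a model, not a diagnosis. The 180 s, 80 s and 40 s figures are the ones quoted in the channel; the 60 s proxy is the unconfirmed guess made there, and the names `Proxy` and `first_to_give_up` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Proxy:
    name: str
    timeout_s: float  # how long this proxy waits for the layer behind it


def first_to_give_up(proxies: Sequence[Proxy], response_time_s: float) -> Optional[Proxy]:
    """Return the proxy whose timeout fires first, or None if all of them wait long enough."""
    expired = [p for p in proxies if p.timeout_s < response_time_s]
    if not expired:
        return None
    return min(expired, key=lambda p: p.timeout_s)


# Assumption: only the 180 s value was actually read from a config in the log;
# the 60 s entry stands in for "some of the default ones are 60s".
PROXIES = [
    Proxy("text proxy_read_timeout", 180.0),
    Proxy("unidentified proxy (guessed 60 s default)", 60.0),
]

# Regex searches were allowed 80 s, so MediaWiki could take about 80 s to answer.
print(first_to_give_up(PROXIES, 80.0))
# After the revert the MediaWiki -> elastic timeout is 40 s for every query type.
print(first_to_give_up(PROXIES, 40.0))
```

Under these assumptions the 80 s case is cut off by the guessed 60 s layer while the 40 s case passes through every proxy, which is consistent with the 200s seen after a little over 40 seconds. It does not say which component actually holds the 60 s limit.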
[16:08:10] I was even thinking to wait for tomorrow
[16:08:16] yes, please
[16:08:20] thanks
[16:08:49] at these times enwiki becomes too unmanageable
[16:09:36] 45K QPS with 3 servers
[16:10:03] not that we cannot handle that with 2
[16:10:19] but 50% would be on an unproven version
[16:10:31] !log Shut down db2034 and db2048 for maintenance - T149553
[16:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:45] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[16:13:34] (CR) Marostegui: "check" [mediawiki-config] - https://gerrit.wikimedia.org/r/326941 (owner: Marostegui)
[16:14:05] (PS3) Rush: WIP tools: job to copytruncate logs in place [puppet] - https://gerrit.wikimedia.org/r/326153
[16:16:11] (Abandoned) Rush: Fixes and improvements for maintain-meta_p [software] - https://gerrit.wikimedia.org/r/304425 (owner: Alex Monk)
[16:16:39] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - https://gerrit.wikimedia.org/r/326941 (owner: Marostegui)
[16:18:17] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - https://gerrit.wikimedia.org/r/326941 (owner: Marostegui)
[16:19:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 - T148967 (duration: 00m 46s)
[16:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:52] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[16:23:28] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2869231 (Papaul) @akosiaris Thanks Firmware update complete, I am waiting on HP to call me so I can provide them with the new log. Before...
[16:25:12] (CR) Ottomata: [C: 2] "No, puppet will do that." 
[puppet] - 10https://gerrit.wikimedia.org/r/326490 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [16:25:38] (03PS2) 10Ottomata: s/stat1001/throrium [puppet] - 10https://gerrit.wikimedia.org/r/326490 (https://phabricator.wikimedia.org/T149438) [16:26:04] (03CR) 10Ottomata: [V: 032 C: 032] s/stat1001/throrium [puppet] - 10https://gerrit.wikimedia.org/r/326490 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [16:34:59] (03PS1) 10Ottomata: Remove a few more references to stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/326988 (https://phabricator.wikimedia.org/T149438) [16:35:14] (03PS1) 10DCausse: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 [16:35:20] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2869272 (10Cmjohnson) [16:36:20] (03PS2) 10DCausse: [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 [16:36:43] (03CR) 10Ottomata: [C: 032] Remove a few more references to stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/326988 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [16:40:54] (03PS3) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [16:44:03] (03CR) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:44:28] (03PS2) 10Jcrespo: mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) [16:44:48] (03PS1) 10Andrew Bogott: Add comment headers to check_keystone icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/326991 [16:48:41] 06Operations, 10netops: Enabling IGMP 
snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2869377 (10faidon) After a few back and forths with JTAC, the case was escalated to the Advanced TAC (aka ATAC). The issue was thankfully replicated in their... [16:50:10] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::webserver: fix TLS cert name [puppet] - 10https://gerrit.wikimedia.org/r/326993 [16:53:32] (03PS1) 10Ottomata: Symlink limn-public-data so it is accessible at datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/326994 (https://phabricator.wikimedia.org/T149438) [16:56:08] (03CR) 10Ottomata: [C: 032] Symlink limn-public-data so it is accessible at datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/326994 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1700). Please do the needful. [17:00:04] Dereckson and bd808: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:01:03] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 [17:01:29] (03CR) 10Marostegui: [C: 04-2] "It is still lagging after being stopped for a few hours" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 (owner: 10Marostegui) [17:02:43] Dereckson bd808 ^ puppet swat [17:02:47] * Dereckson there [17:02:55] o/ [17:03:14] (03PS1) 10Ppchelko: Trending-Edits: Fix a typo in service URI. [puppet] - 10https://gerrit.wikimedia.org/r/326998 [17:03:27] ok! 
taking a look at the patches now, same order as the wiki page
[17:03:29] So, one is a DNS change, the other a Puppet change to deploy a software, arc, to new servers, for labs
[17:04:57] (PS4) Filippo Giunchedi: Install arcanist in toollabs::dev_environ [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson)
[17:05:41] that datasets.wm.o name gets me every time
[17:06:04] is stat1002 ok?
[17:06:32] ottomata: ^^ I ask because I saw several rsync jobs fail to there from a few different hosts
[17:06:37] in the last few minutes
[17:06:53] (PS3) Filippo Giunchedi: DNS configuration for arbcom-cs.wikipedia.org [dns] - https://gerrit.wikimedia.org/r/323851 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio)
[17:07:29] (CR) Filippo Giunchedi: [C: 2] Install arcanist in toollabs::dev_environ [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson)
[17:07:57] apergos: datasets.wm.o was on stat1001, now is on thorium
[17:08:03] https://phabricator.wikimedia.org/T149438
[17:08:05] just changed this morning
[17:08:06] Dereckson: arcanist one is merged, please check on toollabs if you can
[17:08:08] where are jobs failing?
[17:08:14] afaik puppetized stuff is good [17:08:24] but there may be manual stuff that was not puppetized (stat1001 is old) that might be breaking [17:08:39] I /usr/bin/rsync -rt --delete stat1002.eqiad.wmnet::srv/aggregate-datasets/* /srv/aggregate-datasets/ [17:08:51] /usr/bin/rsync -rt --delete stat1002.eqiad.wmnet::srv/aggregate-datasets/* /srv/aggregate-datasets/ [17:08:57] and one of mine (from dataset1001) also [17:09:01] all in the last few minutes [17:09:27] bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1002.eqiad.wmnet::hdfs-archive/{pageview,projectview}/legacy/hourly/ /data/xmldatadumps/public/other/pageviews/' [17:09:28] there's mine [17:09:34] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2869435 (10Marostegui) @kaldari how is it going? you reckon it will end soon? Just asking to see if we need to extend the downtime... [17:09:35] godog: what's the server name? [17:10:43] rsync: failed to connect to stat1002.eqiad.wmnet (10.64.5.102): Connection refused (111) these are the errors for each job, ottomata [17:10:59] Dereckson: I guess any labs instance that has class applied, not sure which [17:11:19] HMmmm [17:11:23] the one I tested The last Puppet run was at Tue Dec 13 16:40:53 UTC 2016 (27 minutes ago). [17:11:28] looking, [17:11:29] (03PS4) 10Filippo Giunchedi: DNS configuration for arbcom-cs.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/323851 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [17:11:33] thanks! 
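For reading the rsync failure pasted above: errno 111 on Linux is ECONNREFUSED, meaning the host answered but nothing was accepting connections on the rsync port, which points at the daemon rather than at DNS or routing. A minimal sketch that pulls the host, address and errno out of that one message shape; the regex and the helper name `parse_connect_failure` are illustrative assumptions, not part of rsync or of any tooling mentioned in the channel.

```python
import re
from typing import Optional

# Matches only the "failed to connect" client message quoted in the log above;
# it is not a general rsync log parser.
CONNECT_FAILURE = re.compile(
    r"rsync: failed to connect to (?P<host>\S+) \((?P<addr>[^)]+)\): "
    r"(?P<reason>.+?) \((?P<errno>\d+)\)"
)


def parse_connect_failure(line: str) -> Optional[dict]:
    """Extract host, address, reason and errno from an rsync connect failure line."""
    match = CONNECT_FAILURE.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "addr": match.group("addr"),
        "reason": match.group("reason"),
        "errno": int(match.group("errno")),
    }


SAMPLE = (
    "rsync: failed to connect to stat1002.eqiad.wmnet (10.64.5.102): "
    "Connection refused (111)"
)

info = parse_connect_failure(SAMPLE)
if info is not None and info["errno"] == 111:
    # 111 is the Linux value of ECONNREFUSED; other platforms number it differently.
    print(f"{info['host']}: reachable, but no listener on the rsync port")
```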
[17:11:42] not sure why that would change, hmmm
[17:13:12] ok apergos it looks like rsyncd crashed on stat1002
[17:13:24] i've noticed this happens sometimes when there is a puppet refresh
[17:13:25] dunno why
[17:13:27] i just restarted it
[17:13:47] hm ok, wonder why puppet didn't kick it
[17:13:52] anyways, thanks for looking
[17:14:08] (CR) Filippo Giunchedi: [C: 2] DNS configuration for arbcom-cs.wikipedia.org [dns] - https://gerrit.wikimedia.org/r/323851 (https://phabricator.wikimedia.org/T151731) (owner: MarcoAurelio)
[17:14:39] apergos: it probably would have on the next puppet run
[17:14:50] all righty then
[17:14:52] Dereckson: dns is done
[17:15:19] godog: can I deploy mediawiki-config during the puppet swat?
[17:15:32] marostegui: yeah there are no conflicts afaics
[17:15:37] cool!
[17:15:42] thanks
[17:15:46] godog: thanks for the DNS update
[17:15:52] /usr/bin/rsync -rt /a/reportupdater/output/* throrium.eqiad.wmnet::srv/limn-public-data/metrics/ ottomata, this one is for you:
[17:15:54] (PS2) Giuseppe Lavagetto: role::mediawiki::webserver: fix TLS cert name [puppet] - https://gerrit.wikimedia.org/r/326993
[17:15:56] godog: tested, works
[17:16:00] rsync: getaddrinfo: throrium.eqiad.wmnet 873: Name or service not known
[17:16:02] just now
[17:16:12] (just happened to see the cronspam fire off)
[17:17:50] OO typo
[17:18:01] mark, hi, I'm supposed to attend the technology management meeting but I can not, the bluejeans room that I got from the calendar is empty
[17:18:12] (CR) Filippo Giunchedi: "Minor nit on comment just to make sure, LGTM otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: BryanDavis)
[17:18:14] Operations, ops-codfw, Discovery, Wikidata, and 2 others: wdqs2003 switch port configuration - https://phabricator.wikimedia.org/T153094#2869454 (Papaul)
[17:18:20] mforns: it's not in bluejeans anymore, it's on hangouts
[17:18:31] 
greg-g, thanks, can you pass the link please? [17:18:35] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2855908 (10Papaul) [17:18:47] (03PS1) 10Ottomata: Fix thorium typo [puppet] - 10https://gerrit.wikimedia.org/r/327000 (https://phabricator.wikimedia.org/T149438) [17:18:55] godog: looking at that comment [17:18:59] (03PS1) 10Giuseppe Lavagetto: Rename the apaches.svc keys [labs/private] - 10https://gerrit.wikimedia.org/r/327002 [17:19:29] Dereckson: np, let me know about the arcanist verification or I can take a look at the end of swat [17:19:35] (03PS2) 10Ottomata: Fix thorium typo [puppet] - 10https://gerrit.wikimedia.org/r/327000 (https://phabricator.wikimedia.org/T149438) [17:19:59] (03CR) 10BryanDavis: logstash: Add processing rules for MediaWiki's exception channel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [17:20:04] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Rename the apaches.svc keys [labs/private] - 10https://gerrit.wikimedia.org/r/327002 (owner: 10Giuseppe Lavagetto) [17:20:06] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 [17:20:16] (03CR) 10Ottomata: [V: 032 C: 032] Fix thorium typo [puppet] - 10https://gerrit.wikimedia.org/r/327000 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [17:20:44] (03PS4) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [17:21:03] godog: tested on tools-bastion-02, works fine [17:21:18] Dereckson: neat, thanks! [17:21:26] bd808: ok, thanks for looking! 
[17:21:42] godog: thanks for the deploy [17:21:42] (03CR) 10Marostegui: [C: 031] "server caught up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 (owner: 10Marostegui) [17:21:50] (03PS4) 10Filippo Giunchedi: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [17:22:21] bd808: preferences for logstash config changes to be batched vs separate? [17:22:23] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::webserver: fix TLS cert name [puppet] - 10https://gerrit.wikimedia.org/r/326993 [17:22:50] thanks apergos should be fixed [17:22:59] nice! [17:23:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mediawiki::webserver: fix TLS cert name [puppet] - 10https://gerrit.wikimedia.org/r/326993 (owner: 10Giuseppe Lavagetto) [17:27:12] godog: batched is probably better [17:27:21] only one service restart then [17:27:32] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:27:43] (03CR) 10Filippo Giunchedi: [C: 032] logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [17:27:48] (03PS4) 10Filippo Giunchedi: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [17:27:54] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 (owner: 10Marostegui) [17:27:58] (03PS5) 10Filippo Giunchedi: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [17:28:09] rebase wars [17:28:17] bd808: ok I'll batch them [17:28:22] PROBLEM - Check systemd state on mw2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:28:32] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-full] [17:28:32] PROBLEM - DPKG on mw2017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:28:39] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326997 (owner: 10Marostegui) [17:29:32] RECOVERY - DPKG on mw2017 is OK: All packages OK [17:30:06] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [17:30:06] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2067 - T151552 (duration: 00m 47s) [17:30:16] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [17:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:19] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [17:31:09] (03PS5) 10Filippo Giunchedi: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [17:31:16] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [17:32:22] RECOVERY - Check systemd state on mw2017 is OK: OK - running: The system is fully operational [17:32:32] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:34:32] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2869535 (10kaldari) @Marostegui: Apparently, it's taking a long time to complete the script on loginwiki, which I had 
completely f... [17:34:35] bd808: logstash done/restarted, can you verify? [17:34:57] godog: I'm watching for errors... [17:36:35] 06Operations, 10Prod-Kubernetes, 10vm-requests, 13Patch-For-Review, 07kubernetes: Site: 2 VM request for kubernetes - https://phabricator.wikimedia.org/T152966#2869538 (10akosiaris) 05Open>03Resolved VMs up and running [17:45:49] (03CR) 10DCausse: "we will try to debug this again with ema tomorrow, apparently nginx should timeout at 180s, it's either a problem with our timeout handlin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (owner: 10DCausse) [17:47:54] bd808: ok thanks, I'm taking a look at the l10n one [17:48:08] 06Operations, 06Labs: Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#2869590 (10chasemp) [17:56:32] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:57:42] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [17:58:17] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2869654 (10Papaul) Spend an hour with HP on the phone.The HP person i spoke to name is Chandi. They came to the conclusion that since the syste... [17:59:42] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:59:43] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1800). Please do the needful. 
[18:00:23] godog: I think the logstash changes are all good [18:00:50] (03CR) 10Filippo Giunchedi: [C: 04-1] "Doubts on flock usage" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis) [18:01:07] bd808: awesome! [18:05:48] godog: thanks for the CR on the flock patch. I may or may not ever follow up on that. It seems not highly wanted by releng [18:07:00] bd808: fair enough, I remember of some discussion to whole the either move the whole process inside scap or rethink it [18:07:30] yeah. it would be easier to manage in python *if* we actually even need it in the modern day [18:08:03] wow "some discussion to whole the either move the whole process" [18:08:27] but ok, I think puppet swat is done [18:08:54] bd808: ok! thanks for the context [18:10:47] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2869697 (10faidon) [18:11:22] (03PS18) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [18:11:24] (03PS18) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [18:11:26] (03PS16) 10BBlack: cache_misc req_handling: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [18:11:28] (03PS17) 10BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) [18:11:58] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10faidon) D1 to D8 was patched with fiber QSFP+s (et-1/1/0 <-> et-8/1/0). 
The no-name optics we bought in T149726 appear as QSFP+-40G-CU... [18:17:55] andrewbogott: this is a really late response but when icinga fails to get the verbose error it's 'icinga -v /etc/icinga/icinga.cfg [18:18:13] mutante: thanks! I just looked in the syslog :) [18:18:43] alright:) [18:22:42] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:23:49] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port apache httpd metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147316#2869752 (10elukey) Tried to install the deb package manually on mw2198 and I needed to modify the following in /etc/default/prometheus-apache-exp... [18:26:41] (03CR) 10Thcipriani: "Looks good overall! One comment inline about an old path that's still being used." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [18:28:22] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:28] 06Operations: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822#2869779 (10fgiunchedi) Another use case for internal certs: {T153042} [18:34:08] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2869811 (10EBernhardson) All search is currently served by codfw, we are expecting to switch it back to eqiad in the next few days (after some... [18:36:26] (03PS2) 10Filippo Giunchedi: Trending-Edits: Fix a typo in service URI. [puppet] - 10https://gerrit.wikimedia.org/r/326998 (owner: 10Ppchelko) [18:37:49] (03CR) 10Filippo Giunchedi: [C: 032] Trending-Edits: Fix a typo in service URI. 
[puppet] - 10https://gerrit.wikimedia.org/r/326998 (owner: 10Ppchelko) [18:38:22] PROBLEM - NTP on elastic2020 is CRITICAL: NTP CRITICAL: Offset unknown [18:38:32] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:40:35] (03PS1) 10Andrew Bogott: Designate: raise recordset/record quotas [puppet] - 10https://gerrit.wikimedia.org/r/327017 [18:44:09] 06Operations, 10ops-eqiad: eqiad: payments1004 failed PSU - https://phabricator.wikimedia.org/T153103#2869845 (10Cmjohnson) [18:44:18] 06Operations, 10ops-eqiad: eqiad: payments1004 failed PSU - https://phabricator.wikimedia.org/T153103#2869864 (10Cmjohnson) 05Open>03Resolved Completed [18:45:17] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2869872 (10Cmjohnson) a:05Cmjohnson>03jcrespo @jcrespo The new disk is installed. Assigning to you, resolve once complete or back to me if there are any issues. [18:45:31] (03CR) 10Andrew Bogott: [C: 032] Designate: raise recordset/record quotas [puppet] - 10https://gerrit.wikimedia.org/r/327017 (owner: 10Andrew Bogott) [18:45:36] (03PS2) 10Andrew Bogott: Designate: raise recordset/record quotas [puppet] - 10https://gerrit.wikimedia.org/r/327017 [18:45:48] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2869875 (10Cmjohnson) Return part tracking number 9202 3946 5301 2434 9841 72 [18:47:30] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2869878 (10fgiunchedi) @kaldari ok! 
downtimed db1028 for another 12h [18:50:16] (03PS1) 10Elukey: Add upstream source files [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/327020 [18:50:56] !log mobrovac@tin Starting deploy [trending-edits/deploy@84db7b8]: (no message) [18:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:18] !log mobrovac@tin Finished deploy [trending-edits/deploy@84db7b8]: (no message) (duration: 00m 22s) [18:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:08] (03PS1) 10Dzahn: delete wikitech.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) [18:56:52] 06Operations: Cron spam caused by ieee-data cron job - https://phabricator.wikimedia.org/T149681#2869947 (10fgiunchedi) [18:56:54] 06Operations, 10media-storage, 13Patch-For-Review: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2869950 (10fgiunchedi) [18:57:22] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T1900). Please do the needful. [19:01:05] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add restbase101[678] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326862 (https://phabricator.wikimedia.org/T150964) (owner: 10Filippo Giunchedi) [19:01:09] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2869974 (10Dereckson) 05Open>03Resolved Arcanist is available on main tools. 
server and we don't have a use case for the tasks grid to use arc, so we can co... [19:01:12] (03PS2) 10Filippo Giunchedi: hieradata: add restbase101[678] [puppet] - 10https://gerrit.wikimedia.org/r/326862 (https://phabricator.wikimedia.org/T150964) [19:06:32] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:06:52] (03PS1) 10Papaul: DNS: Add production DNS for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/327024 (https://phabricator.wikimedia.org/T152612) [19:09:04] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2869997 (10Papaul) [19:09:52] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:10:42] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:14:39] (03PS1) 10Legoktm: Use packaged uprightdiff in testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/327028 [19:15:16] bblack: ema: hi! got a sec to advise about how to block Special:BannerLoader from googleweblight's spidery UA? thx in advance! [19:15:31] T152602 [19:15:36] https://phabricator.wikimedia.org/T152602 [19:17:16] Basically we're unintentionally serving banners through that service, which munges the page horribly, and geolocation is broken, and lots of users in countries that are not targeted by a FR campaign now have a pretty bad experience :( [19:17:39] (sorry, I meant, geolocation is just broken for users who visit us through that service) [19:17:53] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870027 (10Dzahn) I see that we have role/common/eventlogging.yaml adding admin groups, eventlogging-admins and ev... 
[19:17:58] https://googleweblight.com/?lite_url=https://en.wikipedia.org [19:19:33] (03PS2) 10Dzahn: DNS: Add mgmt DNS entries for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/326474 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:19:56] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870028 (10elukey) >>! In T152621#2870027, @Dzahn wrote: > I see that we have role/common/eventlogging.yaml adding... [19:21:30] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2870029 (10mobrovac) [19:22:01] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772430 (10mobrovac) [19:23:42] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2870041 (10mobrovac) [19:23:51] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870043 (10Dzahn) ah, ok! (Let's keep using roles to add admin groups instead of host names though. It means les... 
[19:24:07] AndyRussG: can't read the task, restricted [19:27:24] (03PS1) 10Papaul: DHCP: Add DHCP entries for ms-fe200[5-8] Bug: T152612 [puppet] - 10https://gerrit.wikimedia.org/r/327029 (https://phabricator.wikimedia.org/T152612) [19:27:49] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS entries for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/326474 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:28:03] AndyRussG: also, do we even want googleweblight? I thought our own m-dot domains were covering that use-case? [19:28:15] AndyRussG: it can be opted out of with "CC: no-transform".... [19:28:51] probably questions for people other than you and me, but still... [19:28:54] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870078 (10RobH) a:05Cmjohnson>03RobH I'm stealing the remainder of this task from Chris. I'll update and get these installed later today. [19:29:00] bblack: hey...! sorry I didn't see the task was limited [19:29:08] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2870080 (10Papaul) @EBernhardson Thanks I will leave this task open for now. [19:29:09] PROBLEM - Restbase root url on restbase1016 is CRITICAL: connect to address 10.64.0.31 and port 7231: Connection refused [19:29:17] I think we might want it, but yeah that's a question for mobile folks [19:29:20] A lot of people are using it [19:29:32] Right now we just want to turn off the banners on it... [19:29:39] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused [19:29:49] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:30:02] AndyRussG: how do we turn off banners in general? 
[19:30:09] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:30:15] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2870083 (10elukey) I agree, it was a temporary measure to fix the immediate issue of the eventlogging role not pre... [19:30:49] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): Package[cassandra/metrics-collector],Package[restbase/deploy],Package[cassandra/logstash-logback-encoder] [19:31:09] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.31, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [19:31:39] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:49] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:33:27] restbase1016 is me btw [19:34:21] adding moar ms-fe mgmt [19:34:47] (03PS2) 10Dzahn: DNS: Add production DNS for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/327024 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:35:27] bblack: our normal way of turning off banners is via some data that's sent via ResourceLoader. The only way to do that would be to have a special cache split based on UA for ResourceLoader URLs, I think [19:35:39] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [19:35:49] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[cassandra-a] [19:35:49] RECOVERY - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-a valid until 2017-12-13 00:15:49 +0000 (expires in 364 days) [19:36:09] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active [19:36:27] bblack: is there a Varnish thing maybe to just turn of JS, like for old browsers? [19:36:36] That's do it in this case, I think... [19:36:49] that'd [19:36:49] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:39:13] AndyRussG: there's no varnish thing to turn off JS, no. [19:39:29] (03CR) 10Dzahn: [C: 032] DNS: Add production DNS for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/327024 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:39:48] you are looking at a cache split, but you'd need to implement the "don't show banner to UAs matching X" in the application layer, and then we'd also need varnish to make a cache split for it as well [19:39:58] (03CR) 10Tim Landscheidt: "I haven't tested it, but @hashar's explanation sounds solid to me." 
[software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [19:40:05] (03PS2) 10Dzahn: DHCP: Add DHCP entries for ms-fe200[5-8] Bug: T152612 [puppet] - 10https://gerrit.wikimedia.org/r/327029 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:40:33] unless you know of some simpler hack to disable banner outputs (like Varnish detecting the UA and then returning a 404 for /what/banners/to/show/ on its own) [19:40:46] !log on labservices1001 and 1002, fixing ID overflow with alter table records modify id BIGINT AUTO_INCREMENT NOT NULL; [19:40:51] but I don't think the URL schema for this is that simple with RL in the way [19:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:52] !log ms-fe200[5-8] mgmt and servers added to DNS [19:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:45] (03CR) 10MaxSem: [C: 031] "Usage level:" [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [19:47:32] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2870147 (10Dzahn) >This hasn't been touched in quite a while, so lowering priority I know this is a general Phabricato... [19:47:56] !log bootstrap cassandra-a on restbase1016 T150964 [19:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:08] T150964: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964 [19:48:08] bblack: yeah, cache split for special handling on the app layer sounds complex.... 
there is a pretty simple hack that would definitely work for now [19:48:30] We have an unlisted special page that actually serves the banners [19:48:59] If we make it return an error code for the googleweblight ua we'd be fine for now, fer surz [19:49:02] (03CR) 10Dzahn: [C: 032] DHCP: Add DHCP entries for ms-fe200[5-8] Bug: T152612 [puppet] - 10https://gerrit.wikimedia.org/r/327029 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [19:49:31] I think the mobile team might want to look into more details of how to handle this service, like u said, that's a much larger question.... [19:49:40] Special:BannerLoader [19:49:42] is the page [19:50:13] If we return an error code it'll go to an error handler client-side, which works (no banner and we get notified that that's what happened) [19:51:02] (03PS2) 10Dzahn: Enable Parsoid's linter on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/326232 (owner: 10Arlolra) [19:51:04] it's a pretty big issue--on one day alone in the FR campaign, I found 2600000 hits on beacon/impression with that ua [19:52:25] that's still in the ballpark of 0.3% of pageviews [19:52:27] but yeah [19:52:38] something like that anyways, very very rough math [19:53:40] AndyRussG: can you be specific about the tech details then? The UA string or regex to match, the path (or path regex) to error on, and what HTTP error code to return? [19:54:31] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870220 (10RobH) a:05RobH>03Cmjohnson Actually, none of these have their network ports labeled with either their asset tags. I also checked for their old names... [19:55:04] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2870225 (10Marostegui) Thanks @fgiunchedi! 
[19:55:45] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870240 (10RobH) a:05Cmjohnson>03RobH [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161213T2000). [20:00:25] !log uploaded prometheus-apache-exporter 0.3-1 to jessie-wikimedia main [20:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:39] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:03:35] The UA to match would be just 'googleweblight' anywhere in the string... Here's the UA I'm seeing: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19 [20:04:06] but I'm sure build and stuff will change... I don't see a reason to match on anything more than the googleweblight bit [20:04:19] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [20:04:22] The error code... Yeah that's a question to figure out... [20:04:25] Lemme see [20:05:17] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2870271 (10elukey) >>! In T149451#2868987, @Ottomata wrote: >> The best thing ever would be to consume the webrequest topic, filter 5xx and push them back to kafka in another topic, but we are prob... [20:10:19] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:14:29] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:14:55] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port apache httpd metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147316#2870318 (10elukey) Package uploaded to `jessie-wikimedia main` and installed (as test) on `mw2198`. Everything works as expected! [20:18:40] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2870330 (10RobH) [20:18:42] 06Operations, 10ops-codfw, 10netops: ms-fe200[5-8] switch port configuration - https://phabricator.wikimedia.org/T152627#2870328 (10RobH) 05Open>03Resolved Done! [20:19:27] (03PS1) 10Andrew Bogott: Designate: fix a copy/paste error in secondary pool_target [puppet] - 10https://gerrit.wikimedia.org/r/327035 [20:20:09] (03CR) 10Dzahn: [C: 032] Enable Parsoid's linter on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/326232 (owner: 10Arlolra) [20:20:31] bblack: I don't actually see any http error codes that seem right. Not intimately familiar with http error stati... I guess maybe 403? 
https://en.wikipedia.org/wiki/HTTP_403 [20:20:53] Alternately, we could return 200 and a specific string to trigger the error handler on the client [20:21:09] (03PS2) 10Andrew Bogott: Designate: fix a copy/paste error in secondary pool_target [puppet] - 10https://gerrit.wikimedia.org/r/327035 [20:22:05] (03PS3) 10Andrew Bogott: Designate: fix a copy/paste error in secondary pool_target [puppet] - 10https://gerrit.wikimedia.org/r/327035 [20:22:07] For example, we could just return "mw.centralNotice.handleBannerLoaderError( 'Forbidden UA: googleweblight' );" [20:22:50] ok [20:22:56] I can do that [20:23:09] (03CR) 10Andrew Bogott: [C: 032] Designate: fix a copy/paste error in secondary pool_target [puppet] - 10https://gerrit.wikimedia.org/r/327035 (owner: 10Andrew Bogott) [20:23:31] and for only exactly the URL /wiki/Special:BannerLoader ? (no i18n variants for some language wikis on the URL path? no parameters? etc...) [20:23:59] bblack: K! If http doesn't support an error code for forbidding a specific kind of UA, then I guess 200 with that string is best [20:24:08] I think Spezial: and such has bit us in the past [20:24:22] Ah hmm heh could be [20:24:26] 403 makes sense, but it spikes out graphs with pointless 4xx's too [20:24:34] Ah K [20:24:34] and might not have the best behaviors in all UAs, either [20:24:42] K let's do the 200 then [20:24:55] The requests go to meta [20:24:58] They do have parameters [20:25:00] we have existing rules in VCL about banner urls that don't take into account localized "Special:" either [20:25:07] AndyRussG: hi! [20:25:12] that whole thing's a mess, we never should've had Special: localized [20:25:12] awight: boo! 
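A minimal sketch of the decision reached above: answer 200 with a body that triggers the client-side error handler whenever the User-Agent contains `googleweblight`. It is written in Python purely for illustration; the real rule would live in Varnish VCL, which this does not attempt to reproduce. The function name `banner_loader_override` and the plain case-sensitive substring match are assumptions. The body string and the sample UA are the ones quoted in the channel.

```python
from typing import Optional, Tuple

# Token to look for anywhere in the User-Agent, per the discussion.
BLOCKED_UA_TOKEN = "googleweblight"

# Body proposed in the channel: it calls the CentralNotice client-side
# error handler, so no banner is shown and the failure is still reported.
ERROR_BODY = (
    "mw.centralNotice.handleBannerLoaderError( 'Forbidden UA: googleweblight' );"
)


def banner_loader_override(user_agent: str) -> Optional[Tuple[int, str]]:
    """Return (status, body) to short-circuit the request, or None to pass it on.

    Hypothetical helper: a 200 is used instead of 403 so that 4xx graphs
    do not spike, as argued in the channel.
    """
    if BLOCKED_UA_TOKEN in user_agent:
        return 200, ERROR_BODY
    return None


if __name__ == "__main__":
    sample = (
        "Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) "
        "AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) "
        "Chrome/38.0.1025.166 Mobile Safari/535.19"
    )
    print(banner_loader_override(sample))
```

Matching only the token, rather than the full UA string, follows the point made earlier that build numbers and versions will change.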
[20:25:15] Let's not couple directly to the .js tho [20:25:29] ok [20:25:30] in that light, a 50x code might be the wickedest [20:25:40] 50x would be "wrong", the server is not failing [20:25:40] (awight: ^ see above, 400's will spike out prod's graphs= [20:25:42] ) [20:25:46] yeah [20:25:49] Unless that's hard to detect from the calling code, of course [20:25:50] Heh, #til about code 409 "conflict" [20:26:14] O_O [20:26:17] bblack: the calls are I think in ugly url format [20:26:35] just give me a URL regex then [20:26:37] "Indicates that the request could not be processed because of conflict in the request, such as an edit conflict " [20:27:04] We could send 418 I'm a teapot just to mess with them ;-) [20:27:11] I think 409 is webdav [20:27:41] Yeah, likely what it was intended for. Curious how browsers would behave if we served 419 on edit conflicts.... [20:27:51] hmm not explicitly in webdav RFCs, but still, it's for PUT [20:27:51] Probably not well [20:28:20] bblack: wmf-config/CommonSettings.php: $wgCentralSelectedBannerDispatcher = "//{$wmfHostnames['meta']}/w/index.php?title=Special:BannerLoader"; [20:29:00] So just if the url query has a title param set to Special:BannerLoader. No localiztion [20:29:14] !log twentyafterfour@tin Started scap: mediawiki 1.29.0-wmf.6: build l10n and sync the branch to testwikis. refs T152563 [20:29:15] (03CR) 10Dzahn: "for this to actually be applied you'll also have to do an "include mediawiwiki::maintenance::pageassessments" in modules/role/manifests/me" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [20:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:25] T152563: MW-1.29.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T152563 [20:29:38] awight: I was also worried about the hardcoded dependency on the JS. 
But I think if we put a comment in the Varnish code and in the CN JS that the two must coordinate, we'd be fine [20:29:39] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:29:59] AndyRussG: ok... do we know for sure it's always the first param? I could see it somehow ending up as /w/index.php?x=y&title=Special:BannerLoader too [20:30:28] this whole thing needs re-design anyways, I think at this point we're in "minimal production hacks for realtime production issues" mode for this campaign [20:31:00] yep [20:31:10] url = new mw.Uri( [20:31:12] mw.config.get( 'wgCentralSelectedBannerDispatcher' ) [20:31:14] ); [20:31:16] url.extend( [20:31:18] { [20:31:20] campaign: data.campaign, [20:31:22] banner: data.banner, [20:31:24] uselang: data.uselang, [20:31:26] debug: data.debug [20:31:42] yeah but who knows if the url library can re-order params or keeps that one at the start, when it re-parses and re-generates the url string [20:31:55] well someone knows, just not me [20:32:02] I can make it generic anyways [20:32:19] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:32:31] It probably is always the first param, but I wouldn't want to rely on it [20:32:42] "^/w/index.php.*[?&]title=Special:BannerLoader" ? [20:32:44] AndyRussG: okay, agreed. It's a rough pass which happens to get us everything we need in one swoop. [20:33:39] awight: cool! K thx :) [20:33:43] bblack: LGTM! [20:34:01] hmmm maybe [20:34:16] "^/w/index.php.*[?&]title=Special:BannerLoader($|&)" ? 
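The URL pattern proposed above can be checked quickly outside Varnish. This is a sketch using Python's `re` module, assuming its semantics are close enough to Varnish's regex engine for this pattern. The helper name `is_banner_loader` and the test URLs are made up for illustration; the pattern itself is copied verbatim from the channel, including the unescaped dot in `index.php`.

```python
import re

# Pattern as proposed in the channel. The trailing ($|&) stops it from
# matching longer titles such as Special:BannerLoaderLoaderator.
BANNER_LOADER_URL = re.compile(
    r"^/w/index.php.*[?&]title=Special:BannerLoader($|&)"
)


def is_banner_loader(url: str) -> bool:
    """True if the path plus query string targets Special:BannerLoader."""
    return BANNER_LOADER_URL.search(url) is not None


if __name__ == "__main__":
    candidates = [
        # title as the first parameter
        "/w/index.php?title=Special:BannerLoader",
        # title after another parameter, followed by more parameters
        "/w/index.php?x=y&title=Special:BannerLoader&campaign=C1",
        # longer title: must not match
        "/w/index.php?title=Special:BannerLoaderLoaderator",
        # pretty URL form: not covered by this pattern
        "/wiki/Special:BannerLoader",
    ]
    for url in candidates:
        print(is_banner_loader(url), url)
```

The `[?&]` before `title=` covers the worry that the URL library might reorder parameters, since the match no longer depends on `title` coming first.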
[20:34:30] so we don't catch whatever i don't know that's at Special:BannerLoaderLoaderator [20:34:35] (03CR) 10Dzahn: "ah, and there is also hieradata/role/codfw/mediawiki/maintenance.yaml (both codfw and eqiad) which is where the crons are enabled/disabled" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [20:34:40] yeah that's our next planned feature, in fact [20:34:45] Rolling out on April 1st [20:34:48] :p [20:35:17] yeah, tho, that sounds great :) [20:35:53] (03CR) 10Dzahn: "compare to https://gerrit.wikimedia.org/r/#/c/319892/" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [20:36:56] (03PS9) 10Dzahn: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [20:37:13] (03CR) 10Papaul: [C: 032] add gerrit2001.mgmt for WMF6408.mgmt [dns] - 10https://gerrit.wikimedia.org/r/325596 (https://phabricator.wikimedia.org/T148186) (owner: 10Dzahn) [20:38:19] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:39:02] bblack: so ^ maybe put an inline comment, to the effect of, "Must coordinate with the JS method mw.centralNotice.handleBannerLoaderError(), defined in CentralNotice/resources/subscribing/ext.centralNotice.display.js" [20:39:32] and the ticket ref, yeah [20:40:04] (03PS10) 10Dzahn: mediawiki: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [20:40:38] K :) thx! [20:41:01] (03CR) 10Paladox: "@Jcrespo it does that if we store large files, diffusion will host files in MySQL if there large in size." 
[puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: 10Paladox) [20:41:52] (03CR) 10Dzahn: [C: 032] mediawiki: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [20:42:29] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:43:07] Longer term, I think we should treat google web light as a non-js browser [20:43:16] (03CR) 10ArielGlenn: [C: 031] "Haven't tested but would be fine for me." [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [20:44:09] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:44:43] (03CR) 10Paladox: "https://secure.phabricator.com/book/phabricator/article/configuring_file_storage/" [puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: 10Paladox) [20:46:16] ejegg: yeah agreed! how do we do that for other browsers, btw? [20:46:29] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/fancycaptcha/words],File[/etc/fancycaptcha/badwords] [20:46:38] AndyRussG: https://phabricator.wikimedia.org/diffusion/MW/browse/master/resources/src/startup.js$34 [20:46:41] also ^ as mentioned, we may want to pull in mobile team for discssion [20:46:47] (03CR) 10Papaul: [C: 04-1] "can you please put gerrit2001 on the first line and wmf6408 on the second line?" [dns] - 10https://gerrit.wikimedia.org/r/325596 (https://phabricator.wikimedia.org/T148186) (owner: 10Dzahn) [20:48:23] ejegg: ohh... fixing that'd also be quite straightforward [20:48:25] bblack: ^ [20:50:11] ejegg: however, we'd still get the pageview/impressions disconnect... 
However, that's more of an analytics issue... Probably we could filter out that UA on pageviews [20:50:40] AndyRussG: https://gerrit.wikimedia.org/r/327043 [20:50:51] AndyRussG: do we already filter for other non-JS browsers? [20:51:00] AndyRussG: oh hey--this could already account for a huge cluster of dark matter in pv/imps [20:51:10] seems way simpler [20:51:13] * awight +1's a bit late [20:51:21] with this, no need for varnish hacks right? [20:51:35] bblack: right. [20:51:40] +1000 from me! [20:51:53] bblack: sorry for the bother!!! [20:52:00] Let's test this on someone's VPS... [20:55:37] Error: Could not set 'present' on ensure: No such file or directory - /etc/fancycaptcha/words20161213-25436-1sbpxoc.lock at 8:/etc/puppet/modules/mediawiki/manifests/maintenance/generatecaptcha.pp [20:55:46] Reedy: [20:56:42] looking, it's the dependencies on the secret file [20:58:04] awight: ooh yeah, just filter on the same regex when doing hive queries [20:58:26] * AndyRussG hugs the sound of falling silos [20:59:30] (03PS4) 10Dzahn: add gerrit2001.mgmt for WMF6408.mgmt [dns] - 10https://gerrit.wikimedia.org/r/325596 (https://phabricator.wikimedia.org/T148186) [21:01:05] AndyRussG: onice. That should be made into a column in um webrequest[?] in fact [21:01:14] um raw_webrequest [21:01:32] awight: heheh yes better still :) [21:01:33] Hrm I don't see the is_bot column, I must have fabricated it [21:01:43] no it's there.. [21:02:17] agent_type string Categorise the agent making the webrequest as either user or spider [21:02:31] awight: ^ [21:02:40] * ostriches sets his agent_type to secret [21:02:55] * ostriches laughs at his own joke [21:03:02] lol [21:03:10] free? [21:03:49] (03Abandoned) 10Dzahn: add gerrit2001.mgmt for WMF6408.mgmt [dns] - 10https://gerrit.wikimedia.org/r/325596 (https://phabricator.wikimedia.org/T148186) (owner: 10Dzahn) [21:04:09] real estate? 
[21:04:31] * AndyRussG also self-provides laugh track [21:05:01] hehehe [21:05:05] So many types of agents [21:05:16] bblack: in any case, thanks so much for the timely assistance... really appreciated!!!!! :D [21:05:22] AndyRussG: np! [21:05:23] ostriches: repo man cover Hombre Secreto FTW! [21:05:41] (03PS1) 10RobH: setting up restbase-test 100[123] dns entries [dns] - 10https://gerrit.wikimedia.org/r/327051 [21:05:56] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2870529 (10GWicke) [21:06:53] Or [A-Z] [21:06:55] https://en.wikipedia.org/wiki/Agent_K [21:07:07] awight: Johnny Rivers makes funny faces while singing the original.... [21:07:13] mutante: You put them into the secret repo already, right? [21:07:16] Bites his lip a lot [21:08:05] (03CR) 10RobH: [C: 032] setting up restbase-test 100[123] dns entries [dns] - 10https://gerrit.wikimedia.org/r/327051 (owner: 10RobH) [21:09:02] (03CR) 10Dzahn: "I heard that using swift might be an option as well. I suggest we discuss on a general ticket about "storage engine used by phab" or so." 
[puppet] - 10https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: 10Paladox) [21:09:11] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870532 (10RobH) [21:09:25] Reedy: yea [21:09:33] got distracted, double chekcing now [21:11:02] modules/secret/secrets/fancycaptcha/words yes [21:12:09] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:14:10] (03PS1) 10Papaul: DHCP:Fix DHCP for ms-fe200[5-8] Had 1G NIC MAC and not 10G MAC Bug:T152627 [puppet] - 10https://gerrit.wikimedia.org/r/327055 (https://phabricator.wikimedia.org/T152627) [21:14:53] Reedy: we defined files in /etc/fancycaptcha/ but not that directory itself, fixing [21:15:08] mutante: so putting files in, doesn't make the dir appear? [21:15:12] no [21:15:29] you have to define the whole tree [21:15:29] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:15:40] that is recovered because i did "mkdir" [21:15:53] heh [21:18:12] (03PS1) 10Dzahn: mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 [21:19:55] (03CR) 10Dzahn: [C: 032] DHCP:Fix DHCP for ms-fe200[5-8] Had 1G NIC MAC and not 10G MAC Bug:T152627 [puppet] - 10https://gerrit.wikimedia.org/r/327055 (https://phabricator.wikimedia.org/T152627) (owner: 10Papaul) [21:20:16] (03PS2) 10Dzahn: mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 [21:20:58] !log twentyafterfour@tin Finished scap: mediawiki 1.29.0-wmf.6: build l10n and sync the branch to testwikis. 
refs T152563 (duration: 51m 44s) [21:21:01] (03PS3) 10Dzahn: mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 [21:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:09] T152563: MW-1.29.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T152563 [21:22:41] (03CR) 10Dzahn: "recheck yourself before you wreck yourself" [puppet] - 10https://gerrit.wikimedia.org/r/327057 (owner: 10Dzahn) [21:23:11] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/327057 (owner: 10Dzahn) [21:24:32] (03CR) 10Dzahn: [C: 032] mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 (owner: 10Dzahn) [21:24:42] (03PS4) 10Dzahn: mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 [21:24:58] (03CR) 10Dzahn: [V: 032 C: 032] mediawiki: ensure /etc/fancycaptcha/ on maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/327057 (owner: 10Dzahn) [21:36:11] (03PS1) 10Dzahn: mediawiki: disable 'generate captcha' maintenance job [puppet] - 10https://gerrit.wikimedia.org/r/327059 (https://phabricator.wikimedia.org/T150029) [21:37:04] Reedy: ^? [21:37:49] mutante: that'll leave the word lists in place, right? [21:39:13] Reedy: yes [21:39:15] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2772112 (10Dzahn) merged, has been created on terbium (not created on wasat as configured). follow-up https://gerrit.wikimedia.org/r/#/c/327057/ disable for now unt... [21:39:35] sweet [21:39:51] Reedy: no, it will not, heh [21:40:14] just change the cronjob line to hardcoded false? 
:P [21:40:15] we couple cron and files by using $ensure [21:40:17] for both [21:40:40] yeah [21:41:06] if we do that we just introduced 2 different places to enable crons again :p [21:41:29] let's do the other way [21:41:41] change the files to hardcoded present [21:42:15] or that :) [21:42:54] thcipriani: any idea what this might mean? https://phabricator.wikimedia.org/P4613 [21:43:50] (03PS3) 10Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - 10https://gerrit.wikimedia.org/r/326163 [21:44:03] (03PS2) 10Dzahn: mediawiki: disable 'generate captcha' maintenance job [puppet] - 10https://gerrit.wikimedia.org/r/327059 (https://phabricator.wikimedia.org/T150029) [21:44:38] twentyafterfour: All the Unable to find remote tracking branch/tag ? [21:44:40] twentyafterfour: that means it can't find the sha1 for the submodules. IIRC ostriches hit this last week? [21:44:42] Chad said to just ignore them [21:44:51] ok [21:45:00] the "disclosable head" for each of those submodules. [21:45:09] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 7 failures [21:45:09] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 7 failures [21:45:09] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 17 failures [21:45:39] submodules causing extra problems once again [21:46:07] uff all the check_puppetruns are because I rebooted their puppetmaster [21:46:11] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2870728 (10Pchelolo) 05Open>03Resolved We didn't hear about this problem for a while, let's assume separating it t... 
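[editor's note] The Puppet points raised above (files declared under /etc/fancycaptcha/ without the directory itself, and the cron job and word lists sharing one $ensure) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual contents of generatecaptcha.pp: the class name, the cron command, the schedule, the file modes, and the placeholder content are all hypothetical.

```puppet
# Hypothetical sketch only; names and values are illustrative and are not
# taken from the real mediawiki::maintenance manifests.
class example::generatecaptcha (
    $ensure = 'absent',   # toggles the cron job only
) {
    # The directory has to be declared explicitly: a file resource does not
    # create missing parent directories on its own.
    file { '/etc/fancycaptcha':
        ensure => directory,
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
    }

    # The word list is hardcoded to 'present', so disabling the cron job
    # via $ensure leaves the file in place.
    file { '/etc/fancycaptcha/words':
        ensure  => present,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "placeholder\n",  # assumption: real content comes from the secret repo
        require => File['/etc/fancycaptcha'],
    }

    # Only the cron job follows $ensure, so there is a single switch for
    # enabling or disabling the job.
    cron { 'generatecaptcha':
        ensure  => $ensure,
        user    => 'root',
        hour    => 1,
        minute  => 0,
        command => '/usr/local/bin/generate-captchas',  # hypothetical command
    }
}
```

With this layout, setting $ensure to 'absent' removes the cron entry while the directory and word list stay on disk, which matches the "change the files to hardcoded present" option chosen in the discussion.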
[21:46:33] (03PS4) 10Paladox: Gerrit: Fix gitweb (diffusion) file links [puppet] - 10https://gerrit.wikimedia.org/r/326163 (https://phabricator.wikimedia.org/T153130) [21:48:16] (03CR) 10Dzahn: [C: 032] mediawiki: disable 'generate captcha' maintenance job [puppet] - 10https://gerrit.wikimedia.org/r/327059 (https://phabricator.wikimedia.org/T150029) (owner: 10Dzahn) [21:48:21] twentyafterfour@tin:/srv/mediawiki$ tig [21:48:22] tig: Not a git repository [21:48:22] (03PS3) 10Dzahn: mediawiki: disable 'generate captcha' maintenance job [puppet] - 10https://gerrit.wikimedia.org/r/327059 (https://phabricator.wikimedia.org/T150029) [21:48:43] thcipriani: Isn't that supposed to be a git repo now or that hasn't gone to prod yet? [21:49:07] twentyafterfour: not in prod yet, only beta [21:49:43] what is "tig"? [21:49:53] just your alias? [21:50:09] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 162 seconds ago with 0 failures [21:50:09] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 197 seconds ago with 0 failures [21:50:09] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 17 failures [21:50:14] the "textmode interface to git" thing? 
[21:55:09] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:55:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [21:56:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4514979 keys, up 43 days 13 hours - replication_delay is 0 [21:58:39] !log otto@tin Starting deploy [eventstreams/deploy@e1cc638]: (no message) [21:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:00] !log otto@tin Finished deploy [eventstreams/deploy@e1cc638]: (no message) (duration: 00m 21s) [21:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:13] mutante: yeah that [22:05:37] it's my favorite thing. [22:10:47] !log otto@tin Starting deploy [eventstreams/deploy@db5d61f]: (no message) [22:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:36] !log otto@tin Finished deploy [eventstreams/deploy@db5d61f]: (no message) (duration: 00m 49s) [22:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:04] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#2870916 (10Jgreen) By way of scoping: My understanding is that we're using prometheus-node-exporter to on each host to collect local metrics and listen on a TCP port for the HTTP request from t... 
[22:24:00] !log OS install on ms-fe200[5-8] [22:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:07] (03PS1) 10Ottomata: Add rdkafka_config deployment var to eventstreams service module and role [puppet] - 10https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) [22:28:00] (03CR) 10jenkins-bot: [V: 04-1] Add rdkafka_config deployment var to eventstreams service module and role [puppet] - 10https://gerrit.wikimedia.org/r/327113 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [22:32:11] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870939 (10RobH) It seems that restbase-test1001 doesn't see all of its SSDs, as it only sees 3 of the 4. The other two hosts see all four fine. I'm creating a sub... [22:34:08] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test1001 fouth ssd not detected - https://phabricator.wikimedia.org/T153139#2870953 (10RobH) [22:37:42] (03PS1) 10RobH: restbase-test100[123] install_module updates [puppet] - 10https://gerrit.wikimedia.org/r/327118 [22:38:27] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2870987 (10RobH) [22:38:38] (03CR) 10jenkins-bot: [V: 04] restbase-test100[123] install_module updates [puppet] - 10https://gerrit.wikimedia.org/r/327118 (owner: 10RobH) [22:38:47] (03PS2) 10RobH: restbase-test100[123] install_module updates [puppet] - 10https://gerrit.wikimedia.org/r/327118 [22:39:28] (03CR) 10RobH: [C: 032] restbase-test100[123] install_module updates [puppet] - 10https://gerrit.wikimedia.org/r/327118 (owner: 10RobH) [22:39:38] (03CR) 10jenkins-bot: [V: 04] restbase-test100[123] install_module updates [puppet] - 10https://gerrit.wikimedia.org/r/327118 (owner: 10RobH) [22:39:57] so now gerrit pings when it finishes a successful verificatio 
as well [22:39:59] that seems new? [22:40:17] it used to only do that on failure. I dont dislike it, but it caught me by surprise since I assumed v:ping = failure, heh [22:40:34] paladox, ^ [22:40:38] helps when zuul gets lagged down rather than just watching the queue, heh [22:40:56] did you revert my commit to stop the gerrit bot doing stuff like that? [22:41:22] ahh, so it was unexpected then, ok. (i dont care either way!) [22:42:08] it might be that my patch didn't fix things fully [22:42:23] but there were some problems earlier so I told paladox he could revert [22:42:34] I just had that moment of 'oh shit i got verfification ping i must have fucked up my patchset' [22:42:34] Oh [22:42:36] heh [22:42:43] lololol [22:42:56] sorry, i know people doint like the new v: 0 and c: 0 [22:43:02] so i had to follow up a patch [22:43:16] that broke commenting on a merge patch (sending that comment to irc). [22:43:18] oh, I don't want my comment as complaint! I don't have any particularly strong feelings about it. [22:43:31] I just noticed change so was commenting =] [22:43:31] ok [22:43:41] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2870998 (10Dzahn) Ok, puppet code merged and done for now. We have the desired situation now, which is: - both terbium and wasat have the word files in /etc/fancyca... [22:43:50] LOL jenkins voted V: [22:44:50] ""Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a by-gone vexation, stands vivified and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently [22:44:56] vicious and voracious violation of volition. 
" [22:45:59] Can someone tell jenkins bot V: isn't a valid option but don't break its heart [22:46:18] mutante: go home Shakespeare, you're drunk xD [22:46:33] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test1001 fourth ssd not detected - https://phabricator.wikimedia.org/T153139#2871002 (10Eevans) [22:47:03] :p [22:47:05] you know what would be a nice feature of phabricator? "remind me on 2017-01-01" [22:47:17] and then the task disappears but you get notified on $date again [22:47:24] you could implement that with a bot [22:47:30] Oh it's probably https://gerrit-review.googlesource.com/#/c/71051/ [22:47:35] there's bots on Reddit which do that [22:47:49] since now it reports your code values even when merged. [22:47:51] yes, so true about Reddit bot [22:48:11] Krenair: or set up a cron that's set to fail on that date and have labs spam your inbox with 2million + emails lol [22:48:32] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.6: (no message) (duration: 07m 23s) [22:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:43] You will probably use up Gmail's quota if you use Gmail. [22:48:45] still a smaller volume than Maniphest [22:49:12] Labs spammed me, I still have 100% of my quota (prob cuz I'm a youtube partner maybe tho) [22:49:28] lol, /me has 1tb. [22:49:39] I enrgilish bery gud tday [22:50:28] [22:50:13] (CR) jenkins-bot: [] Added Scribunto to JsonConfig instead of Kartographer [integration/config] - https://gerrit.wikimedia.org/r/327119 (owner: Yurik) [22:50:32] That's broken too [22:50:46] nope [22:50:50] that's intentional [22:50:58] It looks stupid [22:51:01] yeh [22:51:05] better than the big red 0 [22:51:14] Is it really? [22:51:27] Why do we care if jenkins removed its V+1 as part of the merge process? 
[22:51:33] to say [V:] better to say nothing imho [22:51:43] that's ^^ a bug [22:51:48] Except it did say nothing [22:53:25] V: doesn't tell me if + or - [22:53:28] neither does a 0 [22:53:38] same :p [22:53:54] [] is even more useless [22:54:25] well blame this on https://gerrit-review.googlesource.com/#/c/71051/ [22:54:28] it used to say (merged) [22:54:56] merged is not the same action in all repos [22:55:19] Should I revert? [22:55:25] so we get the red 0 back [22:55:51] can't it just behave like before the upgrade? [22:55:57] No [22:56:03] not possible since https://gerrit-review.googlesource.com/#/c/71051/ [22:56:09] !log prometheus1003 was failing to pxe boot over and over (no dhcp entry). powered off. [22:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:05] paladox: abstain. it's the same before and after, it tells me that jenkins bot is done but not what the result was. no difference [22:57:25] what i really want is of course green or red [22:57:47] knowing that it is done has _some_ value at least [22:58:02] It's strange c: 2 works [22:58:05] but v: 2 won't [22:59:07] and once we have green/red +/- again then also add an action "on submit" [22:59:18] that would be better than it was before upgrade [23:02:39] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 2 minutes ago with 19 failures. Failed resources (up to 3 shown): Package[ncdu],Package[dstat],Package[strace],Package[pv] [23:05:45] Ah I figured out a temp fix, it will do v: but won't show v: 0, which is better? [23:05:53] Reedy Krenair mutante ^^ [23:06:08] (CR) Paladox: [V: ] "test" [extensions/test] - https://gerrit.git.wmflabs.org/r/70 (owner: Paladox) [23:06:15] ^^ that's how it will look [23:06:29] paladox: I don't see a difference [23:06:38] (CR) Paladox: [V: ] "test" [extensions/test] - https://gerrit.git.wmflabs.org/r/70 (owner: Paladox) [23:06:44] (CR) Paladox: [V: 2] Insert the description of the change. 
[extensions/test] - https://gerrit.git.wmflabs.org/r/70 (owner: Paladox) [23:08:02] paladox: well.. either it tells me the result or it doesn't and is just another way to say "done, go check in browser" [23:08:05] (03PS1) 10RobH: fixing restbase100[123] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/327126 [23:08:35] (03CR) 10RobH: [C: 032] fixing restbase100[123] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/327126 (owner: 10RobH) [23:09:40] (03CR) 10Dzahn: [C: 04-1] "[cobalt:~] $ apt-cache show filebeat" [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:10:20] (03CR) 10Paladox: [] "I thought this installs it?" [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:11:37] (03CR) 10Dzahn: [C: 04-1] "This shows that trying to install it would fail with "not found"." [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:12:33] (03CR) 10Dzahn: [C: 04-1] "here's another way to show it, apt-get install with -s for "simulated":" [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:12:41] (03CR) 10Paladox: [] "Oh." [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:13:30] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:14:08] (03CR) 10jenkins-bot: [V: 04] fixing restbase100[123] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/327126 (owner: 10RobH) [23:14:10] paladox: there's another random UI thing in Gerrit i noticed.. [23:14:15] (03CR) 10Paladox: [] "Oh, I guess it may be backported from https://www.elastic.co/products/beats/filebeat ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [23:14:19] Oh [23:14:19] ok, i like the pinging now [23:14:22] cuz that took forever for zuul [23:14:24] LOL [23:14:25] so its kinda nice ;D [23:14:45] heh, yea, that's what i meant by some value [23:14:52] at least you know when to check again [23:15:08] paladox: other random thing, let's say i voted -1 on something, like the one above [23:15:28] paladox: the red color is gone in web ui [23:15:34] yep, oh [23:15:36] yeh [23:15:37] so yeah, if we're voting, i like it. ;D [23:15:41] green is still there, red is not [23:15:42] you have to do it as -2 [23:16:00] mutante i have a fix here https://gerrit-review.googlesource.com/#/c/91471/ [23:16:01] but -1 should also have a color, orange or whatnot then [23:16:04] it used to be red [23:16:13] to get colours and images in the reivewer table [23:17:09] your patch does sound like exactly that :) ok, cool [23:17:10] i have a fix im going to deploy [23:17:15] thanks! [23:17:21] https://gerrit.wikimedia.org/r/#/c/327120/2 [23:17:24] yeh [23:17:27] all tested too [23:17:30] nice [23:17:39] mutante checkout https://gerrit-new.wmflabs.org/ [23:17:47] has my patches. [23:18:48] paladox: what is https://gerrit.wikimedia.org/r/#/c/326163/4/modules/gerrit/templates/gerrit.config.erb changing [23:19:20] stares at the diff [23:19:52] It fixes the path to a file [23:19:56] in gerrit's diff [23:20:04] See https://phabricator.wikimedia.org/T153130 [23:20:19] fix is deployed now. [23:21:10] paladox: oh, i tested on your labs instance, lighter red for -1 and darker red for -2? 
i like it [23:21:21] Yeh [23:21:27] + it has a cross and tick [23:25:02] (03PS5) 10Dzahn: Gerrit: Fix gitweb (diffusion) file links [puppet] - 10https://gerrit.wikimedia.org/r/326163 (https://phabricator.wikimedia.org/T153130) (owner: 10Paladox) [23:25:17] finally get the extra / there [23:25:22] yea, confirmed [23:25:29] yep :) [23:25:32] (03CR) 10Filippo Giunchedi: [] Add hhvm_exporter role and class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [23:29:30] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:30:56] (03CR) 10Dzahn: [V: 032] "erb template that is already verified" [puppet] - 10https://gerrit.wikimedia.org/r/326163 (https://phabricator.wikimedia.org/T153130) (owner: 10Paladox) [23:31:04] (03CR) 10Dzahn: [V: 032 C: 032] Gerrit: Fix gitweb (diffusion) file links [puppet] - 10https://gerrit.wikimedia.org/r/326163 (https://phabricator.wikimedia.org/T153130) (owner: 10Paladox) [23:31:15] Reedy mutante ^^ mutch better [23:31:52] paladox: did that work because i did it as opposed to jenkins-bot? [23:32:07] Nope [23:32:16] it worked because of the fix i did https://gerrit.wikimedia.org/r/#/c/327120/ [23:32:33] :) very nice [23:32:48] yep :) [23:34:09] puppet is disabled on cobalt [23:34:26] so i merged that but it will not get applied yet [23:34:30] Oh. [23:35:12] which is not super nice but that fix is good [23:35:34] mutante is there a way we can edit the file manuly as puppet will just overwrite it and re write the value we put in there? [23:35:57] yes, but i'd prefer to just re-enable puppet [23:36:02] ok [23:36:11] was there a reason why we disabled puppet? [23:36:12] thing is i dont know the reason [23:36:15] oh [23:36:40] paladox: it gives a reason ... P) [23:36:44] "Puppet is disabled. 
reason not specified" [23:36:51] lol [23:37:28] pretty sure that was during the version upgrade [23:37:35] better to double check with ostriches [23:37:38] yep. [23:38:43] paladox: meanwhile, what about "Gerrit: Enable config localUsernameToLowerCase" [23:38:54] i am so _not_ merging that .. but [23:39:01] what is the status there, i missed out [23:39:07] That will fix the uppercase username problems [23:39:21] i think ostriches plans to do that but not sure when. [23:39:28] ok [23:40:24] heh @ "do not merge without.. break all users"...yep, next [23:41:04] LOL [23:42:07] (03CR) 10Dzahn: [C: 04-2] "just putting a -2 per "Note: Do not merge this without demon / Chad +1 as this will break all users". Tell us when it's time for this." [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [23:42:30] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:42:38] sees the red X icon [23:42:52] (03CR) 10Paladox: [] "Ok thanks." [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [23:43:00] yep [23:43:36] paladox: back to "filebeat". use packages.debian.org to search for it? [23:44:43] Yep, dosent seem to be on there [23:45:18] i downloaded the deb from elastic so i presumed it was on debian. [23:45:55] (03CR) 10Dzahn: [C: 031] openstack: Fix puppet URLs in comments [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:47:00] paladox: sigh, i see their install is "curl .. dpkg -i" [23:47:16] not great [23:48:12] maybe needs ticket to get that on our own repo in "thirdparty" [23:50:15] oh [23:50:19] yep [23:50:31] (03CR) 10Andrew Bogott: [C: 031] "This is fine with me, although I'm tempted to say that we should just abolish those 'this file came from...' 
links since they tend to alwa" [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:51:56] (03CR) 10Dzahn: [C: 031] "+1 to Andrew's comments" [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:52:41] mutante ^^ two + [23:52:44] = +2 [23:52:46] lol [23:53:10] (03PS3) 10Dzahn: openstack: Fix puppet URLs in comments [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:53:45] (03CR) 10Andrew Bogott: [C: 031] "Note that I am probably the #1 producer of those comments, all of which I now regret" [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:53:50] paladox: +1 to the comment he added to his +1 :) [23:53:59] but also.. yes [23:53:59] lol [23:54:02] :) [23:54:38] (03CR) 10Dzahn: [C: 032] "comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/325479 (owner: 10Tim Landscheidt) [23:55:22] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327130 [23:55:24] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327130 (owner: 1020after4) [23:55:28] (03CR) 10jenkins-bot: [] group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327130 (owner: 1020after4) [23:56:21] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327130 (owner: 1020after4) [23:57:27] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.6 [23:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:39] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues