[00:01:39] thcipriani: we could postpone, but we could also just test in production. It is scoped to tin, so it won't affect users, and if it breaks I can revert or follow up with a fix-up. [00:02:27] 6operations, 7Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1573754 (10ori) 5Open>3Resolved I think Graphite-only is fine. [00:10:01] 6operations, 10Traffic, 5Patch-For-Review: Switch codfw caches to tier2, begin pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1573764 (10BBlack) codfw switch to tier2 is complete. I don't *think* there's any need to wipe caches down there, either. So we're probably ok... [00:12:36] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1573767 (10Chmarkine) [00:12:38] 6operations, 7HTTPS, 5Patch-For-Review: download.wiki[mp]edia.org are using an invalid certificate - https://phabricator.wikimedia.org/T107575#1573765 (10Chmarkine) 5Open>3Resolved Confirmed that this issue was fixed. [00:12:47] !log ori@tin Synchronized php-1.26wmf20/extensions/Scribunto/engines/LuaCommon/LuaCommon.php: 2586dd1c7c: Updated mediawiki/core Project: mediawiki/extensions/Scribunto (duration: 00m 13s) [00:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:32] (03PS2) 10Ori.livneh: Collection/OCG: Turn on plain text output format in Book Creator. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [00:13:38] (03CR) 10Ori.livneh: [C: 032] Collection/OCG: Turn on plain text output format in Book Creator. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [00:13:45] (03Merged) 10jenkins-bot: Collection/OCG: Turn on plain text output format in Book Creator. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [00:14:39] !log ori@tin Synchronized wmf-config/CommonSettings.php: I79ffa78fa: Collection/OCG: Turn on plain text output format in Book Creator (duration: 00m 12s) [00:14:40] Krenair: ^ [00:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:15:46] ori, thank you1 [00:15:46] ! [00:15:53] It works as expected [00:16:02] nice! [00:19:06] !log ori@tin Synchronized php-1.26wmf19/extensions/Scribunto/engines/LuaCommon/LuaCommon.php: (no message) (duration: 00m 14s) [00:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:28] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1573772 (10Dzahn) a:3Dzahn [00:20:36] there's something happening with lua [00:20:46] suddenly returns error [00:20:56] 3 mins ago it was working [00:21:12] (03PS1) 10Dzahn: mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) [00:23:18] it says module returned value of the type, should return table of exports (translated from cs, idk how's the msg in en exactly) [00:24:05] Danny_B, I was seeing that in Commons too [00:24:12] Script error: The module returned a value. 
It is supposed to return an export table [00:25:06] ori: please undo the last sync [00:25:12] it obviously broke stuff [00:25:22] 02:12:47 < logmsgbot> !log ori@tin Synchronized php-1.26wmf20/extensions/Scribunto/engines/LuaCommon/LuaCommon.php: 2586dd1c7c: Updated mediawiki/core Project: mediawiki/extensions/Scribunto (duration: 00m 13s) [00:25:26] (03PS2) 10Alex Monk: Redirect most noc.wikimedia.org/conf URLs to git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/224214 [00:25:27] ok [00:25:35] thx [00:26:33] !log 2586dd1c7c obviously broke many pages [00:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:39] !log ori@tin Synchronized php-1.26wmf19/extensions/Scribunto/engines/LuaCommon/LuaCommon.php: (no message) (duration: 00m 13s) [00:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:48] (03CR) 10Legoktm: "Why are we killing noc? It's useful." [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:26:53] (03PS3) 10Alex Monk: Redirect most noc.wikimedia.org/conf URLs to Diffusion [puppet] - 10https://gerrit.wikimedia.org/r/224214 [00:27:04] (03PS2) 10Dzahn: mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) [00:28:10] !log ori@tin Synchronized php-1.26wmf20/extensions/Scribunto/engines/LuaCommon/LuaCommon.php: (no message) (duration: 00m 13s) [00:28:13] Danny_B: better? [00:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:24] ori: let me check [00:28:54] nope [00:29:01] still getting the samne error [00:29:06] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: go through all directories in /var/lib/mailman and decide if migration is needed - https://phabricator.wikimedia.org/T109399#1573798 (10Dzahn) >>! 
In T109399#1561586, @JohnLewis wrote: > Below is my final compiled list of what should and shouldn't go... [00:29:12] were you changing anything else realted to lua? [00:29:31] (03CR) 10Alex Monk: "It's a manually-maintained (every time you remove/add/move a file you have to update the symlinks for noc) viewer of a subset of files in " [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:30:32] Danny_B: no. check one more time? [00:30:44] now it's off [00:30:49] (03CR) 10Dzahn: "note it still has -n for --dry-run" [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) (owner: 10Dzahn) [00:30:49] perhaps sync lag [00:30:53] thx [00:31:49] (03CR) 10Alex Monk: Redirect most noc.wikimedia.org/conf URLs to Diffusion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:33:54] (03CR) 10Legoktm: "So why not fix that? noc shows what's *live* and synced rather than what's in the git repos, which is not always the same thing." [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:35:06] (03CR) 10MaxSem: [C: 04-1] "Agree with Lego." [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:48:43] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v6, cp4015_v6 [00:50:42] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [00:54:19] (03CR) 10Alex Monk: "> So why not fix that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [00:58:11] (03PS3) 10Dzahn: mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) [01:00:23] (03PS4) 10Dzahn: mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) [01:03:39] (03PS5) 10Dzahn: mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) [01:06:10] (03PS3) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [01:06:31] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [01:15:25] (03CR) 10Dzahn: [C: 032] mailman: mini script to rsync lists [puppet] - 10https://gerrit.wikimedia.org/r/233888 (https://phabricator.wikimedia.org/T109399) (owner: 10Dzahn) [01:15:32] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [01:32:26] greg-g, my SWAT (which I added slightly late) did not get done. Can I do it now? [01:39:00] RoanKattouw said he is out, and it is a JS-only change, so I'm just going to do it. [01:41:12] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:50:45] !log mattflaschen@tin Synchronized php-1.26wmf19/extensions/Flow/: Sync Flow for reply fix (duration: 00m 15s) [01:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:00:01] !log kafka topic webrequest_upload has finished rebalancing across new brokers. 
starting move of last topic webrequest_text [02:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:05:33] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [02:21:52] PROBLEM - puppet last run on ms-fe2003 is CRITICAL puppet fail [02:26:58] (03PS1) 10Awight: Remove deprecated Fundraising thermometer config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233900 [02:34:28] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 29s) [02:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:52] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf19) at 2015-08-26 02:37:51+00:00 [02:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:26] greg-g: Evening! Just notifying you that I'm about to deploy a labs config change. [02:44:00] (03PS1) 10Awight: Metawiki doesn't have a mobile subdomain on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233901 [02:44:31] (03PS2) 10Awight: Metawiki doesn't have a mobile subdomain on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233901 [02:44:40] (03CR) 10Awight: [C: 032] Metawiki doesn't have a mobile subdomain on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233901 (owner: 10Awight) [02:44:46] (03Merged) 10jenkins-bot: Metawiki doesn't have a mobile subdomain on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233901 (owner: 10Awight) [02:47:42] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:59:17] awight, wouldn't the proper way of fixing this be to set the domain up instead? [03:03:39] Krenair: I'd prefer to do it that way, plus the workaround there didn't work... 
[03:04:20] (03PS1) 10Awight: Revert "Metawiki doesn't have a mobile subdomain on beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233902 [03:04:33] (03PS2) 10Awight: Revert "Metawiki doesn't have a mobile subdomain on beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233902 [03:04:42] (03CR) 10Awight: [C: 032] Revert "Metawiki doesn't have a mobile subdomain on beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233902 (owner: 10Awight) [03:04:48] (03Merged) 10jenkins-bot: Revert "Metawiki doesn't have a mobile subdomain on beta labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233902 (owner: 10Awight) [03:05:18] !log l10nupdate@tin Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 10m 45s) [03:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:13] !log awight@tin Synchronized wmf-config/InitialiseSettings-labs.php: Push labs config to keep in sync with master (duration: 00m 13s) [03:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:30] Krenair: can you point me in the rough direction of the config to do that? [03:07:51] (03PS1) 10Rush: phab: remove deprecated api call [puppet] - 10https://gerrit.wikimedia.org/r/233903 [03:08:00] I'm not quite sure how *.wikimedia.beta.wmflabs.org has been set up [03:08:19] I'm taking a peek at operations/dns [03:08:20] (03PS2) 10Rush: phab: remove deprecated api call [puppet] - 10https://gerrit.wikimedia.org/r/233903 [03:08:52] wmflabs? in there? [03:08:56] donno! 
[03:09:19] nope :) [03:09:33] (03CR) 10Rush: [C: 032] phab: remove deprecated api call [puppet] - 10https://gerrit.wikimedia.org/r/233903 (owner: 10Rush) [03:11:29] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf20) at 2015-08-26 03:11:29+00:00 [03:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:02] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaAddress&action=addhost&id=7&project=deployment-prep&region=eqiad shows some interesting dns domains [03:15:18] I think only cloudadmins can change those [03:16:11] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [03:16:54] Rats, I can't even see it [03:19:22] The records behind them are nonsensical [03:21:06] Thanks for taking a look! Nothing is jumping out at me, so I'm filing a bug for it: https://phabricator.wikimedia.org/T110273 [03:21:07] well, some of them [03:21:21] (03PS1) 10Rush: phab: remove stale elastic search code [puppet] - 10https://gerrit.wikimedia.org/r/233904 [03:22:01] This page was interesting, https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki there must be some magic in there which tweaks subdomains... [03:22:23] no [03:22:31] orly? [03:23:41] The domain you pass into that script is used in a couple of initial pages and the email notification about the creation of a new wiki [03:24:18] (03CR) 10Rush: [C: 032] phab: remove stale elastic search code [puppet] - 10https://gerrit.wikimedia.org/r/233904 (owner: 10Rush) [03:24:32] in production when we create a wiki, someone with root has to set up dns and in some cases apache before anything you do on the mediawiki side can be seen [03:25:10] I see... 
[03:25:54] awight, I read the ticket - if it's breaking something actually urgent, I'm okay with you working around it with the patch you had earlier [03:25:59] (03CR) 10Mxn: [C: 031] Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: 10Legoktm) [03:26:04] Well, I'm imagining there's some some manual Nova magic for the labs subdomains, like you were investigating a minute ago [03:26:25] Otherwise, I'd prefer that it gets set up properly and the necessary procedure documented [03:27:42] Krenair: unforunately, the workaround didn't work... It's urgent enough that I would have gone with the workaround, but alas it's not in the cards. [03:28:24] For some reason, setMobileUrl wasn't fooled by that MobileUrlTemplate patch [03:32:04] awight, Okay, progress [03:35:25] Strange business! [03:35:42] so instead of not resolving, we now get Domain not configured [03:35:44] smells like dragons... [03:36:53] oh dear [03:37:00] deployment-puppetmaster puppet repository is corrupt [03:38:06] try paying a small bribe, perhaps... [03:39:55] oh, right [03:40:06] RelEng's SAL basically says they know about this [03:40:22] specifically thcipriani [03:41:46] Gotta do family thing [03:41:52] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [03:41:53] Don't knock yourself out, and thanks again! 
[03:45:00] (03PS1) 10Rush: phab: persist current mpm_worker tweaks [puppet] - 10https://gerrit.wikimedia.org/r/233906 [03:46:16] I don't want to change anything there while it's in this state [03:46:30] (03CR) 10Rush: [C: 032] phab: persist current mpm_worker tweaks [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [03:47:18] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1574050 (10chasemp) [03:47:23] Krenair: I think the disk was corrupt? but yeah much todo about that today [03:54:08] I noticed that if you ctrl+c at the wrong moment while connecting to a server via ssh, it doesn't load your .bashrc [03:59:07] (03PS5) 10Rush: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [04:00:31] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1574058 (10chasemp) 5Open>3Resolved Well, let's see how long the levee holds :) [04:07:38] Krenair: fyi, you unblocked the stuff we're feeling all urgent about. Awesome work! [04:32:41] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0] [04:44:26] (03PS2) 10Ori.livneh: Enable ParsoidBatchAPI everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230708 (owner: 10Tim Starling) [04:45:11] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [04:45:25] (03CR) 10Ori.livneh: [C: 032] "The extension is in both production branches (1.26wmf19 and 1.26wmf20)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/230708 (owner: 10Tim Starling) [04:45:31] (03Merged) 10jenkins-bot: Enable ParsoidBatchAPI everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230708 (owner: 10Tim Starling) [04:47:15] !log ori@tin Synchronized wmf-config: I73721936: Enable ParsoidBatchAPI everywhere (duration: 00m 13s) [04:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:47:50] TimStarling: ^ [04:47:56] looks ok, https://en.wikipedia.org/w/api.php?action=help&modules=parsoid-batch [04:48:28] thanks [04:53:55] \o/ let the testing begin! [05:11:02] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:27:01] 10Ops-Access-Requests, 6operations, 10ContentTranslation-Deployments, 3LE-CX6-Sprint 3: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1574140 (10Arrbee) [05:44:11] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [05:52:32] PROBLEM - HHVM rendering on mw1239 is CRITICAL - Socket timeout after 10 seconds [05:53:02] PROBLEM - puppet last run on dataset1001 is CRITICAL Puppet has 1 failures [05:54:12] PROBLEM - Apache HTTP on mw1239 is CRITICAL - Socket timeout after 10 seconds [06:10:42] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [06:18:52] RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:22:46] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [06:31:33] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail [06:32:02] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on db2064 is CRITICAL Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:33:12] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 2 failures [06:33:12] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 2 failures [06:33:32] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 2 failures [06:34:21] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:51:09] 10Ops-Access-Requests, 6operations: Need access for smalyshev to hive queries on stat1002 - https://phabricator.wikimedia.org/T110217#1574413 (10Wwes) I will add my approval if required from the Ops side. 
[06:55:52] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:21] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on db2064 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:02] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:21] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:31] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:21] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:52] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:32] !log restarting mw1239 HHVM, which is unresponsive [07:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:02:01] PROBLEM - puppet last run on mw2024 is CRITICAL puppet fail [07:02:52] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.024 second response time [07:03:11] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 65958 bytes in 0.139 second response time [07:17:02] PROBLEM - puppet last run on analytics1015 is 
CRITICAL Puppet has 1 failures [07:19:33] RECOVERY - puppet last run on nembus is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:19:43] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1574464 (10hashar) @andrew seems any instance on labvirt1007 might have ended up being corrupted. deployment-puppetmaster suffers from the same issue that occurred on the Jenkins slaves: files wr... [07:25:36] (03CR) 10Giuseppe Lavagetto: [C: 031] base: Don't install command-not-found-data either [puppet] - 10https://gerrit.wikimedia.org/r/232867 (owner: 10Tim Landscheidt) [07:28:12] (03PS3) 10Giuseppe Lavagetto: deployment::server: re-puppetize nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/233751 (https://phabricator.wikimedia.org/T103198) [07:30:01] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:30:19] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::server: re-puppetize nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/233751 (https://phabricator.wikimedia.org/T103198) (owner: 10Giuseppe Lavagetto) [07:32:53] (03CR) 10Hashar: "This patch is no more applied on beta cluster puppetmaster (repo has been corrupted). 
See T110303" [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [07:33:50] 6operations, 5Patch-For-Review: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1574476 (10Joe) 5Open>3Resolved a:3Joe [07:38:13] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1574486 (10hashar) [07:41:02] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:53:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 26 07:53:31 UTC 2015 (duration 53m 30s) [07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:02:07] 6operations, 10MediaWiki-Sites, 10SEO, 5MW-1.26-release, 5Patch-For-Review: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1574501 (10Nemo_bis) > I think our efforts to provide canonical urls in the html and the 301 re... [08:08:57] (03PS6) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [08:10:04] (03CR) 10Filippo Giunchedi: [V: 032] "technically blocked on T87804 but no reason not to merge at this point" [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178845 (https://phabricator.wikimedia.org/T78135) (owner: 10Filippo Giunchedi) [08:10:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] debirf.conf: adjust options for this environment [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178846 (https://phabricator.wikimedia.org/T78135) (owner: 10Filippo Giunchedi) [08:10:34] spam abound! 
[08:10:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add hwraid repository and some utilities [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178847 (https://phabricator.wikimedia.org/T78135) (owner: 10Filippo Giunchedi) [08:11:07] (03Draft3) 10Filippo Giunchedi: add utility Makefile [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178848 (https://phabricator.wikimedia.org/T78135) [08:11:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add utility Makefile [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178848 (https://phabricator.wikimedia.org/T78135) (owner: 10Filippo Giunchedi) [08:11:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ignore build artifacts [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/178849 (https://phabricator.wikimedia.org/T78135) (owner: 10Filippo Giunchedi) [08:15:06] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for swift storage backends [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [08:16:28] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [08:17:32] (03PS2) 10Muehlenhoff: Add ferm rules for swift storage backends [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) [08:17:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for swift storage backends [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [08:17:48] !log disable puppet on ms-be/ms-fe in preparation for merging firewall changes [08:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:23:47] (03PS1) 10Muehlenhoff: Enable base::firewall for swift storage [puppet] - 10https://gerrit.wikimedia.org/r/233916 [08:26:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Enable base::firewall for swift storage [puppet] - 10https://gerrit.wikimedia.org/r/233916 (owner: 10Muehlenhoff) [08:28:28] 
6operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1574541 (10ArielGlenn) script is running on the snapshot hosts for second time, doing a partial run. [08:32:34] 6operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1574542 (10ArielGlenn) reworked the way jobs are handled when the prerequisite jobs are not complete. https://gerrit.wikimedia.org/r/#/c/233417/ previously the job would be marked... [08:36:13] 6operations, 10MediaWiki-Sites, 10SEO, 5MW-1.26-release, 5Patch-For-Review: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1574548 (10Seb35) Thanks for the analysis, I was suspecting such a behaviour and that makes sen... [08:39:12] 6operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1574550 (10ArielGlenn) 3NEW a:3ArielGlenn [08:39:41] 6operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1574559 (10ArielGlenn) [08:42:20] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:44:50] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1574563 (10Merl) I am a java expert. And i am using a libary which normally needs java 1.8. I rewrote some parts, so that i can use the libary with java 1.7 on labs. Problem is that this li... 
[08:46:49] (03PS1) 10Muehlenhoff: Swift storage backends also need to be accessed by the proxies [puppet] - 10https://gerrit.wikimedia.org/r/233920 [08:47:13] 6operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1574565 (10ArielGlenn) Worker scripts now support a new job "createdirs" which just creates the new dump directory, sets up the status and index.html files, and exists. This can now be used as the... [08:52:31] (03CR) 10Filippo Giunchedi: Swift storage backends also need to be accessed by the proxies (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/233920 (owner: 10Muehlenhoff) [09:01:10] (03PS2) 10Muehlenhoff: Swift storage backends also need to be accessed by the proxies [puppet] - 10https://gerrit.wikimedia.org/r/233920 [09:02:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Swift storage backends also need to be accessed by the proxies [puppet] - 10https://gerrit.wikimedia.org/r/233920 (owner: 10Muehlenhoff) [09:07:49] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [09:09:34] (03PS1) 10Filippo Giunchedi: hieradata: share host lists between swift roles [puppet] - 10https://gerrit.wikimedia.org/r/233922 [09:15:29] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/233922 (owner: 10Filippo Giunchedi) [09:16:38] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [09:19:02] (03CR) 10Filippo Giunchedi: "unfortunately that doesn't seem to behave like I expected it to behave, https://puppet-compiler.wmflabs.org/838/" [puppet] - 10https://gerrit.wikimedia.org/r/233922 (owner: 10Filippo Giunchedi) [09:25:47] thoughts on ^ ? how to share those two lists among roles [09:29:37] <_joe_> godog: is that a global list of swift hosts? 
[09:29:45] (03PS1) 10Muehlenhoff: remove base::firewall temporarily until the Hiera access is sorted out [puppet] - 10https://gerrit.wikimedia.org/r/233924 [09:29:48] <_joe_> godog: where would that list vary on our cluster? [09:30:25] _joe_: yeah it is global [09:30:26] <_joe_> I mean, should it be different for different hosts or is it common to all of our puppet nodes? [09:30:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] remove base::firewall temporarily until the Hiera access is sorted out [puppet] - 10https://gerrit.wikimedia.org/r/233924 (owner: 10Muehlenhoff) [09:30:34] it is the ferm ACL [09:30:42] <_joe_> ok so it should be something like what I'm going to show you [09:30:48] <_joe_> can I borrow your patch? [09:30:56] sure go ahead, thanks! [09:31:38] <_joe_> where are those lists used? just ferm? [09:31:59] yep [09:34:08] <_joe_> mh I'm not sure I like these lists there. We should probably use exported resources here [09:34:19] <_joe_> but I'm divagating [09:35:52] (03PS2) 10Giuseppe Lavagetto: hieradata: share host lists between swift roles [puppet] - 10https://gerrit.wikimedia.org/r/233922 (owner: 10Filippo Giunchedi) [09:36:57] <_joe_> I'm running the compiler now [09:40:20] <_joe_> godog: https://puppet-compiler.wmflabs.org/839/ [09:40:41] <_joe_> but this might not work given moritzm commit disabling base::firewall [09:40:45] <_joe_> I dunno [09:41:24] let's just flip it back on, puppet is disabled on the swift hosts ATM anyway [09:41:42] I don't think they should be affecting each other though, I mean the ferm rules will be generated anyways [09:41:49] anyhow, looks good! 
[09:42:39] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:42:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] hieradata: share host lists between swift roles [puppet] - 10https://gerrit.wikimedia.org/r/233922 (owner: 10Filippo Giunchedi) [09:43:00] (03PS3) 10Filippo Giunchedi: hieradata: share host lists between swift roles [puppet] - 10https://gerrit.wikimedia.org/r/233922 [09:43:08] (03CR) 10Filippo Giunchedi: [V: 032] hieradata: share host lists between swift roles [puppet] - 10https://gerrit.wikimedia.org/r/233922 (owner: 10Filippo Giunchedi) [09:43:16] <_joe_> sorry, back to twisted madness [09:43:23] _joe_: np, thanks for your help [09:44:33] (03PS1) 10Muehlenhoff: Re-enable base::firewall for swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/233926 [09:45:01] (03CR) 10Filippo Giunchedi: [C: 031] Re-enable base::firewall for swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/233926 (owner: 10Muehlenhoff) [09:45:49] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:46:01] (03PS2) 10Muehlenhoff: Re-enable base::firewall for swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/233926 [09:46:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Re-enable base::firewall for swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/233926 (owner: 10Muehlenhoff) [09:58:20] !log test-reboot ms-be2001 [09:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:14] 6operations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404#1574709 (10Joe) another small problem with what you propose: we have hosts whose FQDN is "wikimedia.org" but they're in eqiad/codfw/ulsfo/whatever, and they could make use of variables pertaining to their own site. I couldn't... 
[10:07:28] PROBLEM - puppet last run on ms-be2002 is CRITICAL: Timeout while attempting connection [10:08:21] <_joe_> godog: ^^ that you guys, right? [10:08:53] _joe_: yeah that's me [10:09:19] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:24] <_joe_> did my change work? [10:10:36] <_joe_> and btw, we ought to simplify things a bit there. [10:10:49] yeah it did work! [10:14:39] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [10:25:50] (03PS1) 10Filippo Giunchedi: Revert "Re-enable base::firewall for swift::storage" [puppet] - 10https://gerrit.wikimedia.org/r/233928 [10:26:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Re-enable base::firewall for swift::storage" [puppet] - 10https://gerrit.wikimedia.org/r/233928 (owner: 10Filippo Giunchedi) [10:33:39] !log reenable puppet on ms-fe/ms-be, base::firewall still not enabled [10:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:50] godog: hello! Do we have a Phabricator project for Graphite/Carbon/whatever? I would like to get rid of some erroneous metrics :-} [10:36:23] hashar: hey, sure go with "graphite" project [10:36:31] hashar: actually graphite and operations [10:36:42] * godog lunch & [10:37:46] 6operations, 7Graphite: Remove graphite metrics trees gerrit.fab2 and gerrit.fab3 - https://phabricator.wikimedia.org/T110312#1574768 (10hashar) 3NEW [10:37:51] done and marked low prio [10:37:52] thanks [10:40:42] 6operations, 7Database: Drop database table "optin_survey" from Wikimedia wikis - https://phabricator.wikimedia.org/T54934#1574776 (10jcrespo) a:5Springle>3jcrespo [10:42:06] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:47] hmm, anyone know what happened with scribunto last night ? 
[10:47:12] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Script_error [10:47:27] apparently we now have lots of pages cached with script errors [10:53:28] thedj: yes, bad code deploy which was fixed about 9-10 Horus ago. [10:56:10] thedj: https://gerrit.wikimedia.org/r/#/c/233891/ [10:56:16] I guss [11:01:35] Nemo_bis: thx ! [11:02:27] !log dropping optin_survey_old table on all wikis [11:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:04] (03CR) 10Alexandros Kosiaris: "This actually can be done better and more consistently. Please take a look at my inline comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [11:07:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Could this be moved into a corresponding role ? The idea is to have ferm included in roles so that the module itself can be reused without" [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [11:12:11] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1574856 (10akosiaris) ok thanks @mmodell, @chasemp. I worked just fine. @mholloway, you should be able to connect you (WMF) wiki account now so we can proceed with this [11:13:20] 6operations, 7Graphite: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574859 (10hashar) 3NEW [11:13:31] 6operations, 7Graphite: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574867 (10hashar) [11:17:14] 6operations, 7Graphite: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574869 (10hashar) Under Chromium, adding a panel graph throws a Javascript exception: ``` app.cff04fb1.js:9 TypeError: a.datasource.query is not a function at a.get_data (app.cff04fb1.j... 
[11:21:19] 6operations, 7Graphite: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574873 (10hashar) Looking at the dashboard Json, the graph has: `"datasource": null,`. Should probably be `"graphite"` instead, maybe we are missing a sensible default. [11:34:13] (03PS1) 10Hashar: grafana: graphite is the default datasource [puppet] - 10https://gerrit.wikimedia.org/r/233936 (https://phabricator.wikimedia.org/T110317) [11:34:58] 6operations, 7Graphite, 5Patch-For-Review: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574921 (10hashar) a:3hashar Making 'graphite' the default datasource in config.js might well solve the problem. We might want to live hack it on the Grafana server f... [11:34:58] (03CR) 10jenkins-bot: [V: 04-1] grafana: graphite is the default datasource [puppet] - 10https://gerrit.wikimedia.org/r/233936 (https://phabricator.wikimedia.org/T110317) (owner: 10Hashar) [11:36:42] (03PS2) 10Hashar: grafana: graphite is the default datasource [puppet] - 10https://gerrit.wikimedia.org/r/233936 (https://phabricator.wikimedia.org/T110317) [11:39:40] (03CR) 10Hashar: "@godog You probably want to live hack config.js to verify whether that fix it :-}" [puppet] - 10https://gerrit.wikimedia.org/r/233936 (https://phabricator.wikimedia.org/T110317) (owner: 10Hashar) [11:42:01] 6operations, 7Graphite, 5Patch-For-Review: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1574935 (10hashar) a:5hashar>3fgiunchedi [12:06:18] (03PS1) 10Alexandros Kosiaris: backups ferm: replace INTERNAL with ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/233941 [12:06:46] (03PS1) 10Cmjohnson: Adding dns entries for new ES servers [dns] - 10https://gerrit.wikimedia.org/r/233942 [12:07:07] (03CR) 10Alexandros Kosiaris: base::service_unit: ship systemd units in /lib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233626 (owner: 
10Alexandros Kosiaris) [12:08:27] (03PS3) 10Alexandros Kosiaris: base::service_unit: ship systemd units in /lib [puppet] - 10https://gerrit.wikimedia.org/r/233626 [12:11:40] (03PS4) 10Alexandros Kosiaris: base::service_unit: ship systemd units in /lib [puppet] - 10https://gerrit.wikimedia.org/r/233626 [12:11:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] base::service_unit: ship systemd units in /lib [puppet] - 10https://gerrit.wikimedia.org/r/233626 (owner: 10Alexandros Kosiaris) [12:12:59] (03CR) 10Muehlenhoff: [C: 031] backups ferm: replace INTERNAL with ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/233941 (owner: 10Alexandros Kosiaris) [12:16:27] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [12:24:16] (03PS2) 10Alexandros Kosiaris: backups ferm: replace INTERNAL with ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/233941 [12:24:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] backups ferm: replace INTERNAL with ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/233941 (owner: 10Alexandros Kosiaris) [12:34:03] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1575084 (10BBlack) Pinged both of those tasks. I'm pretty much out of patience with waiting for PHP to suddenly become a les... [12:35:47] Krenair: any fall out from the cron changes yesterday? [12:40:39] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:46:03] bblack: in regards to my https in mediawiki patches. 
IMO their ready, but getting people to review my patches can be really difficult [12:47:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [12:47:56] * bawolff honestly has been rather frustrated lately with the amount of effort it takes to get people to look at my patches :s [12:49:04] (03PS2) 10Alexandros Kosiaris: Enable mask/umask of tilerator and kartotherian services [puppet] - 10https://gerrit.wikimedia.org/r/233620 (https://phabricator.wikimedia.org/T106637) [12:49:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Enable mask/umask of tilerator and kartotherian services [puppet] - 10https://gerrit.wikimedia.org/r/233620 (https://phabricator.wikimedia.org/T106637) (owner: 10Alexandros Kosiaris) [12:51:14] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1575109 (10akosiaris) [12:51:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1575110 (10Mholloway) Thanks everyone, and sorry about the hassle. My WMF MediaWiki account is now linked. [12:52:04] <_joe_> ottomata: ping [12:52:57] * YuviPanda hugs bawolff [12:53:08] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [12:54:33] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1575122 (10akosiaris) Updated the actionables. tilerator/kartotherian can now be disabled/enabled via ``` sudo systemctl mask kartotherian.service ``... 
[12:54:36] !log git synced kartotherian [12:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:38] (03PS1) 10ArielGlenn: dumps: allow user to specify how long dumper sleeps between wikis [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233953 [12:59:39] (03PS1) 10ArielGlenn: dumps: for "createdirs" job don't touch latest symlinks or rss feeds [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233954 [13:00:17] (03PS1) 10Alexandros Kosiaris: Allow maps-admins to enable/disable cassandra/postgresql [puppet] - 10https://gerrit.wikimedia.org/r/233955 (https://phabricator.wikimedia.org/T106637) [13:01:04] 6operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1575131 (10ArielGlenn) https://gerrit.wikimedia.org/r/233953 and https://gerrit.wikimedia.org/r/233954 for preservation of the md5sum files from previous run and sleep time as a user-definable opt... [13:03:30] (03CR) 10Alexandros Kosiaris: [C: 032] Allow maps-admins to enable/disable cassandra/postgresql [puppet] - 10https://gerrit.wikimedia.org/r/233955 (https://phabricator.wikimedia.org/T106637) (owner: 10Alexandros Kosiaris) [13:04:02] (03PS1) 10Alexandros Kosiaris: Replace mholloway's public ssh key [puppet] - 10https://gerrit.wikimedia.org/r/233958 (https://phabricator.wikimedia.org/T110064) [13:04:24] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1575137 (10akosiaris) [13:04:40] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473834 (10akosiaris) Cassandra and postgresql done as well [13:04:43] YuviPanda, not aware of any issues. 
[13:05:41] ok [13:08:48] (03PS1) 10ArielGlenn: staged dumps: sleep only 5 seconds between wikis for starting job [puppet] - 10https://gerrit.wikimedia.org/r/233961 [13:09:52] (03PS1) 10Muehlenhoff: Disable connection tracking for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233962 [13:09:54] (03CR) 10ArielGlenn: "dependent on https://gerrit.wikimedia.org/r/#/c/233953/ to be merged and deployed first" [puppet] - 10https://gerrit.wikimedia.org/r/233961 (owner: 10ArielGlenn) [13:10:42] 6operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1575169 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/233961/ for actual modification of the sleep time in the stages, no need to get it out for this run but when we do the next full run it sh... [13:11:53] !log restbase deploying 1dfba85 [13:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:27] godog: hey! In Grafana, I could not edit a new graph panel because its apparently created with no datasource causing some JS issue [13:17:53] godog: I might have a trivial fix for it :-] Though I am not sure whether it is only me or a general issue. Anyway some details at https://phabricator.wikimedia.org/T110317 [13:18:59] hashar: i have had that problem for a while too [13:19:12] i make new graphs by duplicating and editing old ones :/ [13:19:16] I am wondering how the other dashboards have been created though [13:19:19] ahah [13:19:39] I can confirm a new graph panel has datasource: null, [13:19:47] when proper graphs have datasource: "graphite", [13:19:55] so maybe setting graphite as the default datasource would solve it [13:20:05] hashar: thanks for looking into that! 
feel free to poke at it more if you wish, I'm not able to support grafana too ATM [13:21:06] godog: I can imagine [13:21:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 18.18% of data above the critical threshold [500.0] [13:21:29] hashar: I might get to it, but just to set expectations :D [13:21:29] would you be able to waste a few minutes to try the config change as a live hack ? that might well just fix the problem ™ [13:22:22] one day we will have a team to supports all those logstash / kibana / grafana / graphite / icinga / shinken stuff [13:22:39] 6operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1575184 (10ArielGlenn) almost done with stubs for small wikis and all workers still going. need to check again after tables run [13:23:02] 6operations, 7Database: Drop database table "optin_survey" from Wikimedia wikis - https://phabricator.wikimedia.org/T54934#1575185 (10jcrespo) 5Open>3Resolved I've deleted the `optin_survey_old` table from all production wikis. [13:28:53] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1575198 (10hashar) 5Resolved>3declined We ended up NOT building Debian package ruby-jsduck for Jessie. So adjust status... [13:30:14] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1575202 (10jcrespo) I've closed T54934 by deleting the table `optin_survey_old` from all wikis (also, after performing a backup). These are the *old tables **still on production**: ``` angwikiso...
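Aside on the Grafana datasource discussion above: new panels were reported to carry `"datasource": null` while working panels carry `"datasource": "graphite"`. The fix proposed in the channel is to make graphite the default datasource in config.js. The sketch below is a different, purely illustrative workaround that patches an exported dashboard instead. The `rows` -> `panels` layout and the function name `fill_missing_datasource` are assumptions, not taken from the log.

```python
import json


def fill_missing_datasource(dashboard, default="graphite"):
    """Set `default` on every panel whose datasource is missing or null.

    Returns the number of panels changed. The rows -> panels layout is an
    assumption about the dashboard JSON, not something stated in the log.
    """
    changed = 0
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            if panel.get("datasource") is None:
                panel["datasource"] = default
                changed += 1
    return changed


# Hypothetical dashboard export with one broken and one working panel.
sample = json.loads(
    '{"rows": [{"panels": ['
    '{"title": "new", "datasource": null}, '
    '{"title": "old", "datasource": "graphite"}]}]}'
)
print(fill_missing_datasource(sample))
```

Running it on the sample changes only the panel whose datasource was null; panels that already name a datasource are left alone.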
[13:31:17] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:33:46] 6operations, 7Graphite: Remove graphite metrics trees gerrit.fab2 and gerrit.fab3 - https://phabricator.wikimedia.org/T110312#1575215 (10Krenair) I've tried making similar requests before in T104091, no luck yet [13:34:39] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1575217 (10MZMcBride) I'd prefer that we kill them all. [13:35:19] 6operations, 10Continuous-Integration-Infrastructure: Update RDiscount gem/package on jenkins build servers (UbuntuTrusty) - https://phabricator.wikimedia.org/T109005#1575219 (10hashar) We wanted to get a Debian package for JSDuck on Jessie (T95008). The packaging works involved ends up being rather painful s... [13:36:23] 6operations, 10Continuous-Integration-Infrastructure: Update RDiscount gem/package on jenkins build servers (UbuntuTrusty) - https://phabricator.wikimedia.org/T109005#1575225 (10hashar) @cscott mind if we change this task to #continuous-integration-config and retitle it to switch CI from jsduck deb package to... [13:37:34] 6operations, 10Beta-Cluster, 7Graphite, 7Shinken: Delete specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T104091#1575227 (10hashar) [13:37:42] T54932#1575217 tl;tr "Burn! Burn them all!" [13:38:31] (03PS2) 10Alexandros Kosiaris: Replace mholloway's public ssh key [puppet] - 10https://gerrit.wikimedia.org/r/233958 (https://phabricator.wikimedia.org/T110064) [13:38:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Replace mholloway's public ssh key [puppet] - 10https://gerrit.wikimedia.org/r/233958 (https://phabricator.wikimedia.org/T110064) (owner: 10Alexandros Kosiaris) [13:38:49] 6operations, 7Graphite: Remove graphite metrics trees gerrit.fab2 and gerrit.fab3 - https://phabricator.wikimedia.org/T110312#1575232 (10hashar) >>! 
In T110312#1575215, @Krenair wrote: > I've tried making similar requests before in T104091, no luck yet @fgiunchedi seems to be the primary Graphite point of con... [13:40:10] (03PS3) 10BBlack: HTTPS: Break insecure POST with 403 [puppet] - 10https://gerrit.wikimedia.org/r/221974 (https://phabricator.wikimedia.org/T105794) [13:43:31] (03PS1) 10Alexandros Kosiaris: backups: Offsite/Migrate Jobs should refer to the production pool [puppet] - 10https://gerrit.wikimedia.org/r/233967 [13:43:44] (03PS4) 10Andrew Bogott: openstack firewall: get designate host from hiera [puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [13:44:12] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1575249 (10akosiaris) 5Open>3Resolved a:3akosiaris And done. Resolving! thanks [13:44:36] (03CR) 10jenkins-bot: [V: 04-1] openstack firewall: get designate host from hiera [puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [13:44:54] (03PS2) 10Alexandros Kosiaris: Add ip6 address on helium [puppet] - 10https://gerrit.wikimedia.org/r/233584 [13:45:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add ip6 address on helium [puppet] - 10https://gerrit.wikimedia.org/r/233584 (owner: 10Alexandros Kosiaris) [13:45:33] (03CR) 10Alexandros Kosiaris: [C: 032] backups: Offsite/Migrate Jobs should refer to the production pool [puppet] - 10https://gerrit.wikimedia.org/r/233967 (owner: 10Alexandros Kosiaris) [13:45:37] (03PS2) 10Alexandros Kosiaris: backups: Offsite/Migrate Jobs should refer to the production pool [puppet] - 10https://gerrit.wikimedia.org/r/233967 [13:45:41] (03CR) 10Alexandros Kosiaris: [V: 032] backups: Offsite/Migrate Jobs should refer to the production pool [puppet] - 10https://gerrit.wikimedia.org/r/233967 (owner: 10Alexandros Kosiaris) [13:46:18] (03PS5) 10Andrew Bogott: openstack firewall: get designate host from hiera [puppet] - 10https://gerrit.wikimedia.org/r/201880 
(owner: 10Dzahn) [13:47:28] !log rebooting/reimaging labnet1001 [13:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:41] (03CR) 10Andrew Bogott: [C: 032] openstack firewall: get designate host from hiera [puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [13:49:14] (03PS6) 10Andrew Bogott: Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 [13:50:58] bblack: Hello, _joe_ sent me to you with a question about varnishes. Do you have a spare minute? [13:51:56] 6operations, 10RESTBase, 6Services, 10service-template-node, 7Monitoring: [Discussion] Consider validating JSON schemas when running x-ample tests? - https://phabricator.wikimedia.org/T110240#1575292 (10mobrovac) p:5Triage>3Normal [13:52:22] <_joe_> bblack: don't shoot the messenger! [13:52:39] (03PS1) 10Alexandros Kosiaris: backup: Add IPv6 address of director in firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/233971 [13:52:42] (03CR) 10Andrew Bogott: [C: 032] Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 (owner: 10Andrew Bogott) [13:52:45] <_joe_> bblack: Pchelolo is implementing purging in Restbase, to give a little context [13:53:22] (03CR) 10jenkins-bot: [V: 04-1] backup: Add IPv6 address of director in firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/233971 (owner: 10Alexandros Kosiaris) [13:53:47] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1575306 (10BBlack) >>! In T105794#1574563, @Merl wrote: > I don't understand why i should investigate so much work although i have a currently working solution and one day when labs changes... [13:54:36] Pchelolo: ? 
[13:55:09] (03PS2) 10Alexandros Kosiaris: backup: Add IPv6 address of director in firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/233971 [13:55:36] basically, I need to know what's used as cache keys in our varnises. Is that a relative uri, absolute uri, is Host header involved etc. Could you please give me a little context on this? [13:58:08] Pchelolo: the PURGE request should be basically identical to the matching GET request you want to purge the content of [13:58:38] (03PS2) 10Andrew Bogott: Add labnet1002 hiera host file [puppet] - 10https://gerrit.wikimedia.org/r/233854 [13:58:41] hmm wait that's not the answer you're looking for, you're sending HTCP, not PURGE [13:58:47] (03PS2) 10Andrew Bogott: Switch labs controller to openstack juno [puppet] - 10https://gerrit.wikimedia.org/r/233855 [13:58:49] (03PS1) 10Alex Monk: Add affcom wiki domain to apache config [puppet] - 10https://gerrit.wikimedia.org/r/233972 (https://phabricator.wikimedia.org/T41482) [13:59:53] bblack: In HTCP I can only supply a 'url', so the question is if it should contain the host part (e.g. 
http://rest.wikipedia.org/) [14:00:08] I think the answer is yes [14:00:14] but I'm looking at the code to be sure [14:01:10] could you please point me where the code is, I'll look too [14:01:15] (03CR) 10DCausse: [C: 031] Disable connection tracking for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233962 (owner: 10Muehlenhoff) [14:01:53] (03PS3) 10Alexandros Kosiaris: backup: Add IPv6 address of director in firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/233971 [14:02:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] backup: Add IPv6 address of director in firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/233971 (owner: 10Alexandros Kosiaris) [14:02:37] Pchelolo: the code is here: https://github.com/wikimedia/operations-software-varnish-vhtcpd/tree/master/src [14:02:48] basically, yes, send it like a proxy URL [14:03:13] (03PS1) 10Muehlenhoff: Enable ferm on elastic1023 first [puppet] - 10https://gerrit.wikimedia.org/r/233973 [14:03:18] bblack: in the case of RB, if I'm understanding correctly, you are saying we just need to send PURGE https://{domain}/api/rest_v1/page/html/{title} ? [14:03:30] (03Abandoned) 10Matanya: access: update key for Mholloway [puppet] - 10https://gerrit.wikimedia.org/r/233582 (owner: 10Matanya) [14:04:18] mobrovac: well, that's a different question, and I don't know the answer to that... [14:04:42] euh? how so? [14:04:43] it would not be a question if RB wouldn't require varnish to transform URLs for it, and would handle Host: on its own :P [14:04:59] heh [14:05:23] but that's exactly what i'm asking - should the original or the transformed URI be used? 
[14:05:43] the service should be fixed to not require the transform, so that we never face this confusion again [14:06:19] not sure i understand what you mean - we kind of can't put restbase to answer on en.wp.org/ reqs [14:06:37] (03PS2) 10Muehlenhoff: Disable connection tracking for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233962 [14:06:42] s/on/on all/ [14:06:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Disable connection tracking for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233962 (owner: 10Muehlenhoff) [14:09:09] mobrovac: what I mean is the original inbound request to 'https://en.wikipedia.org/api/rest_v1/...' already contains all the information restbase needs. The requirement that varnish transform this to 'http://host-does-not-matter/en.wikipedia.org/v1/...' before handing it to the RB service isn't a caching requirement [14:09:23] it's just doing a transform that restbase could do for itself by looking at the Host header it was already passed [14:09:58] (03CR) 10Rush: "totally, thanks for hitting me up. this was a late night fear that I had forgotten to get this out of my head a few times and my brain fr" [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [14:10:10] and then the URLs would be consistent at all layers of our architecture, instead of having one canoical URL for the request for the outside world and a different one internally. 
[14:10:58] bblack: ah ok, got it [14:11:02] in any case, lacking that fix, I think you have to purge both variants, using the correct domain for both [14:11:09] (03PS1) 10Dzahn: mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 [14:11:16] (03CR) 10jenkins-bot: [V: 04-1] mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 (owner: 10Dzahn) [14:11:27] because the frontend layer probably caches it with the external URL, and the backends probably cache it as https://{domain}/{domain}/v1/... [14:11:39] (03PS2) 10Dzahn: mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 [14:12:02] bblack: i like the amount of speculation here :D [14:12:08] bblack: thnx for the clarifications [14:12:11] bblack, mobrovac: IIRC backend caching is explicitly disabled currently [14:12:13] :) [14:12:29] (03PS2) 10Rush: Enable ferm on elastic1023 first [puppet] - 10https://gerrit.wikimedia.org/r/233973 (owner: 10Muehlenhoff) [14:12:32] (03PS3) 10Dzahn: mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 [14:12:34] it really shouldn't be, though, once we're contemplating caching at all [14:12:41] (03CR) 10Rush: [C: 031] "as reasonable a canary as any :)" [puppet] - 10https://gerrit.wikimedia.org/r/233973 (owner: 10Muehlenhoff) [14:12:51] bblack: ok, great, thank you [14:13:16] gwicke: bblack: k, but even so, there are still two variants that can hit the front-end [14:13:29] and so need to be covered [14:13:34] are there? 
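Aside on the purge discussion above: the suggestion is to purge both the external URL and the transformed one, since the frontend probably caches under `https://{domain}/api/rest_v1/...` and the backends probably under `https://{domain}/{domain}/v1/...`. A minimal sketch of deriving both variants follows. The backend form is the speculation voiced in the channel, not a confirmed cache key, and the function name `purge_variants` is hypothetical.

```python
from urllib.parse import urlsplit

# Public REST prefix as quoted in the discussion.
EXTERNAL_PREFIX = "/api/rest_v1/"


def purge_variants(external_url):
    """Return [external URL, assumed backend URL] for one REST resource."""
    parts = urlsplit(external_url)
    if not parts.path.startswith(EXTERNAL_PREFIX):
        raise ValueError("not a rest_v1 URL: " + external_url)
    rest = parts.path[len(EXTERNAL_PREFIX):]
    domain = parts.netloc
    # Assumed transform: /api/rest_v1/<rest> becomes /<domain>/v1/<rest>.
    internal = "{}://{}/{}/v1/{}".format(parts.scheme, domain, domain, rest)
    return [external_url, internal]


for url in purge_variants("https://en.wikipedia.org/api/rest_v1/page/html/Foo"):
    print(url)
```

Each returned URL would then be sent as the `url` of its own HTCP purge, host part included, as described above.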
[14:13:43] * gwicke is curious too [14:13:43] actually, there are 3 [14:14:28] (a) //rest.wm.org/{domain}/v1 ; (b) //restbase.wm.org/{domain}/v1 ; (c) //{domain}/api/rest_v1/ [14:14:46] all of these are valid external reqs [14:14:51] a and b don't matter really [14:15:22] gwicke: you seem to keep forgetting consistency is high on my list :) [14:15:28] (03PS1) 10Alexandros Kosiaris: admin: remove duplicate line [puppet] - 10https://gerrit.wikimedia.org/r/233975 [14:15:35] we should just redirect those asap [14:15:44] since we are offering those entry points as well, we should keep them correct [14:15:48] well more importantly, (a) and (b) don't pass through varnish either. [14:15:57] not those varnishes, yeah [14:16:02] well, yeah [14:16:07] ah yes right [14:16:09] but the other varnish is dying any day now, right? :P [14:16:11] damn, more problems [14:16:12] (03CR) 10Dzahn: phab: persist current mpm_worker tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [14:16:21] bblack: I hope so [14:16:22] bblack: amen to that! [14:16:45] really the more-important question is whether parsoid usage through the other varnish is dead yet [14:16:58] parsoidcache? 
iirc, yup [14:17:02] if it is, I could probably move the others to text working as-is without much trouble until their eventual demise [14:17:05] I haven't checked yet [14:17:31] (03PS4) 10Dzahn: mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 (https://phabricator.wikimedia.org/T108071) [14:17:32] parsoid is "special" though, and holds up decomming that cluster basically [14:18:00] (03CR) 10Dzahn: [C: 032] mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 (https://phabricator.wikimedia.org/T108071) (owner: 10Dzahn) [14:18:06] gwicke: afaik, all of the known clients have switched to RB [14:18:21] I haven't checked googlebot, for example [14:18:35] easy enough to tail the varnish logs to find out [14:18:57] all our own clients have definitely switched [14:19:16] was about to ask kindly to bblack to compile us a list of clients getting cache hits :P [14:20:03] (03PS2) 10Alexandros Kosiaris: admin: remove duplicate line [puppet] - 10https://gerrit.wikimedia.org/r/233975 [14:20:06] (03CR) 10Filippo Giunchedi: "minor things except for.. naming!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [14:20:19] I have a shell on cp10{45,58}, so can have a look as well [14:20:29] (03CR) 10Alexandros Kosiaris: [C: 032] admin: remove duplicate line [puppet] - 10https://gerrit.wikimedia.org/r/233975 (owner: 10Alexandros Kosiaris) [14:20:33] (03CR) 10Alexandros Kosiaris: [V: 032] admin: remove duplicate line [puppet] - 10https://gerrit.wikimedia.org/r/233975 (owner: 10Alexandros Kosiaris) [14:20:34] mobrovac: you could try logging in as well [14:21:18] can look into it on Friday [14:22:28] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1575384 (10Dzahn) sent 4397116660 bytes received 1973579 bytes 2133926.87 bytes/sec total size is 4389356412 speedup is 1.00 done rsyncing\n real 721m16.384... [14:22:35] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1575386 (10akosiaris) p:5Triage>3Unbreak! 
a:3akosiaris [14:22:50] (03PS5) 10Dzahn: mailman: adjust rsync script, exclude bounces [puppet] - 10https://gerrit.wikimedia.org/r/233974 (https://phabricator.wikimedia.org/T108071) [14:23:53] gwicke: nope, denied :( [14:29:06] (03PS2) 10Cmjohnson: Adding dns entries for new ES servers [dns] - 10https://gerrit.wikimedia.org/r/233942 [14:30:36] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for new ES servers [dns] - 10https://gerrit.wikimedia.org/r/233942 (owner: 10Cmjohnson) [14:32:57] (03PS3) 10Muehlenhoff: Enable ferm on elastic1023 first [puppet] - 10https://gerrit.wikimedia.org/r/233973 [14:33:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on elastic1023 first [puppet] - 10https://gerrit.wikimedia.org/r/233973 (owner: 10Muehlenhoff) [14:40:45] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1575467 (10Krenair) devwikiinternal.old, rel13testwiki.old, zh_cnwiki.old I'm not sure these ones should be deleted. These wikis are 'deleted' in the sense that MediaWiki no longer exposes them to... 
[14:44:03] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1575481 (10Dzahn) plus about another half hour to re-sync the heldmsg-* files (which used to be >500k but now are just over 100k [14:44:18] (03Abandoned) 10Chad: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [14:44:26] (03Abandoned) 10Chad: Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:45:00] (03Abandoned) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad) [14:45:05] (03Abandoned) 10Chad: Move beta's mediawiki-installation dsh group into hiera [puppet] - 10https://gerrit.wikimedia.org/r/206131 (owner: 10Chad) [14:45:27] 6operations, 6Services, 10Wikidata, 7service-deployment-requests: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1575485 (10daniel) 3NEW [14:51:45] 6operations, 6Services, 10Wikidata, 7service-deployment-requests: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1575550 (10mobrovac) #service-deployment-requests is a project used for deploying new services in production (cf . [its p... [14:51:54] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1575553 (10mobrovac) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150826T1500). Please do the needful. [15:00:04] aude: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:02:23] I'm available to SWAT this morning - aude ping? [15:04:46] !puppetswat [15:08:57] PROBLEM - Host nembus is DOWN: PING CRITICAL - Packet loss = 100% [15:09:38] (03PS1) 10Ottomata: Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 [15:09:43] akosiaris: ^ [15:10:20] (03CR) 10jenkins-bot: [V: 04-1] Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 (owner: 10Ottomata) [15:10:24] psh! [15:10:39] (03CR) 10BryanDavis: "I theory, yes this could all be done at a role level. This particular change follows the already committed changes from Ic1b73d4 and I92fc" [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [15:11:32] (03PS2) 10Ottomata: Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 [15:11:34] thcipriani: sorry, here [15:11:50] aude: np, okie doke, merging [15:11:52] (03PS3) 10Ottomata: Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 [15:12:07] thanks [15:14:40] 6operations, 7Graphite, 5Patch-For-Review: Grafana: singlestat / graph panels can not be edited - https://phabricator.wikimedia.org/T110317#1575653 (10hashar) @Ottomata encountered the same issue. [15:14:51] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1575657 (10BBlack) The patch at https://gerrit.wikimedia.org/r/#/c/221974/ has been updated to go straight to 403 when it's merged, as the 307 redirect doesn't buy us much really (zero in te... [15:14:57] RECOVERY - Host nembus is UPING OK - Packet loss = 0%, RTA = 51.92 ms [15:18:05] 6operations: re-seat power cord for Nembus - https://phabricator.wikimedia.org/T110202#1575684 (10Andrew) 5Open>3Resolved After pulling and re-plugging, nembus looks fine. 
[15:19:25] (03PS1) 10Cmjohnson: Adding dchpd entrires for es1011/12 and es1017/18 [puppet] - 10https://gerrit.wikimedia.org/r/233986 [15:20:18] (03CR) 10Cmjohnson: [C: 032] Adding dchpd entrires for es1011/12 and es1017/18 [puppet] - 10https://gerrit.wikimedia.org/r/233986 (owner: 10Cmjohnson) [15:26:20] (03PS1) 10Alexandros Kosiaris: Move smalyshev to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/233987 (https://phabricator.wikimedia.org/T110217) [15:26:27] aude: can these all go out with a sync-dir? [15:26:33] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Need access for smalyshev to hive queries on stat1002 - https://phabricator.wikimedia.org/T110217#1575706 (10akosiaris) @wwes no need. This was a misconfiguration on our part. https://gerrit.wikimedia.org/r/233987 fixes that [15:26:50] thcipriani: yes [15:27:39] 6operations, 10Continuous-Integration-Infrastructure: Update RDiscount gem/package on jenkins build servers (UbuntuTrusty) - https://phabricator.wikimedia.org/T109005#1575709 (10cscott) Go ahead! I was just guesing when I categorized it. [15:28:10] (03CR) 10Alexandros Kosiaris: [C: 032] Move smalyshev to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/233987 (https://phabricator.wikimedia.org/T110217) (owner: 10Alexandros Kosiaris) [15:28:34] !log thcipriani@tin Synchronized php-1.26wmf20/extensions/Wikidata: SWAT: Update Wikidata - wrap usage tracking batch updates in transaction [[gerrit:233970]] (duration: 00m 23s) [15:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:40] ^ aude check please [15:29:45] 6operations, 10Continuous-Integration-Config: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#1575715 (10cscott) [15:33:09] (03CR) 10Cscott: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [15:33:12] thcipriani: checking [15:33:51] looks ok [15:34:05] aude: cool, thanks! [15:35:08] (03PS1) 10Yuvipanda: k8s: Add reccomended admission controllers to apiserver [puppet] - 10https://gerrit.wikimedia.org/r/233988 [15:35:25] thanks thcipriani [15:37:52] (03PS2) 10Yuvipanda: k8s: Add reccomended admission controllers to apiserver [puppet] - 10https://gerrit.wikimedia.org/r/233988 [15:38:01] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Add reccomended admission controllers to apiserver [puppet] - 10https://gerrit.wikimedia.org/r/233988 (owner: 10Yuvipanda) [15:40:27] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1575788 (10fgiunchedi) the way it works now in https://gerrit.wikimedia.org/r/#/c/231512/ is that an instance named "default" will behave exactly... [15:41:41] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1575795 (10BBlack) Regarding the Dalvik, Kindle, and also the "Java/phoneme_advanced" (also mobile stuff), the POST requests seem to be hitting the [[ https://en.wikipedia.org/w/api.php?acti... [15:44:39] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1575820 (10matej_suchanek) [15:45:06] !log repool restbase1009 in pybal [15:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:01] godog: re: https://gerrit.wikimedia.org/r/#/c/200625/, I just need swift::storage now? [15:48:40] (03PS1) 10Chad: Whitespace tidy up. Git keeps ending up with a dirty copy on me [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/233991 [15:48:58] are the swatters still swatting? [15:49:05] that's aude? [15:49:11] swatters gonna swat swat swat... 
[15:50:48] cscott: I'm done swatting, just the wikidata thing this morning. [15:51:16] thcipriani: I was wondering about throwing https://gerrit.wikimedia.org/r/233439 on the morning SWAT pile. [15:51:21] but maybe it's best to wait [15:52:35] (03CR) 10Cscott: "Since we've got a +1, I'll schedule this for a SWAT tomorrow (Aug 27). Since there's a lot of other Parsoid/VE stuff going out today, I t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233439 (owner: 10Cscott) [15:53:06] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1575846 (10Legoktm) The `format` parameter applies to all API requests: https://en.wikipedia.org/w/api.php?action=help&modules=main [15:53:07] ostriches: err I was talking about the role class, so role::swift::storage ideally would Just Work in labs, so yeah, there might be still some tweaking required [15:53:29] Gotcha. Yeah I'll poke again and see. [15:53:55] !log ferm enabled on elastic1023 [15:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:38] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1575853 (10akosiaris) Yes, there has been indeed a block due to a user abusing the API. There has been an effort to report this and the hosting company never answered. In good faith, we... [15:56:18] (03PS8) 10Dzahn: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) [15:56:33] godog: While I have you trapped (err, yer attention). 
Whitespace only :p https://gerrit.wikimedia.org/r/#/c/233991/ [15:57:42] (03CR) 10Dzahn: [C: 032] add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [15:59:05] ostriches: ahah sure [15:59:31] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "*rubberstamp*" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/233991 (owner: 10Chad) [16:00:04] andrewbogott: Dear anthropoid, the time has come. Please deploy Labs OpenStack upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150826T1600). [16:00:34] godog: My clone kept adding the newline at the end and stripping the leading newline :p [16:01:01] ostriches: curious, git by itself? I guess it is fine to be picked up with the next submodule update? [16:01:25] Yeah I think git itself...I might have a misconfigured line-ending setting or somesuch. [16:01:34] And yeah, it's trivial. Can just go whenever. [16:02:09] quick! when you install something with dpkg or apt-get do you expect the service you install to start automatically? do you expect it to register itself to start on restart? I know the default for rpms for both of these is "no". is there a standard for debs? [16:03:00] manybubbles: Expect? Perhaps. There's wild inconsistency there though. I can't speak to standard tho. [16:03:22] thanks ostriches! [16:03:31] anytime :) [16:03:35] manybubbles: my understanding is start on install is default [16:03:47] I've seen a post where rackspace goes around this also... [16:03:50] https://major.io/2014/06/26/install-debian-packages-without-starting-daemons/ [16:04:23] thanks! [16:04:29] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1575878 (10Ironholds_backup) As the person who initially started the conversation about the block and reported this ISP for hosting abusers, I would very much like to be in on any conver... 
[16:04:52] (03PS4) 10Dzahn: add AAAA record for antimony [dns] - 10https://gerrit.wikimedia.org/r/214504 (https://phabricator.wikimedia.org/T37540) [16:07:51] 6operations, 5Patch-For-Review, 7Swift: swift eqiad capacity planning - https://phabricator.wikimedia.org/T1268#1575879 (10fgiunchedi) [[ https://graphite.wikimedia.org/render/?width=714&height=339&_salt=1415878787.645&from=-4months&target=alias(secondYAxis(swift.eqiad-prod.stats.AUTH_mw.bytes)%2C%22media%20... [16:08:38] (03CR) 10Dzahn: [C: 032] add AAAA record for antimony [dns] - 10https://gerrit.wikimedia.org/r/214504 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [16:09:00] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1575884 (10Dzahn) [16:09:01] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1575885 (10Dzahn) [16:09:03] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1575883 (10Dzahn) 5Open>3Resolved [16:11:06] (03PS3) 10Andrew Bogott: Add labnet1002 hiera host file [puppet] - 10https://gerrit.wikimedia.org/r/233854 [16:11:14] (03PS3) 10Andrew Bogott: Switch labs controller to openstack juno [puppet] - 10https://gerrit.wikimedia.org/r/233855 [16:11:40] !log starting labs openstack update to Juno [16:11:41] !log backing up labs openstack databases into /home/andrew/openstackdbbackups on db1009 [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:54] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1575891 (10Ireas) Thanks. 
I informed our contact at the ISP and asked him to contact you to solve this issue. [16:11:59] (03PS2) 10Dzahn: Kibana: Fix vhost name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/233838 (owner: 10BryanDavis) [16:12:17] (03CR) 10Dzahn: [C: 032] Kibana: Fix vhost name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/233838 (owner: 10BryanDavis) [16:14:22] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1575904 (10Ironholds_backup) Excellent; thank you! [16:14:46] andrewbogott: I'm around if you need anything for then ext hour or so btw [16:15:06] YuviPanda: thanks. With any luck this will go smoothly... [16:15:18] (03PS4) 10Andrew Bogott: Add labnet1002 hiera host file [puppet] - 10https://gerrit.wikimedia.org/r/233854 [16:15:27] (03PS4) 10Andrew Bogott: Switch labs controller to openstack juno [puppet] - 10https://gerrit.wikimedia.org/r/233855 [16:15:55] (03CR) 10Andrew Bogott: [C: 032] Add labnet1002 hiera host file [puppet] - 10https://gerrit.wikimedia.org/r/233854 (owner: 10Andrew Bogott) [16:16:32] (03CR) 10Andrew Bogott: [C: 032] Switch labs controller to openstack juno [puppet] - 10https://gerrit.wikimedia.org/r/233855 (owner: 10Andrew Bogott) [16:18:08] !log switching labcontrol1001 hiera to Juno which will add the cloud-archive repo for Juno. [16:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:57] (03PS1) 10Andrew Bogott: Add labnet1002 host file [puppet] - 10https://gerrit.wikimedia.org/r/233996 [16:19:59] (03PS1) 10Andrew Bogott: Remove refs to dc-migrate. Unused and bitrotted. [puppet] - 10https://gerrit.wikimedia.org/r/233997 [16:22:32] (03Abandoned) 10Andrew Bogott: Add labnet1002 host file [puppet] - 10https://gerrit.wikimedia.org/r/233996 (owner: 10Andrew Bogott) [16:22:32] (03PS2) 10Andrew Bogott: Remove refs to dc-migrate. Unused and bitrotted. 
[puppet] - 10https://gerrit.wikimedia.org/r/233997 [16:22:32] (03PS3) 10Andrew Bogott: Remove refs to dc-migrate. Unused and bitrotted. [puppet] - 10https://gerrit.wikimedia.org/r/233997 [16:22:46] !log running dist-upgrade on labcontrol1001 [16:22:46] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 2 failures [16:23:40] !log correction on that last: Upgrading nova and glance services piecemeal [16:23:52] (03CR) 10Andrew Bogott: [C: 032] Remove refs to dc-migrate. Unused and bitrotted. [puppet] - 10https://gerrit.wikimedia.org/r/233997 (owner: 10Andrew Bogott) [16:23:56] PROBLEM - nova-scheduler process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:24:35] PROBLEM - nova-conductor process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [16:26:23] (03PS1) 10Andrew Bogott: Added scheduler_pool_filter.py to Juno files [puppet] - 10https://gerrit.wikimedia.org/r/233998 [16:27:11] (03CR) 10jenkins-bot: [V: 04-1] Added scheduler_pool_filter.py to Juno files [puppet] - 10https://gerrit.wikimedia.org/r/233998 (owner: 10Andrew Bogott) [16:27:41] (03PS1) 10ArielGlenn: dumps: move scheduler args out of dump arg list [puppet] - 10https://gerrit.wikimedia.org/r/233999 [16:28:32] (03PS2) 10Andrew Bogott: Added scheduler_pool_filter.py to Juno files [puppet] - 10https://gerrit.wikimedia.org/r/233998 [16:28:41] (03PS2) 10ArielGlenn: dumps: move scheduler args out of dump arg list [puppet] - 10https://gerrit.wikimedia.org/r/233999 [16:29:16] (03CR) 10ArielGlenn: [C: 032] dumps: move scheduler args out of dump arg list [puppet] - 10https://gerrit.wikimedia.org/r/233999 (owner: 10ArielGlenn) [16:30:16] (03PS3) 10Andrew Bogott: Added scheduler_pool_filter.py to Juno files [puppet] - 10https://gerrit.wikimedia.org/r/233998 [16:32:14] (03CR) 10Andrew Bogott: [C: 032] Added scheduler_pool_filter.py to Juno files [puppet] - 
10https://gerrit.wikimedia.org/r/233998 (owner: 10Andrew Bogott) [16:34:04] !log stopping keystone, updating db, restarting [16:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:21] AndyRussG, do you know why wmgMobileUrlTemplate has a setting for mediawikiwiki in InitialiseSettings-labs.php? [16:36:29] There is no mediawikiwiki in beta. [16:36:50] (03PS1) 10Jcrespo: Changing es101[1278] to install jessie instead of trusty [puppet] - 10https://gerrit.wikimedia.org/r/234001 [16:38:04] Krenair: no idea... Looks like that's a mobile skin variable? [16:38:04] looks like it was added in https://gerrit.wikimedia.org/r/#/c/44278/2/wmf-config/InitialiseSettings-labs.php [16:39:04] Hmmm let's ask MaxSem? ^ [16:39:42] git blame? [16:40:02] 6operations, 10Traffic, 7HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1575970 (10Chmarkine) [16:40:45] hehm it's just a copy of prod settings [16:41:01] git is always to blame [16:41:53] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: for "createdirs" job don't touch latest symlinks or rss feeds [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233954 (owner: 10ArielGlenn) [16:43:01] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: allow user to specify how long dumper sleeps between wikis [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233953 (owner: 10ArielGlenn) [16:43:11] the commonswiki setting is weird too, but I don't want to set up commons.m.wikimedia.beta.wmflabs.org until Andrew is done [16:44:56] Krenair: heh thanks! 
we're just smoke testing a new feature, don't think you need to worry :) [16:46:14] (03PS1) 10Andrew Bogott: Replaced the [default] header on nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/234003 [16:46:28] (03PS2) 10Andrew Bogott: Replaced the [default] header on nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/234003 [16:48:15] (03CR) 10Andrew Bogott: [C: 032] Replaced the [default] header on nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/234003 (owner: 10Andrew Bogott) [16:50:47] RECOVERY - nova-conductor process on labcontrol1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [16:51:06] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:51:09] (03PS4) 10Dzahn: deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 [16:51:23] _joe_: ^ there is no reason not to use the "role" keyword, right [16:51:43] i want to do it so we can use role based lookup here too [16:52:15] RECOVERY - nova-scheduler process on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:52:18] and then include admin groups like this: https://gerrit.wikimedia.org/r/#/c/230966/ [16:52:58] (03PS2) 10ArielGlenn: staged dumps: sleep only 5 seconds between wikis for starting job [puppet] - 10https://gerrit.wikimedia.org/r/233961 [16:53:21] ori, any comment on https://gerrit.wikimedia.org/r/#/c/232672/2/multiversion/MWWikiversions.php ? [16:53:29] (03PS1) 10Rush: elasticsearch: add firewalling to elastic102[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/234005 [16:54:01] (03CR) 10ArielGlenn: [C: 032] staged dumps: sleep only 5 seconds between wikis for starting job [puppet] - 10https://gerrit.wikimedia.org/r/233961 (owner: 10ArielGlenn) [16:54:15] Krenair: looks good. Let me test it quickly. 
[16:54:26] I'm not about to deploy this, no rush :) [16:55:26] PROBLEM - nova-network process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [16:56:25] PROBLEM - nova-network process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [16:57:19] (03CR) 10DCausse: [C: 031] elasticsearch: add firewalling to elastic102[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/234005 (owner: 10Rush) [16:57:33] (03PS2) 10Andrew Bogott: Move labnet1001 and 1002 to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233856 [16:57:35] andrewbogott: network issues? [16:57:51] (03PS1) 10Giuseppe Lavagetto: Add unit tests for FileConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234008 [16:58:09] (03CR) 10jenkins-bot: [V: 04-1] Add unit tests for FileConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234008 (owner: 10Giuseppe Lavagetto) [16:58:13] <_joe_> mutante: I guess not [16:58:44] (03CR) 10Andrew Bogott: [C: 032] Move labnet1001 and 1002 to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233856 (owner: 10Andrew Bogott) [16:58:50] _joe_: :) [17:00:36] RECOVERY - nova-network process on labnet1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [17:01:13] YuviPanda: nova-network crashed during the upgrade… should be back and happy now [17:01:16] (03PS5) 10Dzahn: deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 [17:01:18] (03PS2) 10Giuseppe Lavagetto: Add unit tests for FileConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234008 [17:01:21] andrewbogott: yeah, seems to be [17:01:35] does that explain users cant ssh to bastion? 
[17:01:36] RECOVERY - nova-network process on labnet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [17:01:58] mutante: yes, should be back now [17:01:59] mutante: labs bastion? [17:02:07] yes and yes [17:02:13] saw reports on -wikitech [17:02:41] mutante: yes, the network was routing incorrectly for a moment, probably caused host-key mismatches [17:02:44] fixed now [17:02:57] (That’s the failure case when nova-network service dies. Pretty unclear!) [17:02:58] cool, relayed the message [17:03:29] !log upgraded labnet1002 nova services to Juno [17:03:34] indeed the error was a host key mismatch [17:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:53] which seems to be a normal thing when labsnet breaks? [17:04:22] Krenair: yeah, since they start routing to the labnet hosts themselves [17:05:44] (03PS1) 10Shanmugamp7: Temporary lift of IP cap on ta.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) [17:07:42] (03PS2) 10Jcrespo: Changing es101[1278] to install jessie instead of trusty [puppet] - 10https://gerrit.wikimedia.org/r/234001 [17:08:29] (03PS2) 10Rush: elasticsearch: add firewalling to elastic102[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/234005 [17:08:41] (03CR) 10Rush: [C: 032] elasticsearch: add firewalling to elastic102[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/234005 (owner: 10Rush) [17:08:50] (03CR) 10Rush: [V: 032] elasticsearch: add firewalling to elastic102[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/234005 (owner: 10Rush) [17:09:18] !log adding firewall to elasticsearch2[4-6] (3 was just done as a pilot) [17:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:55] (03PS6) 10Dzahn: deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 [17:11:46] (03CR) 10Glaisher: Temporary lift of IP cap on 
ta.wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [17:11:59] !log ok, /now/ I’m running a dist-upgrade on labcontrol1001, to sort out weird oslo dependencies [17:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:14] (03PS3) 10Ori.livneh: Allow dblist filenames containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 (owner: 10Alex Monk) [17:12:20] Krenair: added a unit test :) ^^ [17:12:25] (03PS3) 10Jcrespo: Changing es101[1278] to install jessie instead of trusty [puppet] - 10https://gerrit.wikimedia.org/r/234001 [17:12:38] ori, thanks :) [17:13:33] (03CR) 10Ori.livneh: [C: 031] Allow dblist filenames containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 (owner: 10Alex Monk) [17:13:43] (03CR) 10Jcrespo: [C: 032] Changing es101[1278] to install jessie instead of trusty [puppet] - 10https://gerrit.wikimedia.org/r/234001 (owner: 10Jcrespo) [17:14:02] ok, with that frequency of merging i give up the rebasing dance for now [17:15:33] (03PS1) 10Rush: elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 [17:15:39] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 (owner: 10Rush) [17:15:45] (03PS2) 10Rush: elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 [17:16:11] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1576056 (10Dzahn) repeating a run of ./rsync_lists.sh right after it was finished takes: 63m23s so it takes a minimum of an hour to run but actually more dependi... 
[17:16:49] (03CR) 10DCausse: [C: 031] elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 (owner: 10Rush) [17:17:36] PROBLEM - puppet last run on restbase1001 is CRITICAL Puppet last ran 1 day ago [17:19:01] (03PS3) 10Rush: elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 [17:19:06] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 (owner: 10Rush) [17:19:09] (03PS4) 10Rush: elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 [17:19:36] RECOVERY - puppet last run on restbase1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:20:08] (03CR) 10Dzahn: [C: 031] "looks right to activate firewalling on elastic1023 thru 1026" [puppet] - 10https://gerrit.wikimedia.org/r/234013 (owner: 10Rush) [17:21:52] (03PS7) 10Dzahn: deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 [17:22:01] (03CR) 10Rush: [C: 032 V: 032] elasticsearch: fix fw canary regex [puppet] - 10https://gerrit.wikimedia.org/r/234013 (owner: 10Rush) [17:22:26] ... and again [17:22:30] (03PS8) 10Dzahn: deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 [17:22:36] (03PS2) 10Shanmugamp7: Lift of IP cap on ta.wikipedia for IP 218.248.16.20 for the next 6 months as there are series of events planned in partnership with an organisation Removed expired exceptions as well. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) [17:23:28] (03CR) 10Dzahn: [C: 032] deployment servers: use role keyword for role [puppet] - 10https://gerrit.wikimedia.org/r/230965 (owner: 10Dzahn) [17:30:57] !log bouncing Cassandra on restbase1001 to apply temporary GC setting [17:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:44] (03PS3) 10Dzahn: mailman: SSL settings to Apache 2.4 and "mid" [puppet] - 10https://gerrit.wikimedia.org/r/232420 (https://phabricator.wikimedia.org/T90351) [17:40:15] search is broken on wikitech? :( [17:40:38] Somehow I doubt it's actually too busy [17:40:58] wfm [17:41:01] same [17:41:07] and it's working again for me as well [17:41:10] oh well [17:42:45] did the upstream debian installer changed recently? I found some extra screens that partman (not me) didn't expect [17:43:55] (03PS2) 10Andrew Bogott: Move labvirt1005 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/233857 [17:45:31] (03CR) 10Andrew Bogott: [C: 032] Move labvirt1005 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/233857 (owner: 10Andrew Bogott) [17:50:20] (03PS1) 10Alex Monk: Get rid of weird labs wmgMobileUrlTemplate settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234024 [17:51:25] (03PS1) 10Dzahn: analytics: do not use node inheritance [puppet] - 10https://gerrit.wikimedia.org/r/234025 [17:53:47] (03CR) 10Dzahn: [C: 032] "no diff on sodium: http://puppet-compiler.wmflabs.org/841/" [puppet] - 10https://gerrit.wikimedia.org/r/232420 (https://phabricator.wikimedia.org/T90351) (owner: 10Dzahn) [17:53:53] (03PS4) 10Dzahn: mailman: SSL settings to Apache 2.4 and "mid" [puppet] - 10https://gerrit.wikimedia.org/r/232420 (https://phabricator.wikimedia.org/T90351) [17:56:04] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1576248 (10BBlack) >>! 
In T105794#1575846, @Legoktm wrote: > The `format` parameter applies to all API requests: https://en.wikipedia.org/w/api.php?action=help&modules=main Heh, I clicked "... [17:56:42] (03PS4) 10Dzahn: deployment: include admin groups in role, not nodes [puppet] - 10https://gerrit.wikimedia.org/r/230966 [17:57:50] (03PS1) 10BryanDavis: Use configured bin_dir to find refreshCdbJsonFiles [tools/scap] - 10https://gerrit.wikimedia.org/r/234028 [17:58:45] I got excited there for a second, I thought maybe I had missed that sodium was already dead [17:59:03] bblack: i added an "if" so i could merge it already :p [17:59:08] (03CR) 10John F. Lewis: [C: 031] deployment: include admin groups in role, not nodes [puppet] - 10https://gerrit.wikimedia.org/r/230966 (owner: 10Dzahn) [18:00:05] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150826T1800). Please do the needful. [18:00:22] (03CR) 10Dzahn: [C: 032] deployment: include admin groups in role, not nodes [puppet] - 10https://gerrit.wikimedia.org/r/230966 (owner: 10Dzahn) [18:01:32] I'm going to take my name off that announce, I think :) [18:01:41] 7Blocked-on-Operations, 7Puppet, 6Reading-Infrastructure-Team, 6Release-Engineering, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1576289 (10akosiaris) Not sure were we are with this. @Tgr apart from T96054 which I 'll comment on, is there anything else that... 
[18:03:24] (03CR) 10Chad: [C: 032] Use configured bin_dir to find refreshCdbJsonFiles [tools/scap] - 10https://gerrit.wikimedia.org/r/234028 (owner: 10BryanDavis) [18:03:43] (03Merged) 10jenkins-bot: Use configured bin_dir to find refreshCdbJsonFiles [tools/scap] - 10https://gerrit.wikimedia.org/r/234028 (owner: 10BryanDavis) [18:07:06] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1576312 (10akosiaris) That's because we don't really need one. the `postgresql::spatialdb` is there more in order to make the database spatially enabled and less to create a database. We don't have puppet roles... [18:07:30] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1576313 (10akosiaris) p:5Triage>3High [18:09:56] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1576320 (10akosiaris) 5Open>3stalled The block has been removed, setting this to stalled while awaiting to be contacted by the hosting company/ISP [18:10:05] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1576322 (10akosiaris) p:5Unbreak!>3Low [18:12:13] (03CR) 10Florianschmidtwelzow: [C: 031] Remove auto-redirection from 404 page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [18:12:24] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1576341 (10akosiaris) @ArielGlenn, is there anything we can do about this? I am stumped to be honest give... 
[18:15:38] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1576359 (10Ironholds_backup) For reference if I don't see a contact email in my inbox by EOW I'm going to ask that it be reinstated. [18:15:52] (03PS13) 10Dzahn: contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) [18:16:01] (03CR) 10jenkins-bot: [V: 04-1] contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [18:19:32] Jeff_Green, is the fdb* naming convention still used? I guess that would be in frack somewhere if anywhere... [18:19:50] am trying to add some details about usage to wikitech's page on naming conventions [18:20:06] Krenair: yep, it's in use [18:20:08] 6operations, 6Discovery, 5Incident-20150825-Redis, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1576381 (10EBernhardson) a:5dcausse>3EBernhardson [18:20:21] ooh, they're referenced in the public dns repo [18:20:21] okay [18:20:27] thanks anyway Jeff_Green [18:20:34] Krenair: np [18:21:08] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [18:22:07] (03PS1) 1020after4: group1 wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234029 [18:22:22] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234029 (owner: 1020after4) [18:22:27] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234029 (owner: 1020after4) [18:23:04] andrewbogott: ^ nova-compute, you're still working on it? 
[18:23:16] YuviPanda: yes [18:23:22] and, that’s a test box anyway [18:23:28] ah, right [18:23:28] ok [18:24:06] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Need access for smalyshev to hive queries on stat1002 - https://phabricator.wikimedia.org/T110217#1576384 (10akosiaris) 5Open>3Resolved a:3akosiaris This has been merged. access is fine now, resolving [18:27:12] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [18:28:15] (03PS7) 10Thcipriani: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [18:28:18] (03PS1) 10Andrew Bogott: Update the custom-hacked libvirt driver for Juno. [puppet] - 10https://gerrit.wikimedia.org/r/234030 [18:28:21] (03CR) 10jenkins-bot: [V: 04-1] Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [18:29:06] (03PS2) 10Ori.livneh: Remove auto-redirection from 404 page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [18:29:15] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1576417 (10akosiaris) >>! In T110208#1576359, @Ironholds_backup wrote: > For reference if I don't see a contact email in my inbox by EOW I'm going to ask that it be reinstated. Let's h... [18:29:21] (03CR) 10Ori.livneh: [C: 032] Remove auto-redirection from 404 page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [18:29:27] (03Merged) 10jenkins-bot: Remove auto-redirection from 404 page. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [18:30:47] (03PS1) 10Dzahn: mailman: adjust import_list.sh for private lists [puppet] - 10https://gerrit.wikimedia.org/r/234032 (https://phabricator.wikimedia.org/T110131) [18:31:04] !log ori@tin Synchronized w/404.php: Ided1facc0: Remove auto-redirection from 404 page. (duration: 00m 13s) [18:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:11] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1576445 (10Mattflaschen) It sounds like either will work for Flow. If there's anything we need to do (e.g. during the time window when switching to the clone), let us know. We will have to go thro... [18:34:15] ori, is there no way to purge logos in varnish like we can do for wiki pages? [18:35:00] MZMcBride: deploying a config change? [18:35:03] (03PS8) 10Thcipriani: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [18:35:05] (03PS6) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [18:35:45] twentyafterfour, MZMcBride cannot deploy config changes :) [18:36:25] hmm there was a commit that got pulled in with my deployment commit in mediawiki-staging [18:36:42] twentyafterfour: was deployed by ori [18:36:50] written by MZ [18:36:52] yeah [18:37:06] i didn't rebase yours, twentyafterfour [18:37:19] ok so it's safe to continue I assume? [18:37:37] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: tig [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:06] ori: I would have loved for that to have gone out during SWAT, fwiw [18:38:07] 1log ^ stupid typo. 
That sync was group1 to 1.26wmf20 [18:38:21] (03PS1) 10Alexandros Kosiaris: maps: Grant redis stop/start/enable/disable sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/234034 (https://phabricator.wikimedia.org/T106637) [18:38:22] grr I can't type today [18:38:25] !log ^ stupid typo. That sync was group1 to 1.26wmf20 [18:38:27] twentyafterfour: yes. greg-g: fair, ok. [18:38:30] ori: but yes yes, I know :) [18:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:07] (03PS1) 10Jcrespo: Adding new External Storage nodes as MariaDB::core [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) [18:41:13] ^I really need a santity check here [18:41:23] not santity [18:41:25] lol [18:41:29] sanity [18:42:23] (03CR) 10Chad: Create ssh-agent-proxy internal permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [18:46:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL 100.00% of data above the critical threshold [5000000.0] [18:46:35] ohhhhhh downtime expirey [18:46:35] almost done, but not quite! [18:46:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL 100.00% of data above the critical threshold [5000000.0] [18:46:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL 100.00% of data above the critical threshold [5000000.0] [18:46:54] bblack, should it be possible for us to purge static images from varnish the same way as wiki pages? [18:48:41] kinda? [18:48:51] twentyafterfour: tin 1.26wmf19 AbuseFilter/maintenance/addMissingLoggingEntries.php has a local modification, uncommitted [18:48:52] but not in exactly the same way or with the same effects [18:48:54] Yours? 
[18:49:20] Oh [18:49:23] That'd be mine, Krinkle [18:49:45] Krinkle, Gone :) [18:50:29] !log krinkle@tin Synchronized php-1.26wmf19/maintenance/deleteEqualMessages.php: (no message) (duration: 00m 13s) [18:50:33] It was a couple of changes now in master + an extra echo statement, single use maintenance script that's already been run on all wikimedia sites [18:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:56] !log krinkle@tin Synchronized php-1.26wmf20/maintenance/deleteEqualMessages.php: (no message) (duration: 00m 11s) [18:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:05] bblack, okay... what's the difference in the method and effects? [18:51:14] Krenair: so there's a couple issues around that: [18:51:58] 1) wiki pages are "special", in that while MW serves CC headers that allow varnish to cache them for a while, varnish strips those and send no-cache headers to clients, so that we can control purging [18:52:17] so, purging other objects may not work as well in the sense that 3rd parties could be caching for some time after we purge them, too [18:52:56] for example, I just tried: https://www.wikiversity.org/static/images/wikimedia-button.png and observed in the public response: Cache-Control: max-age=31536000 [18:53:02] (03PS2) 10Andrew Bogott: Move holmium/designate to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233858 [18:53:06] (03CR) 10Ottomata: [C: 031] analytics: do not use node inheritance [puppet] - 10https://gerrit.wikimedia.org/r/234025 (owner: 10Dzahn) [18:53:37] Krinkle: hmm... [18:53:56] Krinkle: I'm pretty sure it wasn't anything I did. 
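For reference, the `max-age=31536000` that bblack observed is a cache lifetime in seconds. A minimal Python sketch of reading that directive and converting it to days follows; the helper name `max_age_seconds` is hypothetical and not part of any tool mentioned in the channel.

```python
def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.strip().isdigit():
            return int(value)
    return None


if __name__ == "__main__":
    # Header value quoted in the discussion above.
    seconds = max_age_seconds("max-age=31536000")
    print(seconds / 86400, "days")
```

With a lifetime this long, a third-party cache that already holds the object may keep serving it well after it has been purged from Varnish, which is the first caveat raised above.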
[18:54:06] twentyafterfour: Krenair claimed it already [18:54:18] twentyafterfour: the filesystem indicated you as the owner, and I guess changing a file doesn't change the owner [18:54:20] (03CR) 10Andrew Bogott: [C: 032] Move holmium/designate to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233858 (owner: 10Andrew Bogott) [18:54:24] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [18:54:41] There are some weird Wikidata entries https://logstash.wikimedia.org/#dashboard/temp/AU9rXJKtOkQDz4dSsGFt [18:54:42] Krenair: 2) for the /static/ stuff, we're internally normalizing req.http.host to "www.wikimedia.org" regardless of request-domain. I think that means you have to purge using that hostname, not other random hostnames. [18:54:50] (03PS2) 10Dzahn: mailman: adjust import_list.sh for private lists [puppet] - 10https://gerrit.wikimedia.org/r/234032 (https://phabricator.wikimedia.org/T110131) [18:55:01] (03PS1) 10Deskana: Remove files from Commons from search results on wikimediafoundation.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234040 (https://phabricator.wikimedia.org/T76957) [18:55:06] It seems every page title the maintenance script instanties in Wikipage results in an INFO UpdateRepo Couldn't find an item for MediaWiki:Privacy log entry [18:56:09] https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/3e1de68805adb8a9d7d66a0bfd33272b0deaf8d6/client/includes/UpdateRepo/UpdateRepo.php#L95 [18:56:16] that triggers for every page delete? [18:56:22] Even MediaWiki namespace [18:56:28] bblack, ahh! [18:56:38] that might be the trick.. 
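bblack's second point can be sketched as a small URL rewrite: if `/static/` requests are normalized to a single hostname before the cache lookup, a purge presumably has to name that hostname too. The Python below is only an illustration under that assumption; `purge_url_for` is a hypothetical helper, and it computes the URL to purge without sending anything.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostname quoted in the discussion above; treat it as an assumption,
# since the real value is configuration-dependent.
STATIC_HOST = "www.wikimedia.org"


def purge_url_for(url, static_host=STATIC_HOST):
    """Rewrite the host of /static/ URLs to the normalized static host."""
    parts = urlsplit(url)
    if parts.path.startswith("/static/"):
        parts = parts._replace(netloc=static_host)
    return urlunsplit(parts)


if __name__ == "__main__":
    print(purge_url_for(
        "https://www.wikiversity.org/static/images/wikimedia-button.png"))
```

URLs outside `/static/` are returned unchanged, since the normalization described above applies only to that path prefix.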
[18:58:02] (03CR) 10Jcrespo: [C: 031] Adding new External Storage nodes as MariaDB::core [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [18:58:43] Krenair: the normalized hostname for statics is puppet-controlled and varies for beta cluster as well [18:58:55] bblack-mba:puppet bblack$ git grep static_host [18:58:56] hieradata/labs.yaml:role::cache::text::static_host: 'deployment.wikimedia.beta.wmflabs.org' [18:58:59] modules/role/manifests/cache/configuration.pp: $static_host = hiera('role::cache::text::static_host', 'www.wikimedia.org') [18:59:01] I tried SquidUpdate::purge( array( 'https://www.wikimedia.org/static/images/project-logos/knwikiquote.org' ) ); [18:59:02] templates/varnish/text-frontend.inc.vcl.erb: if (req.url ~ "^/static/") { set req.http.host = "<%= scope['role::cache::configuration::static_host'] %>"; } [18:59:16] oops. [18:59:21] it's supposed to be .png, not .org... [18:59:27] heh [18:59:38] Aha! [18:59:39] That did it. [18:59:45] Thanks bblack [18:59:47] np [18:59:50] * Krenair will document this somewhere [19:00:25] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [19:01:15] (03PS2) 10Andrew Bogott: Update the custom-hacked libvirt driver for Juno. [puppet] - 10https://gerrit.wikimedia.org/r/234030 [19:02:20] (03CR) 10Dzahn: "it seems es1010 is not covered by the regexes but it exists in racktables" [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [19:02:22] RECOVERY - Disk space on labstore1002 is OK: DISK OK [19:02:55] (03CR) 10Andrew Bogott: [C: 032] Update the custom-hacked libvirt driver for Juno. [puppet] - 10https://gerrit.wikimedia.org/r/234030 (owner: 10Andrew Bogott) [19:03:34] 7Blocked-on-Operations, 7Puppet, 6Reading-Infrastructure-Team, 6Release-Engineering, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1576572 (10Tgr) Only code review. 
The patch does include a puppet role to provision a DB. [19:04:01] (03CR) 10Dzahn: [C: 031] "never mind, i see it already existed as node /es10(08|10)" [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [19:06:03] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:06:43] (03PS2) 10Jcrespo: Adding new External Storage nodes as MariaDB::core [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) [19:06:58] (03CR) 10Jcrespo: [C: 032] Adding new External Storage nodes as MariaDB::core [puppet] - 10https://gerrit.wikimedia.org/r/234035 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [19:07:23] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1576579 (10Tgr) Ideally, you should be able to tick a checkbox when provisioning a Labs instance and get a running application by the time the puppet run ends. It's reasonable to have more complex installation... [19:08:54] 6operations, 7Varnish: Figure out purging of static logos for updates - https://phabricator.wikimedia.org/T106620#1576582 (10Krenair) a:5ori>3BBlack Krenair: 2) for the /static/ stuff, we're internally normalizing req.http.host to "www.wikimedia.org" regardless of request-domain. I think that mea... [19:09:01] 6operations, 7Varnish: Figure out purging of static logos for updates - https://phabricator.wikimedia.org/T106620#1576592 (10Krenair) 5Open>3Resolved [19:10:52] (03PS3) 10Dzahn: mailman: adjust import_list.sh for private lists [puppet] - 10https://gerrit.wikimedia.org/r/234032 (https://phabricator.wikimedia.org/T110131) [19:11:28] (03PS1) 10Alex Monk: Actually use knwikiquote.png as the logo for that wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234042 (https://phabricator.wikimedia.org/T104260) [19:11:42] (03PS1) 10John F. 
Lewis: mailman: Don't store bad messages in qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234043 [19:12:00] (03PS2) 10John F. Lewis: mailman: Don't store bad messages in qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234043 [19:12:02] (03CR) 10Dzahn: [C: 032] mailman: adjust import_list.sh for private lists [puppet] - 10https://gerrit.wikimedia.org/r/234032 (https://phabricator.wikimedia.org/T110131) (owner: 10Dzahn) [19:12:03] ACKNOWLEDGEMENT - puppet last run on es1011 is CRITICAL Puppet has 4 failures Jcrespo fresh install, working on it [19:12:25] twentyafterfour, are you done with the deployment? [19:12:55] (03PS3) 10Dzahn: mailman: Don't store bad messages in qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234043 (https://phabricator.wikimedia.org/T110131) (owner: 10John F. Lewis) [19:13:07] (03CR) 10John F. Lewis: "http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/Defaults.py.in#L880 as an fyi for this variable and what it doe" [puppet] - 10https://gerrit.wikimedia.org/r/234043 (https://phabricator.wikimedia.org/T110131) (owner: 10John F. Lewis) [19:14:57] (03CR) 10Dzahn: "also https://mail.python.org/pipermail/mailman-users/2011-April/071486.html et al" [puppet] - 10https://gerrit.wikimedia.org/r/234043 (https://phabricator.wikimedia.org/T110131) (owner: 10John F. Lewis) [19:15:21] JohnFLewis: that guy is "omg, i have hundreds of files there" [19:15:29] we would laugh about hundreds.. [19:15:38] it's 67k [19:16:02] yep.. [19:16:03] i wonder why they would keep it enabled, but then: [19:16:18] "make sure that Mailman's crontab is running cron/cull_bad_shunt." 
[19:16:29] only to delete them with cron later [19:16:53] which brings us to that other task to double check all the mailman crons [19:17:22] also: BAD_SHUNT_STALE_AFTER = days(7) and stuff [19:17:42] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:18:12] it's better just not to store them either way [19:18:20] we're not using them :) we have logs for this stuff :) [19:18:28] (03CR) 10Alex Monk: [C: 032] Allow dblist filenames containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 (owner: 10Alex Monk) [19:18:32] (03PS4) 10Dzahn: mailman: Don't store bad messages in qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234043 (https://phabricator.wikimedia.org/T110131) (owner: 10John F. Lewis) [19:18:53] (03Merged) 10jenkins-bot: Allow dblist filenames containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 (owner: 10Alex Monk) [19:19:31] (03CR) 10Dzahn: [C: 032] mailman: Don't store bad messages in qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234043 (https://phabricator.wikimedia.org/T110131) (owner: 10John F. 
Lewis) [19:20:05] !log krenair@tin Synchronized multiversion/MWWikiversions.php: https://gerrit.wikimedia.org/r/#/c/232672/ (duration: 00m 12s) [19:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:15] JohnFLewis: i'm deleting them on new server manually [19:21:19] (03CR) 10Alex Monk: [C: 032] Get rid of weird labs wmgMobileUrlTemplate settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234024 (owner: 10Alex Monk) [19:21:27] okay [19:21:38] so now we only have a massive shunt directory :) [19:21:44] (03Merged) 10jenkins-bot: Get rid of weird labs wmgMobileUrlTemplate settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234024 (owner: 10Alex Monk) [19:22:18] another case of "too many to use rm" [19:22:19] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/234024/ (duration: 00m 12s) [19:22:25] (03PS1) 10Cmjohnson: Adding the remainder of mac addresses for ES servers [puppet] - 10https://gerrit.wikimedia.org/r/234046 [19:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:34] (03CR) 10Alex Monk: [C: 032] Actually use knwikiquote.png as the logo for that wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234042 (https://phabricator.wikimedia.org/T104260) (owner: 10Alex Monk) [19:22:40] (03Merged) 10jenkins-bot: Actually use knwikiquote.png as the logo for that wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234042 (https://phabricator.wikimedia.org/T104260) (owner: 10Alex Monk) [19:23:13] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234042/ (duration: 00m 12s) [19:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:42] JohnFLewis: well, that "BAD_SHUNT_STALE_AFTER" stuff might be it for 'shunt' [19:24:14] JohnFLewis: there are 2 types of "bad" files, ending in .psv and ending in .pck 
[19:24:52] (03PS1) 10Andrew Bogott: Leave designate config with icehouse settings. [puppet] - 10https://gerrit.wikimedia.org/r/234047 (https://phabricator.wikimedia.org/T104587) [19:24:54] (03PS1) 10Andrew Bogott: Move labvirt1004 to Juno, and limit schedule pool to just labvirt1004. [puppet] - 10https://gerrit.wikimedia.org/r/234048 (https://phabricator.wikimedia.org/T104587) [19:25:08] (03PS2) 10Cmjohnson: Adding the remainder of mac addresses for ES servers [puppet] - 10https://gerrit.wikimedia.org/r/234046 [19:25:16] bad as a directory can go regardless of ending. things go there and just get deleted eventually by mailman, they're not interacted with in any way [19:25:30] (03PS2) 10Andrew Bogott: Leave designate config with icehouse settings. [puppet] - 10https://gerrit.wikimedia.org/r/234047 (https://phabricator.wikimedia.org/T104587) [19:25:40] (03PS2) 10Andrew Bogott: Move labvirt1004 to Juno, and limit schedule pool to just labvirt1004. [puppet] - 10https://gerrit.wikimedia.org/r/234048 (https://phabricator.wikimedia.org/T104587) [19:26:28] (03CR) 10Cmjohnson: [C: 032] Adding the remainder of mac addresses for ES servers [puppet] - 10https://gerrit.wikimedia.org/r/234046 (owner: 10Cmjohnson) [19:26:38] JohnFLewis: 2.5 gigabytes of shunt... argggg [19:26:44] mutante: clearly cron isn't being ran [19:27:09] default is set to kill after 7 days; oldest file in shunt is about lily->sodium [19:27:11] (03PS3) 10Andrew Bogott: Leave designate config with icehouse settings. [puppet] - 10https://gerrit.wikimedia.org/r/234047 (https://phabricator.wikimedia.org/T104587) [19:27:14] so 2012 :) [19:27:17] JohnFLewis: lol, nice [19:27:23] (03PS3) 10Andrew Bogott: Move labvirt1004 to Juno, and limit schedule pool to just labvirt1004. [puppet] - 10https://gerrit.wikimedia.org/r/234048 (https://phabricator.wikimedia.org/T104587) [19:27:54] JohnFLewis: another find job to delete all older than 7 days then.. 
only caring about it on $new [19:28:08] (03CR) 10Andrew Bogott: [C: 032] Leave designate config with icehouse settings. [puppet] - 10https://gerrit.wikimedia.org/r/234047 (https://phabricator.wikimedia.org/T104587) (owner: 10Andrew Bogott) [19:28:13] sure I guess [19:29:12] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 2 failures [19:29:22] (03CR) 10Andrew Bogott: [C: 032] Move labvirt1004 to Juno, and limit schedule pool to just labvirt1004. [puppet] - 10https://gerrit.wikimedia.org/r/234048 (https://phabricator.wikimedia.org/T104587) (owner: 10Andrew Bogott) [19:32:16] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1576681 (10Dzahn) deleted all files in qfiles/bad , tons of files and we don't use them also patch by John https://gerrit.wikimedia.org/r/#/c/234043/4 to stop th... [19:32:22] (03PS1) 10Andrew Bogott: Add a couple of missing files from designate/Juno [puppet] - 10https://gerrit.wikimedia.org/r/234050 (https://phabricator.wikimedia.org/T104587) [19:33:21] (03CR) 10Andrew Bogott: [C: 032] Add a couple of missing files from designate/Juno [puppet] - 10https://gerrit.wikimedia.org/r/234050 (https://phabricator.wikimedia.org/T104587) (owner: 10Andrew Bogott) [19:33:58] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576689 (10Dzahn) 3NEW a:3Dzahn [19:34:09] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576689 (10Dzahn) a:5Dzahn>3JohnLewis [19:34:54] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576689 (10Dzahn) [sodium:~] $ sudo crontab -u list -l no crontab for list [19:35:30] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? 
- https://phabricator.wikimedia.org/T110382#1576709 (10Dzahn) meanwhile running find to delete all those files older than 7 days on the target on fermium [19:35:39] (03PS1) 10Tim Landscheidt: Tools: Decommission bigbrother [puppet] - 10https://gerrit.wikimedia.org/r/234051 [19:35:47] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:38:29] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576716 (10Dzahn) ``` [sodium:/etc/cron.d] $ cat mailman # At 8AM every day, mail reminders to admins as to pending requests. # They are less likely to ignore these reminders if they're mailed # e... [19:39:14] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576719 (10JohnLewis) So crons do run - just not all jobs. [19:39:23] (03CR) 10Tim Landscheidt: [C: 04-1] "Not to be merged yet. This also needs explanation in the commit message what replaced bigbrother." [puppet] - 10https://gerrit.wikimedia.org/r/234051 (owner: 10Tim Landscheidt) [19:40:37] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1576721 (10Dzahn) deleted all files in shunt older than 7 days, also see T110382 [19:41:45] JohnFLewis: 66M of qfiles instead of 2.5GB [19:41:53] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576723 (10JohnLewis) ``` johnflewis@fermium:/var/lib/mailman/cron$ cat crontab.in # At 8AM every day, mail reminders to admins as to pending requests. # They are less likely to ignore these remind... 
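The cleanup mentioned above (a find job deleting shunt files older than 7 days) could be approximated in Python as follows. This is a rough sketch, not the command that was actually run: `delete_older_than` is a hypothetical helper, the directory is passed in by the caller, and the 7-day default simply mirrors the `BAD_SHUNT_STALE_AFTER = days(7)` setting quoted in the channel.

```python
import os
import time


def delete_older_than(directory, days=7, dry_run=True):
    """Delete regular files in `directory` whose mtime is older than `days`.

    Returns the sorted names of stale files. With dry_run=True (the default)
    nothing is deleted, so the selection can be checked first.
    """
    cutoff = time.time() - days * 86400
    stale = []
    # os.scandir iterates lazily, so very large directories are handled
    # without building an argument list the way `rm *` would.
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.name)
    if not dry_run:
        for name in stale:
            os.unlink(os.path.join(directory, name))
    return sorted(stale)
```

Running it once with `dry_run=True` and reviewing the returned names before a real run is a cheap safeguard on a directory holding tens of thousands of files.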
[19:42:11] bah, shell username pinged me :( [19:42:28] mutante: great, that's what we want [19:42:41] * JohnFLewis checks crontab.in upstream [19:42:49] handles it on sodium [19:44:09] mutante: http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1086 add the cron we're after [19:44:24] for the record, also: http://wiki.list.org/DOC/4.35%20What%20do%20I%20do%20with%20a%20shunt%20%28qfiles-shunt%29%20directory%20full%20of%20files%3F [19:44:42] 2008, lily was moved 2012. I think the crons were never used in the new version; migrated old from lily [19:45:08] mutante: don't unshunt it :) [19:45:28] " a message CAN both be delivered and end up in the shunt queue" [19:45:29] heh, no [19:45:51] (03PS1) 10Jdlrobson: Enable banners on user pages for labs and production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234072 (https://phabricator.wikimedia.org/T109886) [19:46:56] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1576754 (10JohnLewis) [19:46:58] 6operations, 10Wikimedia-Mailing-lists: mailman cronjobs not running? - https://phabricator.wikimedia.org/T110382#1576752 (10JohnLewis) 5Open>3Resolved http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1086 adds the cron we're interested. My bet is we're using the crons lily had and not wha... [19:47:05] mutante: ^ closed it [19:47:34] !log sodium - deleting shunted messages older than 7 days [19:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:07] JohnFLewis: so we'd get it back with 2.1.18 ? 
[19:49:16] well, "back" [19:49:19] (03PS1) 10Jcrespo: depool es1005 in order to clone it to db1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234112 [19:49:26] mutante: we'd get it doing things correctly :) [19:49:36] good:) [19:49:55] nice that you added the link where it actually got added [19:49:56] (03CR) 10Jcrespo: [C: 031] depool es1005 in order to clone it to db1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234112 (owner: 10Jcrespo) [19:49:56] mar.k may have done it correctly, maybe things were wrongly packaged or maybe he just missed a step - who knows :) [19:51:02] so next we also kill all the "bad" files on sodium, and then let's see how much faster an rsync is [19:51:14] 5 minutes! :P [19:51:44] heh, no:) still every single .mbox that gets touched [19:51:50] be great if it was; we could afford to put an hours downtime and everyone is [sort of] happy :D [19:52:34] !log krenair@tin Synchronized php-1.26wmf20/extensions/Echo/includes/mapper/EventMapper.php: https://gerrit.wikimedia.org/r/#/c/234082/ (duration: 00m 12s) [19:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:05] JohnFLewis: interesting how we don't have arbcom, but we do have Koreak arbcom :p [19:54:09] Korean [19:54:25] arbcom-ko [19:55:42] Korean Arbcom = arbcom-ko Spanish Arbcom = crc-es-el Other Arbcoms = not in public list ... so consistent [19:56:32] more funny is arbcom-l has so much protection but every other 'sensitive' list like legal, board and so on, don't :) [19:57:49] subscriptions are limited to arbcom members and the bdfl [19:58:18] usual irrational outcomes of en.wiki drama [19:58:18] it anything it should be -en instead of -l [19:58:27] and in the list [19:59:09] cough, -wikipedia-en, cough [19:59:17] Nemo_bis: esp. considering the age of it and that its still done [19:59:31] Did you know that English Wikinews has an arbitration committee? 
:) [19:59:48] why they all hiding though :) [19:59:58] let's make things easier; new mailman install en-wikipedia@arbcom.wikimedia.org -- solved! [20:00:04] ugh [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150826T2000). Please do the needful. [20:00:36] JohnFLewis: wikipedia-en please :) [20:00:46] arbcom@en.wikipedia.org ? [20:00:58] deploying new version of parsoid [20:01:00] ^ and send it into OTRS :P [20:01:05] :D [20:01:14] Why does modules/mediawiki/files/apache/beta/sites/wmflabs.conf contain that block of chapters? [20:01:36] chapters also gotta test somewhere? :p [20:01:49] twentyafterfour: did the train deploy finish? [20:02:01] cscott: yes [20:02:35] mutante, but those wikis don't exist in beta.. [20:02:57] maybe they should [20:03:00] but they exist in apache; RIA [Rest In Apache] [20:03:06] *in our apaches [20:03:07] And *.labs.wikimedia.org? [20:03:29] well, that part sounds like pre-labs labs [20:03:36] indeed [20:03:37] labs labs labs labs [20:03:48] except it's being served by beta labs in non-pre-labs labs :) [20:04:08] at least, according to the beta apaches it is :/ [20:04:55] I don't think the production nameservers would agree. [20:05:59] do those files really have to be all separate? 
could it be the same templates and only the ServerNames are replaced [20:06:08] (03PS9) 10Thcipriani: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [20:06:10] (03PS7) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [20:06:12] wouldn't that also be better for actually testing changes [20:06:32] http://www.wikimedia.beta.wmflabs.org/ is useful [20:07:50] it sounds like changes in production config will not get to beta this way [20:08:10] http://zero.wikimedia.beta.wmflabs.org/wiki/Special:ZeroPortal - what. [20:08:19] http://zero.wikimedia.beta.wmflabs.org/wiki/Special:ZeroPortal?useformat=mobile - wtf? [20:10:17] (03CR) 10Ori.livneh: [C: 032] Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [20:11:35] !log deployed parsoid version 44d657de [20:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:25] thcipriani: your change applied correctly on tin (and kudos for that); the keyholder services restarted successfully, and i was able to arm the agent with the mwdeploy key. the agent sock is not allowing me to authenticate to app servers, though. i'm debugging that. [20:18:57] ori: the agent socket? Or the proxy socket? [20:20:34] thcipriani: [20:20:35] [tin:/var/log] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw1041 [20:20:35] Agent admitted failure to sign using the key. [20:20:35] Permission denied (publickey). [20:21:04] oh! [20:21:05] yup, just tried it, too. 
[20:21:08] i'm not in the wikidev group [20:21:12] oh, but it doesn't work for you either [20:21:13] hmmm [20:21:24] !log krinkle@tin Synchronized php-1.26wmf20/includes/poolcounter/PoolWorkArticleView.php: (no message) (duration: 00m 03s) [20:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:33] is the keyfingerprint right in /etc/keyholder-auth.d ? [20:22:17] Aye, that sync failed I think [20:22:53] sync-common: 100% (ok: 0; fail: 466; left: 0) [20:22:57] Krinkle: yeah, some keyholder futzing going on at the moment [20:23:03] k :) [20:23:16] I'll wait for SWAT [20:23:38] PROBLEM - Keyholder SSH agent on mira is CRITICAL Keyholder is not armed. Run keyholder arm to arm it. [20:24:14] !log armed ssh-agent key on mira [20:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:19] thcipriani: dunno, sec [20:25:29] !log Disabling puppet on tin and hacking some debug logging into ssh-agent-proxy [20:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:38] RECOVERY - Keyholder SSH agent on mira is OK Keyholder is armed with all configured keys. [20:28:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK Less than 1.00% above the threshold [1000000.0] [20:30:37] !log ori@tin Synchronized README: testing ssh-agent-proxy changes (duration: 00m 13s) [20:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:34] thcipriani: when i restarted keyholder-proxy so that it runs with the debug logging, it started working. I ran Puppet to reset my local modifications, which caused keyholder-proxy to restart again. It is still working correctly. [20:31:51] I can't explain why it wasn't working earlier. Possibly the services needed to be restarted in a different order. [20:32:13] Krinkle: try your sync again? [20:32:22] oh! 
I bet what happened is the service started before the keyholder-auth.d files were in place in the puppet run [20:32:33] ori: not now, debugging production RL issue affecting Citoid issue [20:32:40] Krinkle: ok, np [20:32:59] thcipriani: aha! cool. let's add a dependency there so we don't hit this again [20:33:20] okie doke: doing now. Thanks for your help! [20:34:06] thanks for doing this! [20:35:46] sure thing, ssh-agent-proxy is pretty neat. [20:43:03] 6operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1576927 (10RobH) 5stalled>3Open [20:43:58] https://scrutinizer-ci.com/pricing I wonder if the WMF actually uses scrutinizer-ci? Their logo is listed under "Trusted by more than 10,000 projects and companies around the world".. [20:44:48] 6operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571514 (10RobH) p:5Triage>3High This order has been approved via email discussion between @Slaporte and myself, with a forwarded approval from Geoff. As such, I've unstalled this and set to high p... [20:45:20] SPF|Cloud: CI likely does [20:45:53] Really? [20:46:12] or did possibly. greg-g ^ would be the guy who'd know [20:46:40] SPF|Cloud: some projects like https://github.com/wikimedia/composer-merge-plugin use it [20:48:05] Okay, cool [20:50:44] no, it doesn't [20:52:20] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1576986 (10RobH) I've gotten the approvals for the certificate from @Slaporte and my out of band email discussion. Additionally, WordPress support is now contacting me via email for the support of... 
[20:53:14] WMF CI doesn't [20:54:27] * greg-g just created https://phabricator.wikimedia.org/T110396 [20:54:32] * greg-g is in a meeting [21:04:31] (03PS1) 10Dzahn: mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) [21:05:09] (03PS1) 10Rush: icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 [21:05:39] (03CR) 10John F. Lewis: [C: 031] mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) (owner: 10Dzahn) [21:06:13] (03CR) 10jenkins-bot: [V: 04-1] icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 (owner: 10Rush) [21:07:28] (03PS2) 10Rush: icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 [21:13:28] PROBLEM - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% [21:14:27] (03PS1) 10Rush: phab: use apache::conf for mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/234142 [21:14:31] I don't know if it's just me yet, but I see strange JS behavior with Chrome when visiting https://www.wikivoyage.org/ it never really finishes rendering the page or loading images, even dev console is somewhat unresponsive, etc [21:14:43] doesn't affect e.g. en.wikivoyage.org, just the www portal [21:15:11] I can't even kill the tab or chrome itself, other than via cmdline kill -9 [21:15:31] uh not just you, although it's not as bad in ff [21:15:41] Loads for me [21:15:55] it was ok for me in FF, just not chrome [21:16:01] loads quickly for me in chrome [21:16:10] stable channel? 
[21:16:17] could be cookie-sensitive too for all I know [21:16:52] the css has done funny things with the logo there (bottom right) [21:17:03] yeah stable channel [21:17:04] (03PS2) 10Dzahn: mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) [21:17:09] I seem to get some variation of whatever the issue is on all the portals I check, e.g. www.wikibooks.org. They're at least missing some obvious images [21:17:16] yeah, https://www.wikivoyage.org/ finishes loading in Chrome 43 here [21:17:57] I'll debug more and get back heh [21:18:15] (03CR) 10Rush: [C: 032] phab: use apache::conf for mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/234142 (owner: 10Rush) [21:21:03] (03CR) 10Rush: "it was simple enough I just did it, hopefully is cool?" [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [21:21:58] (03PS3) 10Rush: icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 [21:23:40] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1577146 (10jcrespo) [21:24:04] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1577147 (10RobH) Simon with WP support is advising he can do a skype session to personally ID his key. Since I don't really know him outside of the WP ticketing system, I requested info if his gpg... [21:27:17] maybe it's just the OSX beta, I can't reproduce on another machine, but I can reproduce it here in incognito [21:32:39] !log Disabling Puppet on tin again to test an ssh-agent-proxy change [21:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:38:51] (03PS1) 10Andrew Bogott: Comment out nova-network on labnet1001. 
[puppet] - 10https://gerrit.wikimedia.org/r/234153 [21:38:56] (03PS3) 10Dzahn: mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) [21:39:17] (03PS2) 10Andrew Bogott: Comment out nova-network on labnet1001. [puppet] - 10https://gerrit.wikimedia.org/r/234153 [21:40:11] (03CR) 10Jcrespo: [C: 032] depool es1005 in order to clone it to db1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234112 (owner: 10Jcrespo) [21:40:16] ori: Want me to try another sync? [21:40:22] (I just started one) [21:40:43] !log krinkle@tin Synchronized php-1.26wmf20/includes/poolcounter/PoolWorkArticleView.php: (no message) (duration: 01m 12s) [21:40:45] (03CR) 10Andrew Bogott: [C: 032] Comment out nova-network on labnet1001. [puppet] - 10https://gerrit.wikimedia.org/r/234153 (owner: 10Andrew Bogott) [21:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:00] Only mw2140.codfw failed [21:41:14] I was going to depool a database [21:41:36] can I? [21:42:10] (03PS1) 10Ori.livneh: ssh-agent-proxy improvements [puppet] - 10https://gerrit.wikimedia.org/r/234155 [21:42:15] jynus: yes, afaik [21:42:44] thcipriani: ^ some follow-ups to your patch, tested on tin [21:43:04] jynus the remainder of the es servers are ready for install....i have to put in racktables still [21:43:38] cmjohnson1, thank you, I have enough work with the first 4 :-) [21:44:15] "How Wikipedia responds to breaking news". Did they mean "How Wikinews responds to breaking news"? 
http://blog.wikimedia.org/2015/08/17/wikipedia-breaking-news/ [21:44:16] heh..yeah...just fyi pm'd you the rack locations jic you need them [21:44:34] the term "wikinews" doesn't even appear in an article about wiki and breaking news [21:44:56] * thcipriani looks [21:46:21] (03PS1) 10Gilles: Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) [21:46:47] (03PS2) 10Ori.livneh: ssh-agent-proxy improvements [puppet] - 10https://gerrit.wikimedia.org/r/234155 [21:46:50] (03PS2) 10Gilles: Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) [21:47:59] (03CR) 10Gilles: "The amount of copypasta is a bit meh, do you think it's worth making a python module to hold all the comment code?" [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [21:48:10] (03CR) 10Gilles: "*common code" [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [21:48:20] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1005 (duration: 01m 12s) [21:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:59] i'm done with tin [21:52:58] (03CR) 10Dzahn: "the python script itself looks like it works, tested on neon:" [puppet] - 10https://gerrit.wikimedia.org/r/234139 (owner: 10Rush) [21:55:52] (03Abandoned) 10Andrew Bogott: Move californium/Horizon to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233859 (owner: 10Andrew Bogott) [21:55:56] !log krinkle@tin Synchronized php-1.26wmf19/includes/poolcounter/PoolWorkArticleView.php: (no message) (duration: 01m 12s) [21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:58:46] Krinkle, ping [21:59:04] jynus: pong [22:00:13] would it be possible to 
restart deleteEqualMessages on terbium? [22:00:31] jynus: OK [22:00:37] aborted [22:00:51] jynus: OK to restart? [22:01:05] yes, it should get the new config for mw automatically [22:01:19] jynus: Holy wow, it's so fast now [22:01:42] It waits for slave catch up after each action [22:02:14] not sure if serious or sarcastic? [22:02:22] jynus: serious [22:02:32] jynus: It's much much faster now than before the restart [22:02:36] there are actually less servers now [22:02:43] Suggesting you depooled a slave that was lagging? [22:02:59] must be all the CPU cores :) [22:03:01] no, depooled on because I need to clone it [22:03:08] Ok [22:03:13] but it is the worst slave in the world [22:03:18] 4 degraded disks [22:03:21] 1 failed [22:03:25] I see [22:03:33] Good work :) [22:03:34] Thanks [22:03:54] you will see some greater differences in some days! [22:04:36] (03PS4) 10Dzahn: mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) [22:04:49] (03CR) 10Dzahn: [C: 032] mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) (owner: 10Dzahn) [22:06:22] jynus: What're they to be replaced with? [22:06:30] (also curious what you mean by degraded) [22:07:07] disks are alive, but only because they have a mirror, normally it means that they are continously sending io errors [22:08:15] replace by nothing fancy, but still brand new hardware, I am expecting doubling the throughput [22:09:44] jynus: Ah, you mean one of the two drives part of a RAID situation? [22:10:26] yes, someting like that, the RAID mostly handles what it can on its own [22:11:12] when I say RAID, I actually mean "RAID controller" [22:11:15] RAID 1 I presume? [22:11:28] 10, it is a standard for dbs [22:12:11] a RAID 1 of RAID 0, basically [22:12:17] Hm.. for read/write separation? 
[22:13:06] it is hard to me to explain without knowing your background :-) [22:13:20] so you have both, more performance but still also mirrors [22:13:32] but apparently all of the mirros died :p [22:13:44] ? [22:14:22] jynus: I built my own PC when I was 14 and played with Windows 3.1. I then departed from computers for a few years doing graphic design education, and then came back as a JavaScript developer slowly expanding my scope to backend dev and devops stuff. [22:14:47] because you said 4 disks are marked as having issues [22:14:50] so you are (or were) mostly a fronted devel [22:14:56] Indeed. [22:15:20] mutante, they are in critical state, but only 1 failed [22:15:42] so availability is not yet compromised [22:15:52] alright [22:15:56] Krinkle: RAID 0 = use two disks and write a few bytes to one, then a few bytes to the other, alternating. Improves read perf because for reads, 1/2 the data will be on each disk and you can read those halves in parallel. Hurts availability because if one of the two disks dies, you're screwed [22:15:58] But I've done my fair share of devops (whatever that means these days) managing domains, VPSes and backend development (e.g. Java, PHP). But little to no networking or hardware knowledge. [22:16:13] but it was the first one I wanted to replace, for sure [22:16:25] Krinkle: RAID 1 = use two disks that mirror each other completely. Small perf gains (disks can race each other, first response wins), availability win because if one dies, you have the other [22:16:39] RAID 10 is RAID 1 on top of RAID 0 (so 4 disks) [22:16:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK Less than 1.00% above the threshold [1000000.0] [22:17:17] RoanKattouw: Yeah, https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10 gave me as much. I recall there being another one, but I don't find the exact words I'm looking for. 
[22:17:29] Oh there's the one with 3 disks using parity bits, I forget what that's called [22:17:33] actually 12 disks, but yeah, minimum of 4 disks, with 2 copies [22:17:35] RAID 2 or 4 or 5 or one of those [22:17:59] what are "arbcom clerks" [22:18:05] both 5 and 6 are based on parity [22:18:17] but that amplifies writes [22:18:25] mutante: AIUI they're people who perform checkuser and related things on behalf of arbcom, I think? [22:18:27] not a fan, and my boss either [22:18:34] and/or do some other bureaucracy-like stuff [22:18:50] So RAID 10 gives us faster reads like RAID 0. But I guess you still have to keep them in sync so a write is blocking a read? [22:19:02] RoanKattouw: thanks, i just keep seeing all these mailing lists while working on it [22:19:06] (unless it's a flash drive, but Im not sure that's common enough yet for us?) [22:19:11] well, writes can happen in parallel, at least in theory [22:19:22] not with optical disks though I guess? [22:19:28] so no penalty, but also no gain [22:19:31] Yeah [22:19:46] RoanKattouw: they do the bureaucratic management things, they don't get CheckUser :( [22:19:48] in any case, there is a layer of cache in front of it in most cases [22:20:00] like this one in particular [22:20:15] jynus: you mean the OS file system cache in RAM? [22:20:18] no [22:20:21] (or MySQL memory) [22:20:23] hardware cache [22:20:36] Interesting [22:21:16] in the old one, for example, it has 1/2 GB of RAID cache [22:21:55] 1 GB in the new one [22:22:07] * Krinkle has not heard of RAID cache before [22:22:28] This is for actual files stored on disk? 
[22:22:39] raid *controller* cache [22:23:06] (03CR) 10Thcipriani: [C: 031] "Fancy pythonin' - like it" [puppet] - 10https://gerrit.wikimedia.org/r/234155 (owner: 10Ori.livneh) [22:24:17] it is just in between the os and the disks [22:24:39] when you fsync(), you actually write to the raid controller cache [22:24:57] 6operations, 10MediaWiki-File-management, 6Multimedia, 5MW-1.26-release, and 2 others: Thumbnail render throttling should not result in HTTP 500 - https://phabricator.wikimedia.org/T110109#1577364 (10Tgr) 5Open>3Resolved a:3Tgr [22:25:08] and then you have faith on the battery of the raid [22:25:25] and proper monitoring if the BBU fails :-P [22:27:16] the file system cache, broadly speaking, is something that you want to avoid for mysql, because you want rows and indexes either on the main application memory (which has its own caching policy) or on the RAID controller cache [22:32:46] (03PS1) 10Alex Monk: Remove apache vhost for non-existent beta incubator wiki [puppet] - 10https://gerrit.wikimedia.org/r/234165 [22:33:09] YuviPanda, if I rewrite all of the apache config, can that be put up for puppet swat? :) [22:34:48] Krenair: that can definitely be put up for puppet swat :) [22:34:54] probably won't get merged in puppet swat, tho [22:35:36] (03CR) 10Ori.livneh: [C: 032] ssh-agent-proxy improvements [puppet] - 10https://gerrit.wikimedia.org/r/234155 (owner: 10Ori.livneh) [22:36:39] (03PS1) 10BBlack: Apache: do not allow R=301 rewrites of 503 or 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/234166 (https://phabricator.wikimedia.org/T109226) [22:37:26] mailing lists that get all phabricator notifications for a project and then archive it.. 
can it scale [22:37:26] (03CR) 10BBlack: "Completely untested, may not work as advertised, needs input from someone who knows this better" [puppet] - 10https://gerrit.wikimedia.org/r/234166 (https://phabricator.wikimedia.org/T109226) (owner: 10BBlack) [22:37:45] (03PS1) 10Alex Monk: Remove apache vhost for *.labs.wikimedia.org from beta [puppet] - 10https://gerrit.wikimedia.org/r/234167 [22:38:01] mutante: only with mongodb [22:38:42] jynus: Ah, I see. That makes a lot of sense. [22:38:45] YuviPanda: :o haha [22:39:10] jynus: Yeah, that's to ensure consistency between drives is maintained. [22:39:28] Which I suppose still qualifies as 'eventually consistent' but hopefully a bit quicker than how databases replicate. [22:40:07] !log Disabled Puppet on mw1017 for 2hrs and applied I059b0c96c9 for testing. [22:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:28] PROBLEM - Keyholder SSH agent on tin is CRITICAL Keyholder is not armed. Run keyholder arm to arm it. [22:41:51] !log armed keyholder on tin [22:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:09] (03PS2) 10BBlack: Apache: do not allow R=301 rewrites of 503 or 404 errors [puppet] - 10https://gerrit.wikimedia.org/r/234166 (https://phabricator.wikimedia.org/T109226) [22:43:28] RECOVERY - Keyholder SSH agent on tin is OK Keyholder is armed with all configured keys. 
[22:45:45] (03PS1) 10Alex Monk: Remove other *.wikimedia.org stuff from beta apache config [puppet] - 10https://gerrit.wikimedia.org/r/234169 [22:45:55] bblack: staged PS2 on mw1017 (test.wikipedia.org) if you want to test it [22:45:58] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [22:48:29] (03CR) 10Ori.livneh: "I think a better approach may be to ditch the ErrorDocument directives entirely: http://www.askapache.com/htaccess/crazy-advanced-mod_rewr" [puppet] - 10https://gerrit.wikimedia.org/r/234166 (https://phabricator.wikimedia.org/T109226) (owner: 10BBlack) [22:49:19] Krenair: Interesting. Looks like the www portals don't work but they are attempted [22:49:22] http://www.wikipedia.beta.wmflabs.org/ [22:49:22] http://www.wikimedia.beta.wmflabs.org/ [22:49:54] yeah [22:49:58] unless.. [22:50:00] * Krinkle tries something [22:50:09] I know why as well. [22:50:18] PROBLEM - puppet last run on lvs1004 is CRITICAL Puppet has 1 failures [22:50:20] Do you know how these pages work? [22:50:21] new portals? [22:50:24] I do [22:50:27] not new, no bblack [22:50:50] Hn.. [22:50:53] Hm... http://meta.wikimedia.beta.wmflabs.org/wiki/Www.wikipedia.org_portal [22:50:54] Krinkle, I think http://meta.wikimedia.beta.wmflabs.org/wiki/Www.wikipedia.org_template needs to be created for the wikipedia one [22:50:57] I figured that would work [22:51:00] Oh, template [22:51:02] not portal [22:51:04] what I am doing [22:51:07] :D [22:51:12] Those used to exist [22:51:14] as a template [22:51:18] and then we merged them [22:51:20] into the wrong name [22:51:38] http://www.wikipedia.beta.wmflabs.org/?purged [22:51:39] Wee [22:51:42] Oh this is fun [22:52:43] looks like somebody put a scan of their signature on commons, then uses the file in each of their email footers, which results in us getting complaints from spamcop about commons being spamvertised [22:52:55] ori: is there a better way to simulate 503 on the root than just shutting down hhvm? 
[22:54:39] PROBLEM - Keyholder SSH agent on mira is CRITICAL Keyholder is not armed. Run keyholder arm to arm it. [22:57:18] mutante: I think we had one of those last week too, but the community deleted it? [22:57:37] oh no, that was different [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150826T2300). [23:00:04] Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:39] jdlrobson is also on the list, but didn't use the ircnick template [23:01:46] \o [23:02:49] (03PS1) 10Krinkle: apache: Remove unused 'title' parameters from extract2.php urls [puppet] - 10https://gerrit.wikimedia.org/r/234174 [23:03:23] Krinkle, how'd you figure it it was used in 2005? [23:03:28] bblack: I added test.wikipedia.org/w/status.php?code=503 [23:03:31] Krenair: see -editing [23:03:39] Krenair: https://meta.wikimedia.org/w/index.php?title=Www.wikipedia.org_portal&action=history [23:03:41] ah [23:03:50] 6operations, 6Discovery, 5Incident-20150825-Redis, 3Discovery-Cirrus-Sprint, and 2 others: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1577525 (10ksmith) [23:04:14] (03CR) 10Alex Monk: [C: 031] apache: Remove unused 'title' parameters from extract2.php urls [puppet] - 10https://gerrit.wikimedia.org/r/234174 (owner: 10Krinkle) [23:04:22] bblack: though I think there's some configuration directive which determines whether Apache uses ErrorDocument to clobber error responses generated by the reverse-proxied backend [23:04:47] so killing HHVM might be the only practical way [23:05:00] but since it's testwiki, it's not a big deal to shut down HHVM for a minute [23:05:03] Krinkle, did you already sync your patches? 
[23:05:37] 6operations, 10ops-codfw: Humidity Alarms - https://phabricator.wikimedia.org/T110421#1577532 (10Dzahn) 3NEW [23:05:37] (03PS2) 10Krinkle: apache: Remove unused 'title' parameters from extract2.php urls [puppet] - 10https://gerrit.wikimedia.org/r/234174 [23:05:55] Krenair: I did not [23:06:08] Krenair: I did [23:06:20] ... [23:06:27] Krenair: I scheduled the wrong patch [23:06:33] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1577553 (10Dzahn) [23:06:35] I meant to add these https://gerrit.wikimedia.org/r/#/c/234037/ and https://gerrit.wikimedia.org/r/#/c/234038/ [23:06:57] :D [23:07:50] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1577532 (10Dzahn) examples: Humidity Alarm: ps1-a4-codfw.mgmt.codfw.wmnet Removable Sensor 2 is over threshold: 76% (> 75%) Humidity Alarm: ps1-a4-codfw.mgmt.codfw.wmnet Removable Sensor 2 is over threshold: 75%... [23:08:03] (03CR) 10Jforrester: [C: 031] "Ha." 
[puppet] - 10https://gerrit.wikimedia.org/r/234174 (owner: 10Krinkle) [23:08:04] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1577581 (10Dzahn) a:3RobH [23:10:07] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:10:21] ori: yeah I donno, both the PS1 and PS2 variants don't seem to work as advertised [23:11:03] the straightforward answer is that basically every single RewriteRule that has an R=301 should have an additional RewriteCond to exclude the static error paths, but I though this method would essentially do similar with less configspam [23:11:23] when I have some time I'll play with it somewhere isolated and map out what the behavior really is [23:12:08] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.508 second response time [23:12:39] 6operations, 10Wikimedia-Mailing-lists: Mailman error on wikimedia-de-by moderator interface - https://phabricator.wikimedia.org/T110427#1577609 (10JohnLewis) 3NEW a:3Dzahn [23:13:51] (03CR) 10BBlack: [C: 04-1] "In any case, neither of the PS1 or PS2 approaches seem to work in practice" [puppet] - 10https://gerrit.wikimedia.org/r/234166 (https://phabricator.wikimedia.org/T109226) (owner: 10BBlack) [23:14:58] Krinkle, okay, now I'm not sure why www.wikipedia.beta.wmflabs.org doesn't work [23:15:02] the page exists [23:15:05] extract2 works [23:15:07] Krenair: caching [23:15:07] bblack: gonna try something on mw1017 if you don't mind [23:15:10] Krenair: add ?1234 [23:15:17] bah [23:15:20] okay [23:15:34] www.wikipedia.beta.wmflabs.org/?_=1 [23:15:40] http://www.wikimedia.beta.wmflabs.org/?_=1 [23:16:08] ori: feel free! 
[23:16:12] yeah [23:16:18] I wonder if that can be purged somehow [23:16:28] RECOVERY - puppet last run on lvs1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:17:47] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint, and 2 others: Upgrade production to elasticsearch 1.7.1 - https://phabricator.wikimedia.org/T106165#1577633 (10ksmith) [23:19:58] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:20:30] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [23:21:10] !log cloning es1005 into es1011, ETA 9 hours [23:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:59] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.197 second response time [23:23:32] 6operations, 10Wikimedia-Mailing-lists: Mailman error on wikimedia-de-by moderator interface - https://phabricator.wikimedia.org/T110427#1577693 (10Dzahn) ``` 20418 Aug 26 23:14:11 2015 admin(17684): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@ 20419 admin(17684): [----- Mailman Version: 2.1.13 -----]... 
[23:24:37] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66101 bytes in 6.075 second response time [23:27:18] !log deployed kartotherian [23:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:10] Krinkle, sorry, got sidetracked with beta deployments breaking :) [23:34:04] jenkins completed [23:35:09] !log krenair@tin Synchronized php-1.26wmf19/maintenance/deleteEqualMessages.php: https://gerrit.wikimedia.org/r/#/c/234037/1 (duration: 01m 12s) [23:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:13] !log krenair@tin Synchronized php-1.26wmf20/maintenance/deleteEqualMessages.php: https://gerrit.wikimedia.org/r/#/c/234038/ (duration: 01m 12s) [23:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:32] Is ops aware of mw2140 being broken? [23:38:17] Krenair: will you be deploying the config change for wikivoyage? I have a meeting in 20 minutes, so am wondering if I need to delay that to be around for the swat. [23:39:06] I will [23:39:51] jdlrobson, isn't the InitialiseSettings-labs addition redundant? [23:40:37] PROBLEM - HHVM rendering on mw1017 is CRITICAL: Connection refused [23:40:47] Krenair: it's good idea to make betalabs consistent with production [23:40:54] i can separate the two if you prefer [23:40:59] yes, but doesn't that load InitialiseSettings anyway? [23:41:16] are you sure about that? 
[23:41:37] InitialiseSettings-labs is loaded at the bottom of InitialiseSettings [23:41:58] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:41:58] PROBLEM - Apache HTTP on mw1017 is CRITICAL: Connection refused [23:42:30] krenair@tin:/srv/mediawiki-staging (master)$ tail -n 4 wmf-config/InitialiseSettings.php | head -n 2 [23:42:31] if ( $wmfRealm == 'labs' ) { [23:42:31] require ( "$wmfConfigDir/InitialiseSettings-labs.php" ); [23:42:31] Krenair: i'll separate them [23:42:53] done. if the first patch works on beta labs i'll abandon the second one and will have learnt something today :) [23:42:56] (03PS2) 10Jdlrobson: Enable banners on user pages for projects with banner installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234072 (https://phabricator.wikimedia.org/T109886) [23:42:58] (03PS1) 10Jdlrobson: Enable banners on beta labs where Pagebanner is installed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234184 [23:43:45] (03CR) 10Alex Monk: [C: 032] Enable banners on user pages for projects with banner installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234072 (https://phabricator.wikimedia.org/T109886) (owner: 10Jdlrobson) [23:43:51] (03Merged) 10jenkins-bot: Enable banners on user pages for projects with banner installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234072 (https://phabricator.wikimedia.org/T109886) (owner: 10Jdlrobson) [23:43:57] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [23:44:28] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 66084 bytes in 0.166 second response time [23:45:20] jdlrobson, done [23:45:48] Krenair: testing thanks! 
[23:46:16] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234072/ (duration: 01m 12s) [23:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:01] The extra 1m is mw2140 [23:47:28] looking good Krenair [23:47:33] great [23:47:35] almost done testing but i'm not seeing problems so far [23:49:08] (03PS1) 10Alex Monk: Remove mw2140 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/234186 [23:49:55] Krenair: al good. thanks a bunch! [23:50:05] jdlrobson, are you Konfused-Kitten in phabricator? [23:51:01] Krenair: i have an account during some testing i setup for that but never used it... not sure how Phabricator links it to patches [23:51:09] your email address [23:51:16] ahhh [23:51:29] yeh i used something like jdlrobson+something@gmail.com [23:51:35] that's annoying [23:52:22] (03PS4) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [23:52:26] 6operations, 6Services, 3Discovery-Maps-Sprint: Tilerator git deploy has 4/5 issue too - https://phabricator.wikimedia.org/T110434#1577815 (10Yurik) 3NEW [23:52:35] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [23:53:17] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1577827 (10RobH) This order has been submitted, and I'm awaiting shipment updates from the vendor. 
[23:53:26] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1577828 (10RobH) p:5High>3Normal [23:54:18] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1577833 (10RobH) 5Open>3stalled The approved config on https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been submitted to the leasing vendor for order. [23:55:39] 6operations, 10Wikimedia-Mailing-lists: Mailman error on wikimedia-de-by moderator interface - https://phabricator.wikimedia.org/T110427#1577842 (10Dzahn) 16:43 think I found the issue 16:44 delete request.pck please 16:44 c.f. https://answers.launchpad.net/mailman/+quest... [23:57:35] !log git deployed tilerator - had the 4/5 issue - https://phabricator.wikimedia.org/T110434 [23:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:00] 6operations, 10ops-codfw, 7network: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1577861 (10faidon) 3NEW [23:59:43] !log mwscript deleteEqualMessages.php --wiki rowiki [23:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master