[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151008T0000).
[00:00:07] (PS1) Ori.livneh: Remove session-redis cluster definition; unused [puppet] - https://gerrit.wikimedia.org/r/244376
[00:00:13] (CR) jenkins-bot: [V: -1] Remove session-redis cluster definition; unused [puppet] - https://gerrit.wikimedia.org/r/244376 (owner: Ori.livneh)
[00:00:17] (PS2) Ori.livneh: Remove session-redis cluster definition; unused [puppet] - https://gerrit.wikimedia.org/r/244376
[00:07:32] (CR) Alex Monk: [C: 2] Sharper apple-touch icon for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/243685 (https://phabricator.wikimedia.org/T114275) (owner: Alex Monk)
[00:07:59] (Merged) jenkins-bot: Sharper apple-touch icon for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/243685 (https://phabricator.wikimedia.org/T114275) (owner: Alex Monk)
[00:08:33] !log krenair@tin Synchronized w/static/apple-touch/commons.png: https://gerrit.wikimedia.org/r/#/c/243685/ (duration: 00m 17s)
[00:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:16:59] (CR) Alex Monk: [C: -1] Rename Azerbaijani Wikisource project and namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) (owner: Siebrand)
[00:20:20] (PS1) Alex Monk: Portal namespace for fawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/244378 (https://phabricator.wikimedia.org/T113593)
[00:24:01] (PS2) Alex Monk: Portal namespace for fawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/244378 (https://phabricator.wikimedia.org/T113593)
[00:27:40] (PS2) Dzahn: Document gerrit's limitations for regexp matching in Phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/244370 (owner: QChris)
[00:28:17] (CR) Dzahn: [C: 2] "just comments - but useful comments indeed" [puppet] - https://gerrit.wikimedia.org/r/244370 (owner: QChris)
[00:30:50] just got a "Service Temporarily Unavailable" for gerrit
[00:30:57] me too
[00:30:59] yep, 503ing
[00:31:23] back now
[00:32:10] eh, that was probably my merge above
[00:32:22] it just added comments but caused a restart nevertheless ..from puppet
[00:32:44] it's still not loading change sets
[00:33:46] are you sure '#' for comments is supported?
[00:33:50] i don't see it here: https://code.google.com/p/gerrit/source/browse/Documentation/config-gerrit.txt?r=f69aeb124dc3acc64be1e164a606a1df1f6b1426
[00:34:27] PROBLEM - Host mw1160 is DOWN: PING CRITICAL - Packet loss = 100%
[00:34:48] ori: ehm.. but where does it not load changes?
[00:34:55] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[00:35:41] looks at ytterbium
[00:36:20] i suggest reverting first
[00:36:57] ok, but where do you see an error?
[00:37:14] never mind, changes are loading now
[00:37:22] i jumped the gun, sorry
[00:37:24] i clicked revert and it made the change for me..
[00:37:25] my mistake
[00:37:25] hmm
[00:37:34] it does seem kind of slow
[00:37:54] yeah but that's not likely to be a consequence of your change -- you just added comments
[00:38:02] it'd be one thing if gerrit refused to load because of a syntax error
[00:38:08] but i can't imagine comments making things slower
[00:38:18] or rather, it's probably a consequence of gerrit having just restarted
[00:38:25] agreed, yea
[00:38:48] now i just want to see the bot again :)
[00:39:29] and btw, that puppet error on neon.. wasn't real
[00:39:40] what about mw1160?
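For context on the '#' question above: gerrit.config follows git-config syntax, in which both `#` and `;` begin a comment, regardless of whether the gerrit documentation calls that out. A minimal sketch of a commented commentlink section (the section name and values here are hypothetical, not the actual Wikimedia config):

```
# git-config syntax: both '#' and ';' start comments.
[commentlink "phabricator"]
    ; Illustrative only -- gerrit evaluates "match" as a regex,
    ; subject to the limitations the merged change documents.
    match = T(\\d+)
    link = https://phabricator.wikimedia.org/T$1
```

Comments are discarded at parse time, so a comments-only edit changes no effective settings; the 503 above came from puppet restarting gerrit to apply the file, not from the comments themselves.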
[00:40:06] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:40:10] * greg-g shouldn't ask questions like that right before he intends on leaving
[00:40:39] checks
[00:41:05] eh, it's in the middle of booting up
[00:41:08] without me doing anything
[00:41:23] is... it supposed to be booting up?
[00:41:33] hah
[00:41:49] it's broken
[00:41:55] [5572679.337062] ata1: lost interrupt (Status 0x50)
[00:42:20] :(
[00:43:46] !pybal depool mw1160
[00:43:48] I wish
[00:44:00] * greg-g goes
[00:44:04] !log powercycled mw1160 - looks like broken hardware or cable (ata1: lost interrupt)
[00:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:26] RECOVERY - Host mw1160 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[00:45:51] well, ssh works
[00:46:49] the gerrit bot ...
[00:47:58] mw1160 appears normal now
[00:48:30] eh, here, nevermind:
[00:48:38] mw1160 kernel: [5507278.928847] convert invoked oom-killer:
[00:49:18] mediawiki job 31497 if that tells us anything
[00:49:39] taking phabricator down for maintenance
[00:50:44] !log phabricator maintenance/upgrade. Expect 10 minutes downtime
[00:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:51:03] !log mw1160 - was oom-killer (convert, mw job 31497)
[00:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:51:23] phab down?
[00:51:28] OK
[00:51:57] err
[00:52:08] mw1160 is down?
[00:52:13] no
[00:52:18] oh, no, my irc client has just stopped scrolling
[00:52:20] nvm
[00:52:46] mutante: does the gerrit bot need restarting?
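The oom-killer diagnosis above ("convert invoked oom-killer", mediawiki job 31497) is the kind of thing that falls straight out of the kernel log. A minimal sketch, run here against the two lines actually quoted in the channel rather than mw1160's real dmesg:

```shell
#!/usr/bin/env bash
# Stand-in for `dmesg` output on mw1160; both lines were quoted above.
kernlog='[5572679.337062] ata1: lost interrupt (Status 0x50)
[5507278.928847] convert invoked oom-killer:'

# Extract the name of the process that triggered the OOM killer
# (second field, after the bracketed kernel timestamp).
oom_proc=$(printf '%s\n' "$kernlog" | awk '/invoked oom-killer/ {print $2}')
echo "$oom_proc"   # prints: convert
```

On a live host you would run the same awk over `dmesg` or `/var/log/kern.log`; the sample is inlined here so the snippet stands alone.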
[00:52:59] yuvipanda: yes
[00:53:02] doing
[00:53:05] thanks
[00:53:58] killed it
[00:54:00] should come back
[00:54:03] :)
[00:54:12] there it is
[00:54:22] heh
[00:55:03] (CR) Dzahn: "test" [puppet] - https://gerrit.wikimedia.org/r/244381 (owner: Dzahn)
[00:55:12] :) ok
[00:55:15] :)
[01:00:52] Dear Phabricator, WTF: https://phabricator.wikimedia.org/T113002
[01:00:55] Unhandled Exception ("RuntimeException")
[01:00:58] Undefined variable: user
[01:01:17] ditto
[01:01:42] Request URL: https://phabricator.wikimedia.org/T72595
[01:01:43] Request Method: GET
[01:01:43] Status Code: HTTP/1.1 500 Internal Server Error
[01:02:15] ah, I see "10 minutes of downtime" above. ok. :)
[01:02:52] !log finished phabricator update
[01:02:52] I was creating a ticket and it apparently got sent out with no projects
[01:02:57] I definitely added projects
[01:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:07] Krenair: hmm. new bug?
[01:03:28] I can't actually see the ticket because it's completely broken
[01:03:36] Undefined variable: user
[01:04:43] Krenair: just one specific task?
[01:04:45] hmm. confirmed
[01:04:56] https://phabricator.wikimedia.org/T114971
[01:04:59] https://phabricator.wikimedia.org/T9455#1711374
[01:05:17] example for me https://phabricator.wikimedia.org/T114642 but not all
[01:05:30] strange
[01:06:06] I made another ticket and the same thing happens
[01:06:14] reported in -dev with no projects
[01:06:19] try to view and you just get an error
[01:06:32] maybe the bot is trying to look at the projects but just getting the same error
[01:06:40] Krenair: that might be because the bot screen-scrapes to get projects
[01:06:43] at a wild guess I would say it affects tickets without an assignee
[01:06:51] Getting Unhandled Exception ("RuntimeException") = unknown user on all pages. But my ticket was submitted nonetheless (though it lost the associated projects?)
[01:07:05] what tgr said
[01:07:09] uh, wtf
[01:07:11] it seems all tickets that are not assigned
[01:07:12] look at -dev
[01:07:31] but if assigned, then ok
[01:07:37] what's epriestley up to?
[01:07:39] let me try to "fix" one by taking it
[01:07:49] repo import bug?
[01:08:01] phab bug
[01:08:13] gimme a second I'm going to stop apache so that no more broken tickets get created.
[01:08:20] https://phabricator.wikimedia.org/p/epriestley/
[01:08:45] ""Declined" to "Resolved" by committing Unknown Object"
[01:08:51] interesting
[01:09:23] !log buggy update, stopping apache2 on iridium.
[01:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:11:43] icinga has not noticed yet?
[01:14:02] (PS1) Ori.livneh: Removed session.php & session-labs.php; no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/244382
[01:14:03] Hm. Do we even check for a "good" response?
[01:14:12] it has noticed
[01:14:17] but the bot isn't here
[01:14:19] (CR) Ori.livneh: [C: 2] Removed session.php & session-labs.php; no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/244382 (owner: Ori.livneh)
[01:14:24] it got kicked again
[01:14:25] (Merged) jenkins-bot: Removed session.php & session-labs.php; no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/244382 (owner: Ori.livneh)
[01:14:27] icinga knows
[01:14:55] yes, we check for "good"
[01:15:00] string 'Wikimedia and MediaWiki' not found on 'https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/' - 8490 bytes in 0.049 second response time
[01:15:10] Ah. Indeed.
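The "string ... not found" alert above is a content check, not just a status check: the plugin fetches the page and fails when an expected marker string is absent, which is how an exception page that still returns HTML trips the alert. The Wikimedia check presumably uses the standard `check_http` plugin with `-s 'Wikimedia and MediaWiki'`; the same logic can be sketched in pure bash against an inlined body (hypothetical body text standing in for the fetched page):

```shell
#!/usr/bin/env bash
# Emulate check_http's -s/--string behaviour: "$body" stands in for the
# fetched page, which during this incident rendered an exception instead
# of the normal footer.
body='<html>Unhandled Exception ("RuntimeException") Undefined variable: user</html>'
marker='Wikimedia and MediaWiki'

if [[ "$body" == *"$marker"* ]]; then
    status="OK - marker found"
else
    status="CRITICAL - string '$marker' not found"
fi
echo "$status"   # prints: CRITICAL - string 'Wikimedia and MediaWiki' not found
```

This is why icinga noticed even though the HTTP layer was still answering: the body no longer contained the marker.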
[01:15:30] 17:40 -!- wikibugs [tools.wiki@wikimedia/bot/pywikibugs] has quit [Excess Flood]
[01:15:34] I scheduled maintenance, so icinga won't alert
[01:16:04] ok, so 2 reasons :) no bot and scheduled downtime :)
[01:17:29] bot issue = T112032
[01:17:33] !log phabricator is now up and running again
[01:17:38] error fixed
[01:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:18:27] confirmed fixed for my example, thank you
[01:19:33] (CR) Alex Monk: [C: -1] "remove all of the existing individual entries too?" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: Glaisher)
[01:19:38] operations, ops-eqiad: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1711444 (Dzahn) how about first replacing the cable from server to switch?
[01:35:25] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[01:59:21] (PS2) BBlack: X-Client-IP 9/12 - Set X-C + X-C-M [puppet] - https://gerrit.wikimedia.org/r/244209 (https://phabricator.wikimedia.org/T89177)
[02:02:10] (CR) BBlack: [C: 2 V: 2] X-Client-IP 9/12 - Set X-C + X-C-M [puppet] - https://gerrit.wikimedia.org/r/244209 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[02:31:30] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1711519 (MZMcBride) An RFC meeting about this task has been scheduled for Wednesday, October 14 in #wikimedia-office on freenode.
[02:40:35] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 10m 30s)
[02:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:57] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-08 02:46:57+00:00
[02:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:54] (PS2) BBlack: X-Client-IP 10/12 - switch zero.inc to using XC + XCM [puppet] - https://gerrit.wikimedia.org/r/244210 (https://phabricator.wikimedia.org/T89177)
[02:53:04] (CR) BBlack: [C: 2 V: 2] X-Client-IP 10/12 - switch zero.inc to using XC + XCM [puppet] - https://gerrit.wikimedia.org/r/244210 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[02:53:32] operations, Wikimedia-Mailing-lists, Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1711531 (Selsharbaty-WMF) Hi @RobH. I'm trying to get a solution from your previous comments, but this part is not very clear to me. Would you clarify it for me, please?...
[02:59:15] operations, Phabricator, audits-data-retention: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1711533 (faidon) I've already implemented remoteip for OTRS, it's already in the puppet tree and is smart about doing a Hiera lookup to limit the se...
[02:59:47] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[03:01:36] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[03:02:01] (PS2) BBlack: X-Client-IP 11/12 - remove outdated 404-01b zero case [puppet] - https://gerrit.wikimedia.org/r/244211 (https://phabricator.wikimedia.org/T89177)
[03:02:08] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 05m 33s)
[03:02:09] (CR) BBlack: [C: 2 V: 2] X-Client-IP 11/12 - remove outdated 404-01b zero case [puppet] - https://gerrit.wikimedia.org/r/244211 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[03:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:47] Blocked-on-Operations, operations, Phabricator, Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1711537 (faidon) I notice the lack of IPv6 everywhere on this effort (no service IP assigned, no ACLs for it etc.). We've been building e...
[03:04:52] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-08 03:04:52+00:00
[03:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:47] (PS2) BBlack: X-Client-IP 12/12 - switch zero analytics to use XC/XCM [puppet] - https://gerrit.wikimedia.org/r/244212 (https://phabricator.wikimedia.org/T89177)
[03:13:27] (CR) BBlack: [C: 2] X-Client-IP 12/12 - switch zero analytics to use XC/XCM [puppet] - https://gerrit.wikimedia.org/r/244212 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[03:57:06] operations, Traffic, Pybal: Run IPVS in a separate network namespace - https://phabricator.wikimedia.org/T114979#1711544 (faidon) NEW
[03:57:33] operations, Traffic, Pybal: Run IPVS in a separate network namespace - https://phabricator.wikimedia.org/T114979#1711554 (faidon)
[03:58:47] (CR) EBernhardson: "https://gerrit.wikimedia.org/r/#/c/244392/ implements per-cluster timeouts so we can drop failures to labsearch after 10 minutes, but hold" [mediawiki-config] - https://gerrit.wikimedia.org/r/243744 (owner: EBernhardson)
[04:32:58] operations, Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1711589 (ssastry) See T75412#1711585 which I think explains the Oct 3 outage.
[04:45:25] (PS9) EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505)
[04:54:55] (CR) EBernhardson: Refactor monolog handling for kafka logs (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson)
[05:17:04] operations, Traffic, Pybal: Run IPVS in a separate network namespace - https://phabricator.wikimedia.org/T114979#1711605 (faidon)
[05:40:45] (CR) Dzahn: "thanks Andre, so nowadays the question is like "is it worth using donor money to have this domain and an SSL cert for it just for the redi" [dns] - https://gerrit.wikimedia.org/r/244104 (owner: Dzahn)
[05:43:56] (PS2) Dzahn: deactivate wikimania.asia [dns] - https://gerrit.wikimedia.org/r/244103
[05:45:22] (PS2) Dzahn: deactivate wikimemory.org [dns] - https://gerrit.wikimedia.org/r/244101
[05:46:28] (PS2) Dzahn: deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - https://gerrit.wikimedia.org/r/244092
[05:47:08] (CR) Dzahn: "please tell Victor" [dns] - https://gerrit.wikimedia.org/r/244086 (owner: Dzahn)
[05:48:22] (CR) Dzahn: "add yuvi" [dns] - https://gerrit.wikimedia.org/r/244081 (owner: Dzahn)
[05:52:48] (PS2) Dzahn: deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090
[05:53:10] (PS3) Dzahn: deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090
[05:53:58] (CR) KartikMistry: [C: 1] "I didn't know indiawikipedia.com exists :)" [dns] - https://gerrit.wikimedia.org/r/244081 (owner: Dzahn)
[05:54:07] (CR) Dzahn: "the prototype of junk domain" [dns] - https://gerrit.wikimedia.org/r/243970 (owner: Dzahn)
[05:55:15] (PS1) Ori.livneh: Update GettingStarted config for redis changes [mediawiki-config] - https://gerrit.wikimedia.org/r/244399
[05:55:30] (CR) Ori.livneh: [C: 2] Update GettingStarted config for redis changes [mediawiki-config] - https://gerrit.wikimedia.org/r/244399 (owner: Ori.livneh)
[05:55:36] (Merged) jenkins-bot: Update GettingStarted config for redis changes [mediawiki-config] - https://gerrit.wikimedia.org/r/244399 (owner: Ori.livneh)
[05:56:17] (CR) Yuvipanda: [C: 1] "What is 'India'?" [dns] - https://gerrit.wikimedia.org/r/244081 (owner: Dzahn)
[05:56:27] !log ori@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 18s)
[05:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:03:20] (PS2) Muehlenhoff: Don't open up the Kafka JMX port for debugging [puppet] - https://gerrit.wikimedia.org/r/244187
[06:04:43] (CR) Muehlenhoff: [C: 2 V: 2] Don't open up the Kafka JMX port for debugging [puppet] - https://gerrit.wikimedia.org/r/244187 (owner: Muehlenhoff)
[06:04:56] (CR) Dzahn: "@Yuvipanda http://wikimedia.in/wikipedia.html" [dns] - https://gerrit.wikimedia.org/r/244081 (owner: Dzahn)
[06:06:10] (PS1) Ori.livneh: Amend ferm rules for app servers to drop exemption for port 6380 [puppet] - https://gerrit.wikimedia.org/r/244402
[06:06:47] (PS2) Ori.livneh: Amend ferm rules for app servers to drop exemption for port 6380 [puppet] - https://gerrit.wikimedia.org/r/244402
[06:06:51] moritzm: ^
[06:09:15] ori: having a look
[06:10:14] thanks
[06:15:02] (CR) Muehlenhoff: [C: 1] "Looks good to me (also double-checked with salt that no mw* server still uses it)." [puppet] - https://gerrit.wikimedia.org/r/244402 (owner: Ori.livneh)
[06:15:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 8 06:15:16 UTC 2015 (duration 15m 15s)
[06:15:23] (CR) Ori.livneh: [C: 2] "Thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/244402 (owner: Ori.livneh)
[06:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:29:25] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:29:57] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:05] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:55] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:06] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:21] (CR) Muehlenhoff: "I don't see such disagreement. Enabling it via site.pp was useful for a transition period to migrate systems over piece by piece, but wit" [puppet] - https://gerrit.wikimedia.org/r/242180 (owner: Muehlenhoff)
[06:49:44] (PS1) KartikMistry: Add initial Debian package for apertium-is-sv [debs/contenttranslation/apertium-is-sv] - https://gerrit.wikimedia.org/r/244405 (https://phabricator.wikimedia.org/T111902)
[06:55:15] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:55:36] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:45] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:59] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:13] (PS1) Jcrespo: Pooling db1051 and db1055 with the same weight [mediawiki-config] - https://gerrit.wikimedia.org/r/244406
[06:59:24] man, it felt like just a short time ago I saw yesterday's daily puppet failure marking _joe_'s arrival
[06:59:26] and it's time again
[06:59:28] * yuvipanda should go to sleep
[07:00:40] <_joe_> /kick yuvipanda
[07:00:53] I'm writing a proposal for kubecon
[07:00:55] https://etherpad.wikimedia.org/p/kubecon
[07:01:10] <_joe_> when is kubecon?
[07:01:14] <_joe_> I should come too in fact
[07:01:30] _joe_: wooo
[07:01:32] that'll be great
[07:01:37] _joe_: kubecon.io
[07:01:37] <_joe_> I said I should, not that I would
[07:01:42] nonovember
[07:01:44] november
[07:01:50] <_joe_> in SF?
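The port-6380 change reviewed above (gerrit 244402) drops a ferm exemption that no mw* server still needs. In generic ferm syntax the kind of accept rule being removed looks roughly like this (hypothetical variable name and rule shape; the real rule lived in the app servers' ferm config in puppet):

```
# Hypothetical sketch of a redis-port exemption in ferm syntax.
@def $INTERNAL = (10.0.0.0/8);

domain ip table filter chain INPUT {
    # The exemption being dropped: allow internal traffic to port 6380.
    proto tcp dport 6380 saddr $INTERNAL ACCEPT;
}
```

Since Muehlenhoff verified with salt that nothing still connects to the port, deleting the rule tightens the firewall without breaking traffic.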
[07:01:52] <_joe_> meh
[07:02:07] <_joe_> I'd like to stay home for some time and do some actual work
[07:02:52] hehe
[07:02:52] <_joe_> yeah, I think I'll pass
[07:03:22] <_joe_> (also, taking me to SF is quite expensive for the foundation, and it would be for a 3 day conference...)
[07:04:06] yeah that's why you should do what I do, and then hop around for a month in various places!
[07:04:13] (j/k, don't do that)
[07:04:33] (CR) Jcrespo: [C: 2] Pooling db1051 and db1055 with the same weight [mediawiki-config] - https://gerrit.wikimedia.org/r/244406 (owner: Jcrespo)
[07:06:27] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Matching db1051 and db1055 weight for load balancing (duration: 00m 16s)
[07:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:15:05] operations, MediaWiki-extensions-BounceHandler: BounceHandler still HTTP posting to test2.wikipeida.org API in production - https://phabricator.wikimedia.org/T114984#1711657 (01tonythomas) NEW
[07:27:06] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:06] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:27:26] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:35] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:27:42] * hoo looks for an op
[07:28:27] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:45] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:55] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[07:29:59] https://gerrit.wikimedia.org/r/238396 Brandon is ok with it and the fixes for all blockers have been deployed
[07:30:41] would prefer to deploy it early so that we still can act on trouble
[07:32:33] I'm here for only 10min more and so probably shouldn't merge it
[07:32:35] sorry hoo
[07:34:26] That's ok... we have people in Europe after all
[07:35:07] (PS1) Jcrespo: Increasing regular load of db1051 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/244409
[07:36:20] operations, MediaWiki-extensions-BounceHandler: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1711701 (faidon)
[07:38:02] (PS1) KartikMistry: Added Debian package for apertium-mlt-ara [debs/contenttranslation/apertium-mlt-ara] - https://gerrit.wikimedia.org/r/244410 (https://phabricator.wikimedia.org/T111902)
[07:40:20] <_joe_> hoo: lemme take a look
[07:41:02] thanks :)
[07:41:44] (CR) Jcrespo: [C: 2] Increasing regular load of db1051 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/244409 (owner: Jcrespo)
[07:41:52] How on earth did I come up with the idea that Wednesday is October 5? :D
[07:42:03] <_joe_> I was about to ask
[07:42:09] timezones
[07:42:12] <_joe_> which year are you going to deploy this
[07:42:12] always blame timezones
[07:42:28] <_joe_> because oct 5 is not a wednesday this year :P
[07:43:36] <_joe_> hoo: sorry but I literally never took a look at that code before
[07:43:53] <_joe_> seems straightforward as a change, but you never know with varnish
[07:44:36] Is there anyone up who I could ask?
[07:44:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase load of db1051 and db1055 also for regular traffic (duration: 00m 17s)
[07:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:45:05] <_joe_> hoo: you can just wait 10 minutes until I get a good grasp on mobile redirection ;)
[07:45:23] Ouh :)
[07:46:01] _joe_: I made hoo get bblack's +1 earlier for this reason :)
[07:47:20] <_joe_> yuvipanda: +1 means "looks good in general"
[07:47:32] (PS5) Giuseppe Lavagetto: Enable automatic redirect to mobile Wikidata [puppet] - https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: Bene)
[07:47:46] <_joe_> but yeah the code is actually pretty straightforward there
[07:49:25] (CR) Giuseppe Lavagetto: [C: 2] Enable automatic redirect to mobile Wikidata [puppet] - https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: Bene)
[07:54:05] <_joe_> hoo: it works on a test host
[07:54:30] Nice
[07:54:48] <_joe_> curl -H 'Host: wikidata.org' -H 'User-Agent: docomo' -H 'X-Forwarded-Proto: https' localhost/wiki/Main_Page -v redirects to mobile
[07:55:09] <_joe_> (I know the url is bogus, but it doesn't matter)
[08:00:52] I can reproduce in various browsers :)
[08:01:29] It doesn't redirect my desktop firefox with changed user agent, but whatever
[08:04:54] <_joe_> hoo: wait another 15 minutes
[08:05:56] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:19:35] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:19:37] operations, Analytics, Analytics-Cluster, Fundraising Tech Backlog, Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1711771 (atgo) Hey guys - what's the next steps for getting the tighter sampling into pgheres? Inquiring minds...
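_joe_'s curl test above can be made a little more script-friendly: rather than scanning the full `-v` output, ask curl to print only the status code and the redirect target. A sketch along the same lines, with the same caveat as the original command (run it on a host that actually serves the wikidata.org vhost; `localhost` here is the test backend, not a URL anyone can hit):

```
# Prints the status code plus the Location target, e.g. a 30x pointing at
# m.wikidata.org when the mobile redirect fires, or "200 " when it doesn't.
curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' \
    -H 'Host: wikidata.org' -H 'User-Agent: docomo' \
    -H 'X-Forwarded-Proto: https' http://localhost/wiki/Main_Page
```

`%{redirect_url}` saves parsing the headers by hand, which makes the check easy to loop over a list of test user agents.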
[08:28:20] (PS1) Muehlenhoff: Move the ferm rules for elasticsearch internode traffic into role::logstash::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964)
[08:38:48] (PS1) Jcrespo: Depool db1055 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/244414 (https://phabricator.wikimedia.org/T112478)
[08:43:10] (CR) Jcrespo: [C: 2] Depool db1055 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/244414 (https://phabricator.wikimedia.org/T112478) (owner: Jcrespo)
[08:45:13] (CR) Hoo man: "@Jan: There's no need to wait, having settings that aren't yet used won't break anything. This can be deployed independently from the featu" [mediawiki-config] - https://gerrit.wikimedia.org/r/244165 (https://phabricator.wikimedia.org/T112865) (owner: Thiemo Mättig (WMDE))
[08:45:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 (duration: 00m 17s)
[08:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:44] (PS1) KartikMistry: Add Debian package for apertium-isl [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988)
[08:51:51] (CR) Aklapper: "@Dzahn: CZ chapter? http://www.wikimedia.cz/web/Kontakt lists a general contact address and http://www.wikimedia.cz/web/Lid%C3%A9 lists ad" [dns] - https://gerrit.wikimedia.org/r/244104 (owner: Dzahn)
[08:54:09] (PS1) KartikMistry: Add Debian package of apertium-isl-eng [debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/244416 (https://phabricator.wikimedia.org/T114988)
[08:55:25] operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1711827 (faidon) NEW
[08:55:34] moritzm: ^ btw
[08:57:41] operations, ops-eqiad, Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1711834 (fgiunchedi) >>! In T114711#1705795, @Cmjohnson wrote: > @fgiunchedi Seems reasonable to me. Space in row A may be tight but will find a place. Do we want to consider this the official p...
[09:05:47] operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1711846 (MoritzMuehlenhoff) I was wondering the same, and I think we can let it go. It would however be useful to have it around for a while as the initial test system for 2FA. Once that is established on all bastions I think we can l...
[09:12:38] (PS1) Alexandros Kosiaris: bacula: Remove leading whitespace in ERB output [puppet] - https://gerrit.wikimedia.org/r/244417
[09:12:46] (PS1) Filippo Giunchedi: cassandra: metrics blacklist for test cluster [puppet] - https://gerrit.wikimedia.org/r/244418
[09:16:49] operations, ops-eqiad, Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1711861 (faidon) Having one zone per row sounds fine to me, as is the table with the final allocation of zones. I'm worrying a bit about two things: - Whether those 4TB disks will actually be used...
[09:21:27] !log downtime for db1055 for maintenance (kernel update, mysql update, config update) [09:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:30] (03PS10) 10Hoo man: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) [09:23:06] gnah, I thought tin had base::firewall by now [09:23:07] nvm [09:24:42] (03PS2) 10Filippo Giunchedi: cassandra: metrics blacklist for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/244418 [09:29:53] hoo: there was more work needed on the rules, so it was postponed: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=186649&oldid=186417 (not sure why the bot still announced it yesterday) [09:30:29] (03PS3) 10Filippo Giunchedi: cassandra: metrics blacklist for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/244418 [09:30:32] :( [09:30:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: metrics blacklist for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/244418 (owner: 10Filippo Giunchedi) [09:31:17] moritzm, "ferm will be enabled later on mira"? [09:31:20] I thought it already was? [09:32:06] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:15] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:16] 6operations, 7Mail: Protect incoming emails with SMTP STARTLS - https://phabricator.wikimedia.org/T101452#1711875 (10faidon) [09:32:18] 6operations, 7Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1711871 (10faidon) 5Open>3Resolved This is essentially done for a few days now. See T113962 for the decom task. [09:32:26] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:57] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:33:05] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:16] PROBLEM - puppet last run on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:26] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:27] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:27] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:35] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:39:18] Krenair: sorry for the confusion, that changelog line should have said "tin". mutante enabled ferm on mira for testing some time ago, but it's broken there (since the incomplete rules also affect mira) [09:48:12] (03PS1) 10Faidon Liambotis: exim: remove dead code from OTRS/Phabricator/RT [puppet] - 10https://gerrit.wikimedia.org/r/244419 [09:48:29] (03PS1) 10Filippo Giunchedi: codfw: add test cassandra instances [dns] - 10https://gerrit.wikimedia.org/r/244420 (https://phabricator.wikimedia.org/T95253) [09:48:47] (03CR) 10Faidon Liambotis: [C: 032] exim: remove dead code from OTRS/Phabricator/RT [puppet] - 10https://gerrit.wikimedia.org/r/244419 (owner: 10Faidon Liambotis) [09:50:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw: add test cassandra instances [dns] - 10https://gerrit.wikimedia.org/r/244420 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [09:54:48] what happened to mw1008? 
[10:02:22] _joe_ was troubleshooting it [10:03:35] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: No response from NTP server [10:06:05] <_joe_> yeah it has had a surge in memory usage, it's oom'ing [10:06:27] <_joe_> I'm waiting to see if I can find a way to get into the machine and understand what happened [10:06:43] (03PS6) 10Ori.livneh: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 [10:06:49] _joe_: works ^ [10:06:56] <_joe_> ori: \o/ [10:07:02] (03PS2) 10Thiemo Mättig (WMDE): Add pageImagesPropertyIds configuration for Wikibase servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244165 (https://phabricator.wikimedia.org/T112865) [10:07:09] <_joe_> I'll look at it today [10:07:11] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh) [10:07:19] _joe_: the commit message is shitty and it needs some better comments [10:07:22] but it's usable [10:07:56] <_joe_> ori: this is awesome, and you should sleep :) [10:08:59] thanks :) good night [10:11:05] btw, I've said so way in the past but the timing now is way better [10:11:42] we should write a better README.md for pybal, plus some better docs for it [10:11:54] and after that, we can upload it to Debian [10:12:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [10:12:11] that will get it some exposure [10:15:12] !log salt rm /home/*/.ssh/authorized_keys [10:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:20] <_joe_> paravoid: duly noted :) [10:17:55] paravoid: http://pyb.al [10:18:10] haha this actually works [10:18:10] wtf [10:18:24] you're insane [10:18:32] i thought it'd be funny :P [10:18:50] it is! [10:19:03] does mark know this? 
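[Editor's note: mw1008's cascade of CHECK_NRPE socket timeouts above turned out to be an OOM spiral; a thrashing host stops answering even its monitoring daemon. A minimal, hypothetical sketch of the kind of early-warning memory check that can fire before NRPE goes silent, parsing `/proc/meminfo`-style text. Thresholds and sample numbers are illustrative, not what Wikimedia's checks use:]

```python
# Hypothetical low-memory check in the spirit of an NRPE plugin.
# Thresholds and sample values are illustrative only.

def parse_meminfo(text):
    """Return a dict of /proc/meminfo field name -> size in kB."""
    info = {}
    for line in text.strip().splitlines():
        name, _, rest = line.partition(":")
        info[name.strip()] = int(rest.split()[0])  # values are in kB
    return info

def memory_status(text, warn_pct=20, crit_pct=10):
    """Classify free-memory headroom as OK / WARNING / CRITICAL."""
    info = parse_meminfo(text)
    avail = info.get("MemAvailable", info.get("MemFree", 0))
    pct = 100.0 * avail / info["MemTotal"]
    if pct < crit_pct:
        return "CRITICAL", pct
    if pct < warn_pct:
        return "WARNING", pct
    return "OK", pct

SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          204800 kB
MemAvailable:     819200 kB
"""
```

[On a real host this would read `open('/proc/meminfo').read()`; a box in an OOM loop like mw1008 would sit deep in CRITICAL well before the kernel starts killing processes.]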
[10:19:16] don't think so [10:20:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:25:56] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release7: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1712040 (10Arrbee) [10:26:03] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release7: Test CXServer in Jessie - https://phabricator.wikimedia.org/T107307#1712044 (10Arrbee) [10:26:12] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 6 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1712045 (10Arrbee) [10:27:59] 6operations, 10Analytics, 10Deployment-Systems, 6Services, 3Scap3: Use Scap3 for deploying AQS - https://phabricator.wikimedia.org/T114999#1712059 (10mobrovac) 3NEW [10:31:11] (03PS1) 10Jcrespo: Repool db1055 at 10% weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244425 (https://phabricator.wikimedia.org/T112478) [10:32:36] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:32:58] (03CR) 10Jcrespo: [C: 032] Repool db1055 at 10% weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244425 (https://phabricator.wikimedia.org/T112478) (owner: 10Jcrespo) [10:33:04] (03Merged) 10jenkins-bot: Repool db1055 at 10% weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244425 (https://phabricator.wikimedia.org/T112478) (owner: 10Jcrespo) [10:34:58] 1 will fail, I suppose [10:35:05] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: No response from NTP server [10:35:33] <_joe_> jynus: yes, lemem bring that machine back from the dead [10:35:46] no worries [10:35:46] 
PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 19.23% of data above the critical threshold [100000000.0] [10:37:01] the timeout is really long... [10:37:27] <_joe_> !log rebooting mw1008, in oom spiral [10:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:46] PROBLEM - salt-minion processes on mw1008 is CRITICAL: Timeout while attempting connection [10:39:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 (duration: 05m 10s) [10:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:39:46] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [10:39:46] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [10:39:46] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [10:39:55] RECOVERY - Disk space on mw1008 is OK: DISK OK [10:40:17] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [10:40:17] RECOVERY - DPKG on mw1008 is OK: All packages OK [10:40:37] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:41:06] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up [10:41:06] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:41:15] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [10:42:44] (03PS1) 10Filippo Giunchedi: cassandra: exclude all seeds on the same host [puppet] - 10https://gerrit.wikimedia.org/r/244426 [10:43:36] RECOVERY - NTP on mw1008 is OK: NTP OK: Offset -0.0004200935364 secs [10:57:25] (03PS1) 10Alexandros Kosiaris: maps: Deduplicate the roles [puppet] - 10https://gerrit.wikimedia.org/r/244427 [11:01:47] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less 
than 10.00% above the threshold [75000000.0] [11:12:27] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [11:13:44] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: exclude all seeds on the same host [puppet] - 10https://gerrit.wikimedia.org/r/244426 (owner: 10Filippo Giunchedi) [11:15:54] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1712132 (10Arlolra) > See T75412#1711585 which I think explains the Oct 3 outage. Something I noticed on `enwiki/... [11:16:19] (03PS2) 10Alexandros Kosiaris: maps: Deduplicate the roles [puppet] - 10https://gerrit.wikimedia.org/r/244427 [11:25:27] (03CR) 10Daniel Kinzler: "I didn't see any attempt using non-capturing group, only the claim that they don't work. I would very much like to understand why and how " [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [11:31:35] (03PS1) 10Ori.livneh: Add TcpConnStates Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/244432 [11:32:29] paravoid: got time for a CR? ^ [11:33:16] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [11:35:54] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1712165 (10Arlolra) Was any maintenance happening on the servers around that time? 
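[Editor's note: Ori's TcpConnStates Diamond collector (Gerrit 244432, merged above) reports counts of TCP connections per state. The actual implementation lives in that change; the core idea is just tallying the hex state column of `/proc/net/tcp`, as in this standalone sketch with an inlined sample table:]

```python
from collections import Counter

# Hex state codes from the kernel (include/net/tcp_states.h).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def conn_state_counts(proc_net_tcp_text):
    """Tally connection states from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.strip().splitlines()[1:]:  # skip header
        st = line.split()[3]  # 4th whitespace-separated column: hex state
        counts[TCP_STATES.get(st, "UNKNOWN")] += 1
    return counts

SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt uid
   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0
   1: 0100007F:0050 0100007F:A3D2 01 00000000:00000000 00:00000000 00000000 33
   2: 0100007F:0050 0100007F:A3D4 06 00000000:00000000 00:00000000 00000000 33
"""
```

[A real Diamond collector would read `/proc/net/tcp` (and `/proc/net/tcp6`) in its `collect()` method and publish each count as a metric.]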
[11:38:02] <_joe_> !log rebooting mc2002 [11:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:45] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:43:26] (03PS2) 10Glaisher: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) [11:44:45] (03CR) 10Glaisher: "Also removed those entries as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [11:46:40] (03PS1) 10Glaisher: Remove duplicate entry 'mswiktionary' from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244435 [11:48:55] (03PS1) 10Alexandros Kosiaris: maps: Add mapsadminui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) [11:55:31] (03CR) 10Ori.livneh: [C: 032] Add TcpConnStates Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/244432 (owner: 10Ori.livneh) [11:58:14] (03PS1) 10Alexandros Kosiaris: tilerator: omit the port argument [puppet] - 10https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) [11:59:16] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [12:02:20] (03CR) 10Yurik: [C: 04-1] "I think we should leave it in, because Tilerator is based on the service template, which always listens to a port, even if only for the se" [puppet] - 10https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [12:02:45] (03CR) 10Alexandros Kosiaris: "@yurik, this exposes an assumption we now have. That our monitoring for nodejs happens via HTTP checks. 
Any idea on how to monitor tilerat" [puppet] - 10https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [12:04:35] (03CR) 10Alexandros Kosiaris: "hehe, answered my question right before I asked it. If I understand correctly you suggest we stay with the status quo of monitoring via HT" [puppet] - 10https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [12:09:06] !log update facts on puppet compiler from palladium [12:09:11] 6operations: create a gerrit repo for the Cards extension - https://phabricator.wikimedia.org/T115003#1712210 (10bmansurov) 3NEW [12:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:46] 6operations: create a gerrit repo for the Cards extension - https://phabricator.wikimedia.org/T115003#1712218 (10bmansurov) [12:10:51] <_joe_> godog: thanks! [12:10:56] (03PS2) 10Yurik: maps: Add mapsadminui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [12:11:04] <_joe_> godog: did I leave proper instructions somewhere, right? [12:11:35] yeah I got it from https://phabricator.wikimedia.org/T110546#1584132 [12:12:07] (03CR) 10Yurik: "I added a missing config value." 
[puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [12:13:04] even though it feels like the import/export script should live in the compiler repo itself since it is of general usage [12:13:26] (03PS2) 10Filippo Giunchedi: cassandra: exclude all seeds on the same host [puppet] - 10https://gerrit.wikimedia.org/r/244426 [12:13:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: exclude all seeds on the same host [puppet] - 10https://gerrit.wikimedia.org/r/244426 (owner: 10Filippo Giunchedi) [12:14:46] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: puppet fail [12:16:18] akosiaris, hi, i added a missing config value to your patch. But I think we should rename it to "tileratoradmin" or "tileratorui" - indicating that its specifically a tilerator sub-service [12:16:27] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:17:16] yurik: ok, I can live with that [12:20:16] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 21.74% of data above the critical threshold [100000000.0] [12:20:27] PROBLEM - Restbase root url on restbase-test2001 is CRITICAL: Connection refused [12:20:37] PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:22:20] that's me ^ [12:24:57] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [12:28:32] akosiaris, will you rename it or should i? 
also, the patch that disables a port - we don't need it - all services based on the template need to listen to a port - that's how we can later monitor its health [12:44:46] yurik: yeah, I've left a comment for the health thing on tilerator... I suppose that's fine [12:45:11] yurik: wanna rename it? I promise not to bikeshed too much [12:52:50] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1712284 (10mark) [12:53:15] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [12:59:00] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1712296 (10hashar) 5Resolved>3Open We now have the redirects expansion being done. Still need to run the Apache2 configt... [13:09:34] (03PS1) 10Jcrespo: Increasing weight of db1044 and db1015 due to load issues on db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244441 [13:10:11] (03CR) 10Jcrespo: [C: 032] Increasing weight of db1044 and db1015 due to load issues on db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244441 (owner: 10Jcrespo) [13:11:05] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce db1035 weight (duration: 00m 16s) [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:37] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:23:52] (03CR) 10coren: [C: 032] webservicemonitor: some improvements [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/239377 (https://phabricator.wikimedia.org/T109362) (owner: 10coren) [13:26:45] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
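[Editor's note: akosiaris and yurik settle above on monitoring Tilerator over HTTP, since every service built on the template listens on a port. A generic sketch of that style of probe; the URL, latency thresholds, and state mapping here are assumptions, not the production check's configuration:]

```python
import time
import urllib.request

def classify(status_code, latency_s, warn_s=1.0, crit_s=3.0):
    """Map an HTTP probe result to an Icinga-style state (illustrative rules)."""
    if status_code is None or status_code >= 500:
        return "CRITICAL"
    if latency_s >= crit_s:
        return "CRITICAL"
    if latency_s >= warn_s or status_code >= 400:
        return "WARNING"
    return "OK"

def probe(url, timeout=5.0):
    """Fetch url; return (status_code, latency), or (None, latency) on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except OSError:
        return None, time.monotonic() - start
```

[Usage would be `classify(*probe("http://tilerator.example:6534/"))` with a hypothetical host and port; the pure `classify` half keeps the decision logic testable without a live service.]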
[13:26:46] (03PS1) 10BBlack: X-Client-IP: get rid of temp var, update commentary [puppet] - 10https://gerrit.wikimedia.org/r/244442 (https://phabricator.wikimedia.org/T89177) [13:29:05] (03CR) 10BBlack: [C: 032] X-Client-IP: get rid of temp var, update commentary [puppet] - 10https://gerrit.wikimedia.org/r/244442 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [13:29:36] 6operations, 10MediaWiki-extensions-BounceHandler: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1712361 (10Jgreen) > but the bounce emails gets POSTed back to the API of test2.wikipedia.org from mx1001/mx2001 Here's the exim config line... [13:30:44] (03PS1) 10coren: Add 0.5 to changelog for rebuild [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/244443 [13:32:34] (03CR) 10coren: [C: 032] "Just changelog" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/244443 (owner: 10coren) [13:37:56] (03Abandoned) 10coren: Labs: Reduce labstore* LDAP config to the minimum [puppet] - 10https://gerrit.wikimedia.org/r/207514 (https://phabricator.wikimedia.org/T95559) (owner: 10coren) [13:38:24] 6operations, 10Datasets-General-or-Unknown, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1712386 (10Hydriz) Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed job... 
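[Editor's note: the BounceHandler task above (T114984) is about the exim relays on mx1001/mx2001 POSTing bounce emails back to the API of test2.wikipedia.org instead of the intended wiki. Roughly, the relay hands the raw bounce to a MediaWiki API module; the sketch below only builds such a request. The `action`/`email` parameter names are from memory of the BounceHandler extension and should be verified against it:]

```python
from urllib.parse import urlencode

# The misdirected target from T114984; in a correct setup this would be
# the receiving wiki's own api.php.
API_URL = "https://test2.wikipedia.org/w/api.php"

def build_bounce_post(raw_email, api_url=API_URL):
    """Build the (url, form-encoded body) pair for handing a bounce to MediaWiki.

    Parameter names are assumptions about the BounceHandler API module;
    check the extension before relying on this.
    """
    body = urlencode({
        "action": "bouncehandler",  # assumed module name
        "format": "json",
        "email": raw_email,         # the raw bounce message
    })
    return api_url, body
```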
[13:38:46] (03PS2) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243977 (https://phabricator.wikimedia.org/T89177) [13:39:45] (03PS1) 10Jcrespo: Reducing db1044 weight, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244446 [13:42:39] (03PS1) 10Jcrespo: Repooling db1055 at top capacity after warming up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244447 (https://phabricator.wikimedia.org/T112478) [13:43:23] 6operations, 10ops-eqiad: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1712393 (10Cmjohnson) a:3Cmjohnson Taking this to look at possible physical layer [13:47:43] (03CR) 10JanZerebecki: "If we don't change the name and structure of the setting anymore, yes this patch can be deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244165 (https://phabricator.wikimedia.org/T112865) (owner: 10Thiemo Mättig (WMDE)) [13:54:27] (03CR) 10Jcrespo: [C: 032] Reducing db1044 weight, too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244446 (owner: 10Jcrespo) [13:55:27] 6operations, 10Datasets-General-or-Unknown, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1712429 (10Reedy) >>! In T94277#1712386, @Hydriz wrote: > Just curious, did this change have an effect on the dumps generation process? It seems lik... [13:57:57] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [13:58:22] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1712430 (10JanZerebecki) 5Open>3Resolved I opened T114801 for that. 
[13:59:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce db1035 & db1044 weight, repool at 100% db1051 & db1055 (duration: 00m 17s) [13:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:38] (03Abandoned) 10Cmjohnson: admin: add dcausse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/243686 (https://phabricator.wikimedia.org/T114642) (owner: 10John F. Lewis) [14:07:22] 6operations, 7Database: Upgrade db1055 mysql version and configuration, and reduce its pool weight - https://phabricator.wikimedia.org/T112478#1712438 (10jcrespo) [14:08:17] cmjohnson1: abandoned? [14:08:56] johnFlewis: git was giving me a hard time and thought it would be easier to abandon and do again. Didn't think you were around [14:09:11] 6operations, 7Database: Upgrade db1055 mysql version and configuration, and reduce its pool weight - https://phabricator.wikimedia.org/T112478#1712442 (10jcrespo) 5Open>3Resolved db1055 performance issues have been resolved: * Its kernel, MySQL version and configuration have been updated * It is no longe... [14:09:23] cmjohnson1: Ah, alright. Fair enough :) [14:09:41] i was getting merge conflicts and couldn't clear them [14:09:49] git gave me grief last night fixing a merge conflict anyway so it's what I'm doing today for said patch :) [14:10:26] (03CR) 10Alex Monk: "Also brwikisource:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244435 (owner: 10Glaisher) [14:10:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [14:10:55] (03CR) 10BBlack: [C: 032] Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243977 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [14:11:49] HM [14:12:38] HM?
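[Editor's note: jynus's repool dance above (db1055 at 10% to warm up, then 100%; db1035/db1044 weights reduced) adjusts per-replica load weights in wmf-config/db-eqiad.php, which MediaWiki's load balancer uses for weighted replica selection. That config is PHP; this Python sketch with made-up hostnames and weights only illustrates what the weights mean:]

```python
import random

# Illustrative weights: db1055 freshly repooled at low weight to warm its caches.
LOADS = {"db1035": 50, "db1044": 100, "db1055": 10}

def pick_replica(loads, rng=random):
    """Weighted random choice among replicas."""
    hosts = list(loads)
    return rng.choices(hosts, weights=[loads[h] for h in hosts], k=1)[0]

def expected_share(loads, host):
    """Fraction of read traffic a host receives under weighted selection."""
    return loads[host] / sum(loads.values())
```

[With these numbers db1055 takes 10/160 ≈ 6% of reads; bumping its weight to match the others is the "repool at 100%" step once it has warmed up.]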
[14:13:02] that kafka alert [14:13:05] looking [14:13:38] !log performing schema change on the m4/analytics/eventlogging databases (db1046, db1047, dbstore2002) [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:00] (03PS11) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [14:19:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:20:46] (03PS1) 10BBlack: varnish: get rid of old ensure=>absent on dead VCLs [puppet] - 10https://gerrit.wikimedia.org/r/244448 [14:25:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Production logstash uses port 10514, fix the configuration for WDQS. [puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [14:30:35] (03PS3) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [14:31:02] (03CR) 10jenkins-bot: [V: 04-1] Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto) [14:32:03] (03PS1) 10Cmjohnson: add dcausse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/244449 [14:34:54] (03CR) 10John F. Lewis: [C: 031] add dcausse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/244449 (owner: 10Cmjohnson) [14:36:44] (03CR) 10Cmjohnson: [C: 032] add dcausse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/244449 (owner: 10Cmjohnson) [14:37:09] \o/ thanks :) [14:40:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [14:41:29] dcausse: YW! 
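[Editor's note: the Kafka Broker Replica Max Lag alerts flapping above come from a Graphite-backed check that fires when a given percentage of recent datapoints exceeds a threshold ("20.00% of data above the critical threshold [5000000.0]"). The thresholds below are taken from the alert text; the percentage cutoff and function shape are assumptions:]

```python
def over_threshold_pct(samples, threshold):
    """Percentage of datapoints strictly above the threshold."""
    if not samples:
        return 0.0
    return 100.0 * sum(1 for s in samples if s > threshold) / len(samples)

def check_lag(samples, warn=1_000_000.0, crit=5_000_000.0, pct_cutoff=10.0):
    """CRITICAL if more than pct_cutoff% of samples exceed the critical lag."""
    crit_pct = over_threshold_pct(samples, crit)
    if crit_pct > pct_cutoff:
        return "CRITICAL: %.2f%% of data above the critical threshold [%.1f]" % (
            crit_pct, crit)
    if over_threshold_pct(samples, warn) > pct_cutoff:
        return "WARNING"
    return "OK"
```

[The later RECOVERY line ("Less than 1.00% above the threshold [1000000.0]") is the same logic evaluated against the warning threshold.]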
[14:42:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding dcausse to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T114642#1712494 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson It has been 3 days and I did not see any objections. Merged the change https://gerri... [14:42:40] (03PS1) 10Reedy: Remove static-1.26wmf[34] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244450 [14:42:48] * Reedy wonders why they are still there [14:43:18] 6operations, 10ops-codfw: audit for juniper switch QFX5100-48S-AFI - https://phabricator.wikimedia.org/T114952#1712499 (10Papaul) a:5Papaul>3RobH I have 1 EX4300 SN:PE3713320082 1 QFX5100-48S-AFI SN:TA3713500309 1 power supply for the EX4300 SN: 1EDD4440178 [14:44:34] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1712503 (10Papaul) a:5Papaul>3RobH [14:45:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [14:47:47] (03CR) 10Reedy: [C: 032] Remove static-1.26wmf[34] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244450 (owner: 10Reedy) [14:47:53] (03Merged) 10jenkins-bot: Remove static-1.26wmf[34] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244450 (owner: 10Reedy) [14:48:01] <_joe_> Reedy: naughty boy [14:48:21] <_joe_> you merged that core patch :P [14:48:27] !log reedy@tin Synchronized docroot and w: (no message) (duration: 00m 17s) [14:48:29] core? [14:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:43] <_joe_> the one dropping .php5 [14:48:44] oh, oris killing php5? [14:49:00] Why, what's wrong with that? 
:P [14:50:28] <_joe_> nothing, it's just going to cause a breaking change for $random_people [14:50:51] <_joe_> I think there are a lot of things that are right about it in fact :) [14:51:49] heh [14:51:55] (03PS4) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [14:52:06] It'd be interesting to see how many people it actually causes a problem for [14:52:14] And why tf they are still using it [14:52:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 106, down: 1, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BR [14:53:15] !log replacing cr1-codfw<->asw-a-codfw QSFPs [14:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [14:59:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 106, down: 1, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BR [15:00:04] anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151008T1500). Please do the needful. [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:22] here [15:01:18] I'm here [15:01:27] I guess I failed the format [15:01:30] <_joe_> andrewbogott: I am gonna be unavailable for Puppetswat, sorry [15:01:39] Is anyone planning on doing it or should I? 
[15:01:44] <_joe_> but I reviewed the only patch that was planned, and it's GTG [15:02:04] marktraceur: I can SWAT [15:02:22] That would be super [15:02:29] I only have a labs config change, so it's trivial-ish [15:02:48] (03PS2) 10Glaisher: Remove duplicate entries from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244435 [15:02:51] _joe_: ok. My dentist’s appointment was postponed so I’ll be late to the window but will still get it done. [15:03:21] (03CR) 10Andrew Bogott: "This is slated for merging this morning. I'm going to be a bit late to the swat window but will do it before the window closes." [puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [15:03:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244142 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:04:00] (03Merged) 10jenkins-bot: Enable CX suggestions in ast, bn, ml, nb, ta and ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244142 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:04:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [15:05:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable CX suggestions in ast, bn, ml, nb, ta and ukwiki [[gerrit:244142]] (duration: 00m 17s) [15:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:01] ^ kart_ check please [15:06:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244332 (owner: 10MarkTraceur) [15:06:51] (03Merged) 10jenkins-bot: Fix labs settings for foreign uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244332 (owner: 10MarkTraceur) [15:06:54] thcipriani: Thanks. Checking. 
[15:07:49] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1712544 (10RobH) Row A does have space, my suggestion was purely in an effort to fill all the racks evenly. A2 and A7 have plenty of space. So then reviewing @fgiunchedi's above comment, I'd propos... [15:09:04] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1712545 (10RobH) [15:09:13] thcipriani: works fine. Thanks! [15:09:20] kart_: awesome. thanks! [15:09:38] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1712546 (10RobH) p:5Triage>3Normal a:5fgiunchedi>3Papaul [15:10:13] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1712549 (10RobH) So there are 10 onsite, and the new swift backends will take up 6 of them. I'll be creating a new ticket for ordering more. [15:12:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Fix labs settings for foreign uploads (syncing out so it doesnt surprise future SWATters) [[gerrit:244332]] (duration: 00m 18s) [15:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:08] (03PS3) 10Alexandros Kosiaris: maps: Deduplicate the roles [puppet] - 10https://gerrit.wikimedia.org/r/244427 [15:14:22] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Deduplicate the roles [puppet] - 10https://gerrit.wikimedia.org/r/244427 (owner: 10Alexandros Kosiaris) [15:14:55] (03CR) 10Alexandros Kosiaris: [C: 032] bacula: Remove leading whitespace in ERB output [puppet] - 10https://gerrit.wikimedia.org/r/244417 (owner: 10Alexandros Kosiaris) [15:15:15] (03PS2) 10Alexandros Kosiaris: bacula: Remove leading whitespace in ERB output [puppet] - 10https://gerrit.wikimedia.org/r/244417 [15:19:03] 6operations, 10ops-codfw: audit for juniper switch QFX5100-48S-AFI - 
https://phabricator.wikimedia.org/T114952#1712583 (10RobH) a:5RobH>3Papaul @Papaul: * 1 EX4300 SN:PE3713320082 ** This one is in racktables as ex4300-spare1-codfw * 1 QFX5100-48S-AFI SN:TA3713500309 ** This is the one I was hoping you... [15:21:53] FWIW my swat patch looks fine, thcipriani [15:21:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [15:22:15] marktraceur: awesome, thanks for checking! [15:23:15] 6operations, 10netops: cr1/cr2-codfw QSFP+ errors every second for qsfp-0/0/0 - https://phabricator.wikimedia.org/T92616#1712599 (10faidon) 5Open>3Resolved The CU5M has been replaced with a couple of QSFP+s and a fiber. The errors disappeared, hopefully for good. [15:31:57] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:34:25] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures [15:37:16] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:39:06] PROBLEM - configured eth on iridium is CRITICAL: eth1 reporting no carrier. [15:42:01] <_joe_> strontium has unmerged changes, can someone look into it? [15:42:05] <_joe_> akosiaris: ^^ [15:45:50] <_joe_> oh, solved already, sorry [15:45:55] RECOVERY - configured eth on iridium is OK: OK - interfaces up [15:55:13] 6operations, 10ops-codfw: audit for juniper switch QFX5100-48S-AFI - https://phabricator.wikimedia.org/T114952#1712668 (10Papaul) 5Open>3Resolved Complete [15:58:36] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1712672 (10Dzahn) Same here, the only reason i knew was agent forwarding and preventing that our keys get stolen. And since we don't allow that anymore i have been wondering the same thing. Especially related to all the discussions about... [16:00:05] _joe_ andrewbogott: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151008T1600). Please do the needful. [16:00:05] SMalyshev: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:02:06] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1712678 (10chasemp) I also have no attachment to iron [16:05:50] (03PS1) 10Ottomata: Add script for icinga checks of hdfs active namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244456 (https://phabricator.wikimedia.org/T90642) [16:09:22] (03PS2) 10Ottomata: Add script for icinga checks of hdfs active namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244456 (https://phabricator.wikimedia.org/T90642) [16:09:55] (03PS3) 10Ottomata: Add script for icinga checks of hdfs active namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244456 (https://phabricator.wikimedia.org/T90642) [16:11:16] (03CR) 10Ottomata: [C: 032] Add script for icinga checks of hdfs active namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244456 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [16:11:57] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Replace uses of monitoring::ganglia with monitoring::graphite_* - https://phabricator.wikimedia.org/T90642#1712691 (10Dzahn) This is really just curiosity and not a rhetorical question. What is better about monitoring::g...
[16:12:14] (03PS1) 10Ottomata: Update cdh module with check_hdfs_active_namenode script, set up nrpe monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244457 (https://phabricator.wikimedia.org/T90642) [16:12:27] (03PS2) 10Ottomata: Update cdh module with check_hdfs_active_namenode script, set up nrpe monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244457 (https://phabricator.wikimedia.org/T90642) [16:13:01] (03PS2) 10BBlack: varnish: get rid of old ensure=>absent on dead VCLs [puppet] - 10https://gerrit.wikimedia.org/r/244448 [16:13:12] (03CR) 10BBlack: [C: 032 V: 032] varnish: get rid of old ensure=>absent on dead VCLs [puppet] - 10https://gerrit.wikimedia.org/r/244448 (owner: 10BBlack) [16:13:34] (03PS3) 10Ottomata: Update cdh module with check_hdfs_active_namenode script, set up nrpe monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244457 (https://phabricator.wikimedia.org/T90642) [16:13:49] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Fix active namenode monitoring so that ANY active namenode is an OK state. - https://phabricator.wikimedia.org/T89463#1712693 (10Ottomata) [16:14:23] (03PS4) 10Ottomata: Update cdh module with check_hdfs_active_namenode script, set up nrpe monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244457 (https://phabricator.wikimedia.org/T90642) [16:14:29] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with check_hdfs_active_namenode script, set up nrpe monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244457 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [16:19:34] (03PS4) 10Andrew Bogott: Production logstash uses port 10514, fix the configuration for WDQS. [puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [16:19:52] SMalyshev: I just need to rebase and then I’ll merge. [16:20:51] (03CR) 10Andrew Bogott: [C: 032] Production logstash uses port 10514, fix the configuration for WDQS. 
[puppet] - 10https://gerrit.wikimedia.org/r/244045 (owner: 10Smalyshev) [16:22:29] (03PS1) 10BBlack: zero.inc.vcl: refactor/simplify [puppet] - 10https://gerrit.wikimedia.org/r/244458 [16:22:34] andrewbogott: thanks! [16:22:58] SMalyshev: are there prod hosts I should check, or is that strictly for labs instances? [16:23:39] andrewbogott: both. the prod ones are wdqs1001 and wdqs1002 [16:25:14] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1712722 (10BBlack) The new headers are live and working now (on all clusters, including misc). To reiterate: almost invariably, what app-layer things... [16:25:28] SMalyshev: ok, merged on those prod hosts. Anything you need me to verify? [16:26:06] andrewbogott: no, I think that's it for now. Thanks! [16:29:42] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1712757 (10GWicke) 3NEW [16:31:38] (03CR) 10DCausse: "I think that this will affect only jobs that are replayed because the cluster is frozen. 
I don't know yet how the job queue handles failed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [16:37:20] (03PS1) 10Ottomata: Only sudo -u hdfs for check_hdfs_active_namenode if needed [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244459 (https://phabricator.wikimedia.org/T89463) [16:37:43] (03CR) 10Ottomata: [C: 032] Only sudo -u hdfs for check_hdfs_active_namenode if needed [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244459 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [16:40:19] (03PS1) 10Ottomata: Allow nagios user to sudo to hdfs to run check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244460 (https://phabricator.wikimedia.org/T89463) [16:41:11] (03CR) 10jenkins-bot: [V: 04-1] Allow nagios user to sudo to hdfs to run check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244460 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [16:42:06] (03PS2) 10Ottomata: Allow nagios user to sudo to hdfs to run check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244460 (https://phabricator.wikimedia.org/T89463) [16:43:14] (03CR) 10Ottomata: [C: 032] Allow nagios user to sudo to hdfs to run check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244460 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [16:45:22] !log mathoid deploying 110abaf [16:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:20] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1712831 (10GWicke) > In other words, the request timeout doesn't actually halt processing, it just clears the cpu... 
[16:54:12] 6operations: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#1712876 (10CCogdill_WMF) 3NEW [16:57:06] (03PS1) 10Ottomata: Allow anyone to execute check_hdfs_active_namenode script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244463 (https://phabricator.wikimedia.org/T89463) [16:57:20] (03CR) 10Ottomata: [C: 032] Allow anyone to execute check_hdfs_active_namenode script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244463 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [16:57:41] (03PS1) 10Ottomata: Update cdh module with perm changes for check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244464 [16:58:48] (03CR) 10Ottomata: [C: 032] Update cdh module with perm changes for check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244464 (owner: 10Ottomata) [17:07:23] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1712951 (10Deskana) We believe this is blocked by Ops, who are curren... [17:11:43] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1712976 (10JohnLewis) >>! In T111243#1712951, @Deskana wrote: > We be... [17:12:45] PROBLEM - RAID on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:12:45] PROBLEM - puppet last run on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:04] PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:14] PROBLEM - dhclient process on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:15:31] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1712985 (10Deskana) >>! In T111243#1712976, @JohnLewis wrote: > Offsi... [17:16:25] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [17:16:25] RECOVERY - RAID on analytics1034 is OK: OK: optimal, 13 logical, 14 physical [17:16:54] RECOVERY - dhclient process on analytics1034 is OK: PROCS OK: 0 processes with command name dhclient [17:18:34] RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING [17:33:05] PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:36:34] RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING [17:37:13] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1713017 (10RobH) [17:37:17] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1713016 (10RobH) 5Open>3Resolved [17:38:07] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703678 (10RobH) [17:38:35] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1713028 (10RobH) [17:50:07] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1713070 (10Eevans) [17:50:09] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring 
- https://phabricator.wikimedia.org/T93886#1713069 (10Eevans) 5Resolved>3Open [17:51:48] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1713075 (10Eevans) Reopening to add a CQL-based check (since how we do Icinga service monitoring needs to be revisited in the wake of {T95253}). [17:52:38] (03PS1) 10Rush: admin: allow all active users to be applied [puppet] - 10https://gerrit.wikimedia.org/r/244471 [17:53:12] 6operations, 5Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#1713089 (10chasemp) I am thinking something more like https://gerrit.wikimedia.org/r/#/c/244471/ [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151008T1800). [18:01:45] PROBLEM - At least one Hadoop HDFS NameNode is active on analytics1001 is CRITICAL: CRITICAL - Hadoop Active NameNode: no namenodes are active [18:01:48] shhh [18:01:49] not true! [18:01:56] testing new check, it isn't working properly [18:04:21] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1713122 (10chasemp) You directed me on what to watch :) We currently do this with a selenium engine: // Step - 1 open("http://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Foobar") assertHtmlSou... 
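The selenium steps quoted in the T115022 comment amount to: fetch a known page over the public endpoint and assert the response looks sane. A hedged sketch of the same probe without a browser, using only the standard library — the URL and the "Foobar" marker come from the task comment; the timeout and the success criteria are illustrative assumptions, not the monitoring that was actually deployed:

```python
"""External uptime probe for the REST API, sketched without selenium.

The URL and page marker come from the task comment above; everything
else (timeout, what counts as 'up') is an illustrative assumption.
"""
import urllib.request

URL = 'http://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Foobar'


def looks_up(status, body, marker='Foobar'):
    """Decide whether one response indicates the API is up."""
    return status == 200 and marker in body


def probe(url=URL, timeout=10):
    """Fetch the URL and evaluate it; any network error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode('utf-8', 'replace')
            return looks_up(resp.status, body)
    except OSError:
        return False
```

Run periodically from a host outside the serving infrastructure, `probe()` gives the "external" view the task asks for: a False result on a 5xx, a timeout, or a body missing the marker.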
[18:10:36] RECOVERY - At least one Hadoop HDFS NameNode is active on analytics1001 is OK: OKAY - Hadoop Active Namenode: analytics1001-eqiad-wmnet [18:10:54] (03PS1) 1020after4: wikipedia wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244477 [18:11:20] (03PS1) 10Ottomata: Nagios should sudo when nrpe checking heck_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244478 (https://phabricator.wikimedia.org/T89463) [18:11:44] (03PS2) 10Ottomata: Nagios should sudo when nrpe running check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244478 (https://phabricator.wikimedia.org/T89463) [18:12:46] (03CR) 10Ottomata: [C: 032] Nagios should sudo when nrpe running check_hdfs_active_namenode [puppet] - 10https://gerrit.wikimedia.org/r/244478 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [18:12:48] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244477 (owner: 1020after4) [18:12:53] (03Merged) 10jenkins-bot: wikipedia wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244477 (owner: 1020after4) [18:13:24] (03PS1) 10Ottomata: exit(2) for CRITICAL in check_hdfs_active_namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244480 (https://phabricator.wikimedia.org/T89463) [18:13:39] (03CR) 10Ottomata: [C: 032] exit(2) for CRITICAL in check_hdfs_active_namenode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244480 (https://phabricator.wikimedia.org/T89463) (owner: 10Ottomata) [18:14:10] (03PS1) 10Ottomata: Update cdh submodule with check_hdfs_active_namenode CRITICAL fix [puppet] - 10https://gerrit.wikimedia.org/r/244481 [18:14:22] (03PS2) 10Ottomata: Update cdh submodule with check_hdfs_active_namenode CRITICAL fix [puppet] - 10https://gerrit.wikimedia.org/r/244481 [18:14:36] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh submodule with check_hdfs_active_namenode CRITICAL fix [puppet] - 
10https://gerrit.wikimedia.org/r/244481 (owner: 10Ottomata) [18:20:28] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.2 [18:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:28] (03PS1) 10Ottomata: Remove monitoring::ganglia hadoop-hdfs-namenode-primary-is-active [puppet] - 10https://gerrit.wikimedia.org/r/244482 (https://phabricator.wikimedia.org/T90642) [18:25:52] (03CR) 10Ottomata: [C: 032] Remove monitoring::ganglia hadoop-hdfs-namenode-primary-is-active [puppet] - 10https://gerrit.wikimedia.org/r/244482 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [18:28:32] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1713186 (10chasemp) p:5Triage>3Normal [18:29:40] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Replace uses of monitoring::ganglia with monitoring::graphite_* [8 pts] - https://phabricator.wikimedia.org/T90642#1713190 (10Ottomata) [18:29:54] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix active namenode monitoring so that ANY active namenode is an OK state. 
[8 pts] - https://phabricator.wikimedia.org/T89463#1713191 (10Ottomata) [18:30:25] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:43:52] (03PS1) 10Rush: phab: manage iridium-vcs address and allow in fw [puppet] - 10https://gerrit.wikimedia.org/r/244487 [18:44:05] (03PS2) 10Rush: phab: manage iridium-vcs address and allow in fw [puppet] - 10https://gerrit.wikimedia.org/r/244487 [18:44:15] (03PS1) 10Krinkle: webperf: Allow zero values for navtiming metrics [puppet] - 10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) [18:44:33] (03PS2) 10Krinkle: webperf: Allow zero values for navtiming metrics [puppet] - 10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) [18:45:30] (03PS1) 10Ottomata: Remove no longer used monitoring::ganglia 'hadoop-hdfs-namenode-primary-is-active' [puppet] - 10https://gerrit.wikimedia.org/r/244491 (https://phabricator.wikimedia.org/T90642) [18:46:21] (03CR) 10Ottomata: [C: 032] Remove no longer used monitoring::ganglia 'hadoop-hdfs-namenode-primary-is-active' [puppet] - 10https://gerrit.wikimedia.org/r/244491 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [18:46:53] (03PS3) 10Rush: phab: manage iridium-vcs address and allow in fw [puppet] - 10https://gerrit.wikimedia.org/r/244487 [18:47:53] (03CR) 10Rush: [C: 032] phab: manage iridium-vcs address and allow in fw [puppet] - 10https://gerrit.wikimedia.org/r/244487 (owner: 10Rush) [18:51:12] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet last ran 18 hours ago [18:51:53] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix active namenode monitoring so that ANY active namenode is an OK state. [8 pts] - https://phabricator.wikimedia.org/T89463#1713268 (10Ottomata) Done! https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1001... 
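The back-and-forth above (exit(2) for CRITICAL, sudo rights for the nagios user) all follows from the Nagios plugin contract: print one status line and exit 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. A minimal sketch in the spirit of check_hdfs_active_namenode — the real script lives in the puppet/cdh module; the namenode IDs and the `hdfs haadmin -getServiceState` call here are assumptions for illustration:

```python
#!/usr/bin/env python
"""Nagios-style check: OK if at least one HDFS NameNode is active.

Sketch only -- the namenode IDs and the `hdfs haadmin` invocation are
illustrative assumptions, not the exact production check.
"""
import subprocess
import sys

NAMENODE_IDS = ['nn1', 'nn2']  # hypothetical HA namenode service IDs


def query_state(namenode_id):
    """Ask HDFS for a namenode's HA state ('active' or 'standby')."""
    out = subprocess.check_output(
        ['hdfs', 'haadmin', '-getServiceState', namenode_id])
    return out.decode().strip().lower()


def nagios_result(states):
    """Map a list of HA states to an (exit_code, message) pair."""
    active = [s for s in states if s == 'active']
    if active:
        return 0, 'OKAY - %d active NameNode(s)' % len(active)
    # exit code 2 is what Icinga/NRPE interprets as CRITICAL
    return 2, 'CRITICAL - Hadoop Active NameNode: no namenodes are active'


if __name__ == '__main__':
    states = [query_state(nn) for nn in NAMENODE_IDS]
    code, message = nagios_result(states)
    print(message)
    sys.exit(code)
```

The false CRITICAL at 18:01 in the log is exactly the failure mode such checks have to guard against: if the probe itself cannot run (for instance, because it needs `sudo -u hdfs`), it must still report a meaningful non-zero status rather than die with the wrong one.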
[18:52:53] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:12] Smooth train deployment today, at least for ContentTranslation. Thanks. [19:04:02] 6operations, 10Wikimedia-Mailing-lists: move sodium backup to archive pool? - https://phabricator.wikimedia.org/T113828#1713324 (10Dzahn) @akosiaris did this and created a job for it. thank you! the 22505 is the jobid of the last full backup of sodium. ,,, Name = "sodium Job .. (03PS6) 10Rush: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:14:50] (03CR) 10jenkins-bot: [V: 04-1] SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:22:13] (03CR) 1020after4: [C: 031] SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:23:06] (03PS8) 10Rush: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:24:31] (03CR) 10Rush: [C: 032] SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:31:56] (03PS1) 10Rush: phab: sshd vcs port is static [puppet] - 10https://gerrit.wikimedia.org/r/244497 [19:33:52] (03CR) 10Rush: [C: 032] phab: sshd vcs port is static [puppet] - 10https://gerrit.wikimedia.org/r/244497 (owner: 10Rush) [19:59:29] (03PS1) 10Hashar: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) [20:01:18] (03CR) 10Hashar: "I will handle the manual cleanup on gallium. 
Namely remove whatever gerrit::replicationdest is adding such as the 'gerritslave' user and d" [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [20:01:30] (03CR) 10Chad: [C: 031] "lgtm on the gerrit side. Just needs a merge and reload of the replication plugin there." [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [20:03:27] (03PS2) 10Hashar: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) [20:06:05] (03CR) 10QChris: "> I didn't see any attempt using non-capturing group, [...]" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [20:06:09] (03CR) 10Hashar: "Updated commit message to mention that needs reloading the replication plugin (Chad can handle that part)" [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [20:20:57] (03PS1) 1020after4: Fix phabricator basedir in vcs.pp / phabricator-ssh-hook [puppet] - 10https://gerrit.wikimedia.org/r/244506 [20:35:12] (03PS2) 1020after4: Fix phabricator basedir in vcs.pp / phabricator-ssh-hook [puppet] - 10https://gerrit.wikimedia.org/r/244506 (https://phabricator.wikimedia.org/T100519) [20:36:03] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1713598 (10mmodell) https://gerrit.wikimedia.org/r/#/c/244506/ [20:36:14] 6operations, 10Wikimedia-Mailing-lists: move sodium backup to archive pool? - https://phabricator.wikimedia.org/T113828#1713599 (10Dzahn) ``` *list jobid=22505 Automatically selected Catalog: production Using Catalog "production" +--------+-----------------------------------------------------------------+-----... 
[20:37:59] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Decommission sodium - https://phabricator.wikimedia.org/T110142#1713604 (10Dzahn) [20:38:00] 6operations, 10Wikimedia-Mailing-lists: move sodium backup to archive pool? - https://phabricator.wikimedia.org/T113828#1713603 (10Dzahn) 5Open>3Resolved [20:38:13] 6operations, 10Wikimedia-Mailing-lists: move sodium backup to archive pool? - https://phabricator.wikimedia.org/T113828#1676993 (10Dzahn) a:5Dzahn>3akosiaris [20:41:56] 6operations, 10Wikimedia-Mailing-lists: Decommission sodium - https://phabricator.wikimedia.org/T110142#1713629 (10Dzahn) [20:42:10] (03CR) 10EBernhardson: "failures unrelated to frozen indices have a static number of retries decided by the job system. After it fails 3 times it gets put in an " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [20:44:18] (03CR) 10EBernhardson: "i've added aaron as a reviewer as the resident job queue expert, maybe he has an idea or two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [20:46:26] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1713665 (10GWicke) We just found an instance where a client retrying up to five times caused 72 failing parsoid... 
[20:46:55] (03CR) 10EBernhardson: "one option here would be to do an end run around the job queue abandoned jobs system, basically treating all failed jobs the same as we for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [21:00:37] (03PS1) 10Dzahn: mailman: rm sodium remnants, hardcoded IP [puppet] - 10https://gerrit.wikimedia.org/r/244514 [21:03:27] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1713738 (10RobH) @Papaul, Have blanking panels been installed in all the open spots or did we need to order some? [21:04:04] (03PS2) 10Dzahn: mailman: rm sodium remnants, hardcoded IP [puppet] - 10https://gerrit.wikimedia.org/r/244514 [21:04:46] (03PS3) 10Dzahn: mailman: rm sodium remnants, hardcoded IP [puppet] - 10https://gerrit.wikimedia.org/r/244514 [21:07:53] (03PS1) 10Dzahn: admin: remove spamd from enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/244555 [21:08:28] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1713791 (10Papaul) I have covered most spots but not all. If we need to order more blanks, they have to be large enough to cover, for example, U29 to U48 in rack A5. I have only a few 1U blanks left. [21:09:29] (03CR) 10Dzahn: [C: 032] mailman: rm sodium remnants, hardcoded IP [puppet] - 10https://gerrit.wikimedia.org/r/244514 (owner: 10Dzahn) [21:16:12] (03CR) 10Siebrand: Rename Azerbaijani Wikisource project and namespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) (owner: 10Siebrand) [21:16:21] (03PS2) 10Siebrand: Rename Azerbaijani Wikisource project and namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) [21:16:38] bblack: yt?
[21:16:53] bblack: have a question that should be quick to answer [21:17:33] 6operations, 10Wikimedia-Mailing-lists: Decommission sodium - https://phabricator.wikimedia.org/T110142#1713842 (10Dzahn) I did all the things on the [[ https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission | Server Lifecycle ]] page, site.pp , DHCP/netinstall, DNS but keep mgmt, puppet... [21:18:01] 6operations, 10ops-eqiad: Decommission sodium - https://phabricator.wikimedia.org/T110142#1713843 (10Dzahn) [21:18:42] 6operations, 10ops-eqiad: Decommission sodium - https://phabricator.wikimedia.org/T110142#1713845 (10Dzahn) a:5Dzahn>3Cmjohnson @cmjohnson ok if i assign to you directly, per "ops-eqiad"? [21:21:03] mutante: yep! th [21:21:03] tx [21:21:49] cmjohnson1: ok, thanks:) [21:31:26] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1713886 (10RobH) [21:31:36] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1713887 (10RobH) [21:32:50] (03PS1) 10Aude: Explicitly set wmgMFNearby = false for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244591 (https://phabricator.wikimedia.org/T114869) [21:35:03] (03PS3) 10Siebrand: Rename Azerbaijani Wikisource project and namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) [22:05:20] (03PS2) 10Dzahn: mailman: queue monitoring, enable multi thresholds [puppet] - 10https://gerrit.wikimedia.org/r/244366 (https://phabricator.wikimedia.org/T114861) [22:05:46] (03PS3) 10Dzahn: mailman: queue monitoring, enable multi thresholds [puppet] - 10https://gerrit.wikimedia.org/r/244366 (https://phabricator.wikimedia.org/T114861) [22:05:54] 6operations: audit contractors sheet against cluster access - https://phabricator.wikimedia.org/T114430#1714005 (10RobH) line 34 of 226 processes so far (just keeping notes for when i task 
swap) [22:06:58] (03CR) 10Dzahn: [C: 032] mailman: queue monitoring, enable multi thresholds [puppet] - 10https://gerrit.wikimedia.org/r/244366 (https://phabricator.wikimedia.org/T114861) (owner: 10Dzahn) [22:12:45] 6operations, 5Patch-For-Review: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1714008 (10Dzahn) The script now takes separate limits for each of the 4 queues we monitor, in, out, bounces, virgin, which allows us to be more specific. Follow-up is finding the right values. [22:14:30] (03PS1) 10Madhuvishy: [WIP] analytics:Add cron that schedules camus imports for mediawiki data [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) [22:17:26] nuria: yes [22:18:09] 6operations, 5Patch-For-Review: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1714020 (10Dzahn) >>! In T114861#1708831, @JohnLewis wrote: > Needs per queue levels. > The issue is all bounces emails and all digest emails go out at once which easily is 500+ emails at an... [22:19:02] bblack: question, given that our attempts at counting users using last-access-cookie [22:19:14] bblack: did not work out so well due to bot traffic [22:19:26] bblack: we are thinking of using, as a cheap proxy for bots, [22:19:34] a flag in x-analytics [22:19:46] that is turned on if the request comes without any cookies [22:19:57] bblack: that is cookies=0 [22:20:15] bblack: so in varnish i would need to look at all cookies and if there is none set this flag [22:20:22] bblack: does that seem doable?
[22:20:35] (03PS2) 10BBlack: zero.inc.vcl: refactor/simplify [puppet] - 10https://gerrit.wikimedia.org/r/244458 [22:20:49] bblack: asking cause some trivial things in varnish are real hard (and the other way around) [22:21:21] nuria: that's pretty easy to do, but keep in mind it would also catch fresh/anon human clients too [22:21:50] (those who've been gone long enough to expire cookies, or cmd+shift+N for new anon window, or just installed a new browser/machine, or a Guest-account login, etc) [22:22:27] (03PS2) 10Dzahn: add 3 typo domains to parking [dns] - 10https://gerrit.wikimedia.org/r/244367 (https://phabricator.wikimedia.org/T114922) [22:22:40] what about the last-access cookie didn't work due to bot traffic? it seems like if the bots don't use cookies they wouldn't show up there, and if they do they fail the no-cookies check? [22:23:46] bblack: it is a bit more complex https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution#How_will_we_be_counting:_Plain_English [22:24:07] bblack: we count for a day requests without cookie to say hey "a new user" [22:24:07] (03PS3) 10Dzahn: add 3 typo domains to parking [dns] - 10https://gerrit.wikimedia.org/r/244367 (https://phabricator.wikimedia.org/T114922) [22:24:19] bblack: thus bots are being counted every time [22:24:40] (03CR) 10Dzahn: [C: 032] add 3 typo domains to parking [dns] - 10https://gerrit.wikimedia.org/r/244367 (https://phabricator.wikimedia.org/T114922) (owner: 10Dzahn) [22:24:42] bblack: on a day of traffic we can filter them ok and give ok estimates [22:24:50] bblack: on a month there is no way [22:25:31] bblack: and identifying bots well, either with heuristics or fancier feature based machine learning techniques is complex [22:25:47] bblack: thus the no cookie might give us a good proxy w/o much effort [22:25:54] bblack: makes sense? [22:30:38] nuria: no it still doesn't really make sense to me how it helps.
are you counting unique IPs from each set (the ones missing all cookies, and the ones missing last-access at least initially), and then diffing that or something? [22:31:08] bblack: no, we will not count any traffic that came without cookies and compare our counts [22:31:22] 6operations, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1714051 (10Dzahn) `for domain in wilkipedia wikidpedia wikipedial; do dig $domain.org @ns0.wikimedia.org | grep -A1 "AUTHORITY SECTION"; done` ``` ;; AUTHORITY SEC... [22:31:55] bblack: with the ones we had prior, thus estimating the number of requests that are from either bots or fresh sessions [22:31:56] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1714052 (10Dzahn) [22:31:58] the ones without cookies are the ones without the last-access cookie too [22:32:10] I guess what I mean is: if they're missing either (for humans) they're likely missing both [22:33:16] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709675 (10Dzahn) Hi, @VBaranetsky all three domains have now been added on our side. All we need now to close this is to have MarkMonitor change the records for wilkipedia.org the s... [22:38:04] nuria: so you're comparing no-cookies-at-all counts to lacks-last-access-cookie counts? [22:38:06] bblack: on cookies - but people with the Last access cookie set to null need to be counted. If you are coming to the site for the first time, say after your Last access cookie expired, you won't have the cookie set, but you are new for the day, and we want to count you as a unique. In the subsequent requests, you'll have the cookie set to today's date and we will
Now, if we know that you have no other cookies set either, we want to ignore your requests. [22:39:04] it's still not really exactly what you want. there's no difference from any cookie perspective between a bot and a fresh user [22:39:37] bblack: yeah, i realized that we will miss out on fresh users [22:40:00] bblack: yes, but we will miss ONLY fresh users that way [22:40:07] if you're missing fresh users anyways, why not just ignore all users that don't have the last-access cookie set, and count the ones that do? [22:40:20] bblack: we have done that [22:40:42] RoanKattouw: ostriches: rmoen: Krenair: can I add a few cherry-picked CentralNotice changes for the SWAT deploy? [22:40:46] bblack: and our numbers are very low [22:41:09] bblack: and given our access patterns it is not super surprising [22:41:35] nuria: but I think bblack is saying that we will have the same results by doing this [22:42:18] bblack: i do not think that is right cause number of fresh users != number of users that visit the site once in a given day [22:42:32] AndyRussG: Sure [22:42:35] yeah I guess that's the question: what's the final number you care about? [22:43:05] bblack: well, we want to get the best number we can with this methodology and flagging "no-cookies" requests [22:43:18] bblack: allows us to further refine the counting [22:43:20] RoanKattouw: amazing, thanks much!! I'll just give you a change on CN wmf_deploy branch, like last time, if it's OK :) [22:43:24] is it unique humans per day? [22:43:26] number of users(devices really) that visit the site at least once in a given day [22:43:26] Sure [22:43:36] bblack: no, it is devices per day or month [22:43:37] bblack: yes, and monthly [22:43:41] AndyRussG: Do you guys have a plan to migrate to the wmf/* style at some point? [22:43:44] madhuvishy: not humans though [22:43:48] madhuvishy: devices [22:43:56] nuria: yup i mentioned that :) [22:44:03] RoanKattouw: sounds like something for January... maybe?
[22:44:07] and "device" is determined by unique client-ip + UA, or? [22:44:32] uniqueness is determined by absence or presence of a cookie on the request [22:44:39] ok [22:44:47] there is no ua or ip counting at all [22:45:10] so basically anytime you get a request that doesn't have today's last-access value (yesterday, or none at all), that single request counts as "1x unique device today" (before running into the bots problem) [22:45:23] exactly [22:45:25] exactly [22:45:27] haha [22:45:29] :) [22:45:29] ay [22:46:03] bblack: thus, without "no cookies at all" traffic we will be UNDERestimating [22:46:09] bblack: that is known [22:46:15] AndyRussG, fyi, it's not generally a problem to be adding items at this time, given that the window doesn't start for another 15 minutes [22:46:26] well there's no winning on the fresh users problem. You could do a count of "unique devices that visited us 2 or more times today" that excluded non-cookie-ing bots, by doing some kind of echo through the cookie stuff on the first hit and setting last access on the second, basically... [22:46:30] I would ask if it's already started of course [22:46:39] Krenair: heh ok!!! [22:46:58] but fresh users is not an insignificant problem. we probably have a *lot* of our traffic coming from users that only hit us rarely. [22:47:08] they click a wiki link from a google search at most 1-2 times a day [22:47:13] bblack: so let's know that for real [22:47:24] well we wouldn't know, we're just ignoring them as if they're bots.
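The per-request rule bblack summarizes at 22:45:10 can be sketched in a few lines of Python. This is an illustration only, not the production analytics code; the cookie name `WMF-Last-Access`, the ISO date format, and the return labels are assumptions made for the sketch:

```python
from datetime import date

def classify(cookies, today=None):
    """Classify one request per the rule discussed above.

    'unique'   - last-access cookie absent or stale: counts as 1x unique device today
    'repeat'   - last-access cookie already holds today's date: ignored
    'nocookie' - no cookies at all: flagged as a likely bot or fresh user
    """
    today = today or date.today().isoformat()
    if not cookies:
        return "nocookie"   # the "flag in X-Analytics" case being proposed
    if cookies.get("WMF-Last-Access") == today:
        return "repeat"     # already counted once today
    return "unique"         # stale or missing value: count this device once

# Example requests on a hypothetical day:
classify({}, "2015-10-08")                                  # 'nocookie'
classify({"WMF-Last-Access": "2015-10-07"}, "2015-10-08")   # 'unique'
classify({"WMF-Last-Access": "2015-10-08"}, "2015-10-08")   # 'repeat'
```

Note how this makes the bot problem concrete: a client that never sends cookies hits the first branch on every request, which is why cookie-less traffic has to be flagged and discounted rather than counted.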
[22:47:55] 6operations: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1714084 (10Dzahn) The following are covered already: - lists - otrs [22:47:58] bblack: you are right that not counting fresh users might have quite an impact but i would like to have the numbers to quantify it [22:48:22] bblack: cause for monthly users might be quite a big win [22:48:32] in the counting yes, but we might be able to estimate it using some ip/UA and corresponding number of requests based analysis [22:48:56] what I'm saying is you wouldn't even get the numbers to quantify it. What you're asking for would give you two counts: "devices that definitely hit us uniquely today" and "number of total requests today from either bots or one-hit users" [22:49:42] but #requests isn't going to correlate with #devices to begin with, and even then the 2nd bin's mix of the two categories is unknown. [22:50:22] bblack: no, i do not think that is correct [22:50:32] (03CR) 10BBlack: [C: 032] zero.inc.vcl: refactor/simplify [puppet] - 10https://gerrit.wikimedia.org/r/244458 (owner: 10BBlack) [22:50:55] bblack: in a day timeperiod a "unique" (let's not say device) is a person that hits our site without a cookie for that day [22:51:35] bblack: in a month timeperiod a "unique" is a person that hits our site without a valid cookie for the month (these second cookie expires in 30 days) [22:52:28] bblack: if a request comes without any cookie at all it's a fresh user (or a bot) and will get discounted from the unique bucket [22:52:38] there are two cookies? 
[22:53:59] bblack: in the implementation there is only one, with an expire of 30 days that gets reset every day [22:54:24] (03PS1) 10Ori.livneh: Make the metric path of the TCP connection diamond collector simpler [puppet] - 10https://gerrit.wikimedia.org/r/244608 [22:54:37] (03PS2) 10Ori.livneh: Make the metric path of the TCP connection diamond collector simpler [puppet] - 10https://gerrit.wikimedia.org/r/244608 [22:54:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Make the metric path of the TCP connection diamond collector simpler [puppet] - 10https://gerrit.wikimedia.org/r/244608 (owner: 10Ori.livneh) [22:55:00] RoanKattouw: ostriches: rmoen: Krenair: aaarg I'm having trouble cherry-picking 'cause it's too conflicty, I'm just going to merge in the whole CN master branch. I think it'll be safe enuf [22:55:10] RoanKattouw: ostriches: rmoen: Krenair: did I mention it needs a scap? [22:55:16] OK [22:55:28] bblack: it's not intuitive at all, that is why the lengthy explanation on wikitech [22:55:31] You're updating the deployment branch to master? [22:56:02] AndyRussG: When was that branch last cut? Tuesday? [22:56:20] RoanKattouw: for core? Dunno... [22:56:48] I mean the wmf_deployment branch in CN, when was that last updated to master? [22:57:05] If your thing is very conflicty, that suggests it might have been a while ago [22:57:20] nuria: we should get rid of these cookies [22:57:32] ori: why?
[22:58:02] ori: we will if this last change we want to make doesn't give us helpful data [22:58:05] nuria: I know how the existing implementation works, I wrote it, but wikitech doesn't really go beyond that or explain the convo above well [22:58:23] anyways, I'll bbl, maybe we should discuss it in a phab ticket or something [22:58:34] because tracking and counting and cookie-ing are inherently creepy, unsavory things to be doing, and the only conceivable justification for doing them is in the service of a very clear need [22:58:56] bblack: https://phabricator.wikimedia.org/T114370 there is one here [22:58:57] ori: this doesn't track [22:59:01] at all [22:59:24] ori: it just has a date when you last access [22:59:36] ori: when you last accessed wikipedia [23:00:01] i'm not saying it's some catastrophic ethical violation, but it's a lot of complex, difficult to maintain, perplexing code [23:00:03] ori: doesn't identify a user in the least [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151008T2300). [23:00:04] jdlrobson jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] it'd be one thing if it provided some very clear signal value, but it doesn't [23:00:20] you're trying to salvage some signal from it at this point [23:00:46] RoanKattouw: it was not toooo long ago, but the last time there was some cherrypicking involved [23:00:52] (03PS1) 10Dzahn: wikitech: add SSL cert expiry monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) [23:01:01] ori: Agreed if the changes to flag requests without cookies at all do not work, if so, we shall remove it [23:01:14] RoanKattouw: ostriches: rmoen: Krenair: just a few minutes to re-check my merge? 
[23:01:27] I'm doing jdlrobson's patches now, I can wait for yours [23:01:44] ori: privacy wise we thought about this a LOT so it couldn't be used to track users in any way [23:01:46] it's not going to work, it's too complicated [23:02:11] ori: let's use the cheap proxy of requests without cookies and see [23:02:59] ori: if that is easy to set up in varnish it is worth checking whether it provides value cause might give us a better estimate than the one we have now about our monthly robot traffic [23:03:00] \o [23:03:31] but it's not easy to set up [23:03:53] ori: to identify in varnish whether you have any cookie set ? [23:04:05] ori: ay... i thought bblack said it was doable [23:04:05] "easy to set up" is not just a matter of how many lines of code you need to write to get something working right here and now [23:04:45] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1714107 (10GWicke) I just did an experiment with a page that reliably fails in Parsoid: `curl https://ia.wikib... [23:04:56] ori: true, what else do you think is involved in this case?
[23:05:10] ori: wait, where is the complication, the update to our existing code (in the hive side) is minimal [23:05:14] you should factor into that assessment the portability of the implementation and the likelihood that it will break due to changes elsewhere [23:05:18] bbiab, meeting [23:05:28] ori: k [23:05:47] (03PS1) 10Dzahn: icinga: add ssl cert expiry for icinga itself [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) [23:06:12] (03PS2) 10Dzahn: icinga: add cert expiry check for icinga itself [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) [23:06:17] ori: hmmm, afaik, it'll just be a boolean field added to the X-Analytics header, i don't know much varnish but in theory i'm not seeing what else it'll break [23:06:39] !log deployed kartotherian [23:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:44] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/244613/ [23:08:08] I'll just add it to the deployments page then!!! :) thanks again!!! [23:08:36] bblack: is your concern varnish code or another one? 
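The "boolean field added to the X-Analytics header" mentioned at 23:06:17 might look roughly like this in VCL. This is a sketch, not the deployed Wikimedia code: the `nocookies` key name and the placement in `vcl_recv` are assumptions, and the real caches do additional X-Analytics plumbing:

```vcl
# Sketch: flag requests that arrive with no Cookie header at all, so the
# analytics pipeline can discount them as likely bots or first-time visitors.
sub vcl_recv {
    if (!req.http.Cookie) {
        if (req.http.X-Analytics) {
            # Append to an existing X-Analytics key=value list
            set req.http.X-Analytics = req.http.X-Analytics + ";nocookies=1";
        } else {
            set req.http.X-Analytics = "nocookies=1";
        }
    }
}
```

The appeal is that this is a pure request-side annotation: no cookie is set and no client state is created, so it sidesteps the privacy concerns raised above while still letting the Hive-side counting job filter the flagged requests.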
[23:08:46] (03PS1) 10Dzahn: dumps: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) [23:09:07] bblack: cause we worked hard to make sure these changes have no privacy implications [23:09:09] my concern is mostly that I don't understand how this will work or solve the problem, and we already have a thing that also doesn't solve the problem :) [23:09:45] maybe that's just me misunderstanding, but if so clearly an IRC conversation isn't enough heh [23:10:26] AndyRussG: OK, there are still diffs between master and wmf_deploy, but they're all blank lines or comments [23:10:28] (03CR) 10Nuria: [C: 031] [WIP] analytics:Add cron that schedules camus imports for mediawiki data [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: 10Madhuvishy) [23:10:41] So I'm satisfied that wmf_deploy is at least functionally the same as master [23:10:51] RoanKattouw: hum! that's strange! [23:10:55] (03PS1) 10Dzahn: gerrit: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) [23:11:09] AndyRussG: cd CentralNotice; git fetch; git diff origin/master..origin/wmf_deploy [23:11:14] to see what I'm talking about [23:11:17] (the .gitreview diff is legitimate) [23:12:44] RoanKattouw: aarg, I should've seen that... K, yeah it should work. I guess after we can add a patch to master so that we don't get merge conflicts next time, no? [23:12:52] Yup [23:14:00] RoanKattouw: ping me when you need me to test :) [23:14:13] Sorry, still waiting on some git slowness [23:14:20] Somehow CN updates are no longer automagic so I need to create them manually [23:14:43] RoanKattouw: Oh! huh, sorry... Maybe it's just slowness there too?
[23:15:04] Nah I don't think so [23:15:17] The slowness I was referring to was waiting for git pull to finish on a 6-ish-week-old checkout [23:15:32] Aaahhh right [23:15:34] I haven't had to mess with wmf/* manually for a while because so much of it is automatic now [23:16:15] RoanKattouw: do you want me to prepare the core patches then? I trust you much more than me... [23:16:44] I just did them [23:17:08] Now I just have to wait for them to both go through Jenkins [23:17:09] K yeah that's like 20x faster than me, in addition to more trustworthy 8p [23:17:11] And then I'll scap [23:17:17] The scap itself will take a while too [23:17:55] RoanKattouw: amazing!! yeah I know... should I await a ping when it's live? [23:18:00] Yup [23:18:31] RoanKattouw: K fantastic, thanks again!!!!!!! [23:18:43] bblack: i see, yes, it is not easy to explain in irc [23:19:12] bblack: let's talk about it briefly in hangout? doesn't have to be today [23:19:37] just make a ticket, it will be simpler because we can paste code and talk in specific terms about implementation [23:20:05] bblack: ticket here: https://phabricator.wikimedia.org/T114370 [23:20:28] ok [23:20:54] bblack: let me know if you want more detail.. etc [23:20:57] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 10Traffic: Flag in x-analytics in varnish any request that comes with no cookies whatsoever - https://phabricator.wikimedia.org/T114370#1714113 (10BBlack) [23:21:04] 6operations, 5Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1714117 (10Dzahn) The problem here is: ``` # http on trebuchet deployment servers, for serving actual files to deploy +&R_SERVICE(tcp, 80, 208.80.152.0/22 2620:0:860::/46 10.64.0.0/22 2620:0:861:101::/64 10.64.16...
[23:26:55] OK, here we go, scapping now [23:27:38] !log catrope@tin Started scap: SWAT [23:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:20] 7Blocked-on-Operations, 3Discovery-Maps-Sprint, 7service-runner: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1714121 (10Yurik) 3NEW [23:33:56] (03PS1) 10Dzahn: deployment: fix firewalling for mira [puppet] - 10https://gerrit.wikimedia.org/r/244622 (https://phabricator.wikimedia.org/T113351) [23:36:21] (03CR) 10Dzahn: "# http on trebuchet deployment servers, for serving actual files to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/244622 (https://phabricator.wikimedia.org/T113351) (owner: 10Dzahn) [23:38:05] (03CR) 10Dzahn: [C: 032] "~# /etc/init.d/ferm start" [puppet] - 10https://gerrit.wikimedia.org/r/244622 (https://phabricator.wikimedia.org/T113351) (owner: 10Dzahn) [23:40:24] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1714138 (10Mooeypoo) I can't reproduce this now. Clicking notification about a flow topic leads me to the mobile version of F... [23:46:15] RoanKattouw: still scapping? [23:48:44] (03PS1) 10Dzahn: deployment: fix firewalling for mira pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244624 (https://phabricator.wikimedia.org/T113351) [23:49:21] (03CR) 10jenkins-bot: [V: 04-1] deployment: fix firewalling for mira pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244624 (https://phabricator.wikimedia.org/T113351) (owner: 10Dzahn) [23:50:33] who knows how to view service status log? 
service status shows some errors, but i can't find a way to view more [23:50:51] (03PS2) 10Dzahn: deployment: fix firewalling for mira pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244624 (https://phabricator.wikimedia.org/T113351) [23:50:56] have a deploy crashing bug, had to revert [23:51:24] !log reverted Kartotherian to HEAD^^ - the service wouldn't start [23:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:00] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1714160 (10GWicke) @chasemp: Ohh, forgot about this. Thanks for the data & the link. Is there a way to query this information in an ongoing manner? [23:56:02] yurik: if we follow the same sudo rules as for other services it should be "journalctl -u kartotherian" but it looks that is missing from your admin group [23:56:48] !log catrope@tin Finished scap: SWAT (duration: 29m 10s) [23:56:53] mutante, could you add that please? CC: akosiaris [23:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:17] yurik: yea, but we still need a ticket for it [23:57:44] mutante, https://phabricator.wikimedia.org/T115067 [23:58:49] yurik@maps-test2001:~/kartotherian$ journalctl -u kartotherian [23:58:49] No journal files were found. [23:59:13] i would have expected a perm error [23:59:14] yurik: ok, i was about to say i'll upload a patch...like this: [23:59:18] 'ALL = NOPASSWD: /bin/journalctl -u nodepool', [23:59:25] in the nodepool group
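Following the pattern mutante quotes at 23:59:18 for the nodepool group, the fix would presumably add a `journalctl` sudo privilege to the admin group that covers yurik. A sketch of what that admin-data entry might look like (the group name, gid, and member list here are hypothetical, modeled on the quoted nodepool rule; the actual group and file layout live in the puppet admin module):

```yaml
# Sketch (hypothetical group): allow kartotherian admins to read the
# service's systemd journal without a password prompt.
kartotherian-admins:
  gid: 799                      # hypothetical gid
  description: admins for the kartotherian map tile service
  members: [yurik]              # hypothetical membership
  privileges:
    - 'ALL = NOPASSWD: /bin/journalctl -u kartotherian'
```

Note that yurik's `No journal files were found` error (rather than a permission error) suggests the journal was not persisted to disk on that host, which is a separate issue from the missing sudo rule.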