[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151111T0000). Please do the needful. [00:00:08] (03PS3) 10Dzahn: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:00:29] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:01:15] (03CR) 10jenkins-bot: [V: 04-1] Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:04:56] (03PS4) 10Dzahn: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:05:00] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail [00:05:06] anyone doing swat deploy? i can i suppose [00:05:20] ebernhardson: train is still running [00:05:30] ahh, ok [00:06:03] it's not even syncing apache nodes yet, so ~30 minutes remain [00:06:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 22 data above and 7 below the confidence bounds [00:06:21] (03CR) 10jenkins-bot: [V: 04-1] Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:07:18] (03CR) 10Dzahn: "amended to: respond to ori's request to fix variable names, manually rebase and add lint fixes" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:08:50] (03PS5) 10Dzahn: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:10:28] (03PS1) 10Thcipriani: Move scap-specific items out of mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) [00:13:55] 7Puppet, 10Deployment-Systems, 5Patch-For-Review, 3Scap3: Refactor 
`mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1798095 (10thcipriani) Patch above is some work that I had ~80% complete before this task moved to in-progress. @joe... [00:14:08] (03PS2) 10BryanDavis: Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) [00:14:13] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1798099 (10GWicke) [00:15:46] twentyafterfour: how we doing re train? [00:15:52] ah, still running, nvm [00:15:54] :/ [00:17:17] greg-g: All my fault (with some help from my friends) [00:17:39] we changed shit and didn't make sure twentyafterfour knew how it might all go sideways [00:17:39] there's a sitcom that reminds me of, but I can't place it [00:17:44] * greg-g nods [00:17:50] shame shame [00:19:27] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/1230/iridium.eqiad.wmnet/change.iridium.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:21:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 23 data above and 7 below the confidence bounds [00:22:13] (03PS1) 10MaxSem: Switch www.wikipedia.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252364 [00:22:25] (03PS6) 10Dzahn: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:24:23] (03PS2) 10MaxSem: Switch www.wikipedia.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252364 [00:26:26] (03CR) 10Dzahn: [C: 04-1] "one error fixed, different error now" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:26:41] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly 
detected: 21 data above and 6 below the confidence bounds [00:30:14] (03PS1) 10MaxSem: Switch remaining portals to Git [puppet] - 10https://gerrit.wikimedia.org/r/252366 [00:30:16] (03CR) 10Jhobs: [C: 031] First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [00:31:19] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:31:37] the thing is, it's still going to go sideways next time ... [00:38:03] !log twentyafterfour@tin Finished scap: grr: sync 1.27.0-wmf.6 (duration: 55m 05s) [00:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 7 below the confidence bounds [00:39:46] still need to do the wikiversions update? [00:40:02] greg-g: yes that was testwiki [00:40:09] * greg-g nods [00:40:25] * greg-g sees it now [00:42:57] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252371 [00:47:10] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [00:48:00] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 34.77 ms [00:49:28] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252371 (owner: 1020after4) [00:49:48] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252371 (owner: 1020after4) [00:50:08] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.6 [00:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:50:57] RoanKattouw: I was never pinged by a bot and there's about 10 minutes left in the window, 
is everything still fine with deployment of 251133? [00:52:26] (03CR) 10Dzahn: "one error fixed, different error now" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [00:53:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 6 below the confidence bounds [00:53:20] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1798164 (10Dzahn) a:3Dzahn [00:53:42] jhobs: From what I heard the 11am (Pacific Time) deployment is massively delayed and still not finished [00:53:59] jhobs: So at the least the SWAT will have to wait, and perhaps it'll be canceled entirely [00:54:03] That's up to twentyafterfour [00:54:06] and greg-g [00:54:36] given the timing, unfortunately it'll be postponed [00:54:41] unless it's an emergency? [00:54:53] train just finished [00:55:36] greg-g: not an emergency, but we're planning to deploy a survey to enwiki on Thursday and the patch for this SWAT was to deploy to testwiki to give a couple days to be certain everything's fine [00:55:44] twentyafterfour: but, future scaps are screwed? or? [00:55:56] jhobs: yeah, understood. checking [00:56:02] were there any changes re API rate limiting recently? [00:56:13] gwicke: I think so [00:56:21] Maybe ops side, rather than MW side [00:56:23] I'm suddenly getting lots of 429 errors when running restbase tests [00:56:49] looks like potentially serious breakage if I can trigger this by running tests [00:58:33] !log added jenkins_1.625.1 to APT repo. 
precise-wikimedia [00:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:49] greg-g: creating a new branch is all messed up [00:58:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [00:58:53] scap isn't broken [00:59:08] twentyafterfour: ahhhh, ok [01:00:14] twentyafterfour: was the chgrp all you had to manually do once you figured it out? [01:00:28] or was it worse than that? [01:00:30] bd808: no [01:01:01] * bd808 remembers umask was busted too [01:01:04] I did chmod -R g+w $VERSION and ori had to do the chgrp for me [01:01:12] yeah that's no good [01:01:20] we can roll back the sudo patch [01:01:31] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1798168 (10Dzahn) wget http://pkg.jenkins-ci.org/debian-stable/binary/jenkins_1.625.1_all.deb .. [carbon:/srv/wik... [01:01:37] or add sudoers rules to set the group... 
[01:01:54] really we need to figure out how to ignore the mtime failures on rsync [01:01:57] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1798169 (10Dzahn) 5Open>3Resolved [01:02:00] that would be nicest [01:02:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [01:04:22] greg-g: I know you guys are probably super busy right now trying to patch everything up, but when you get a sec, I just need to know whether to reschedule my SWAT patch to tomorrow morning [01:04:25] author of gitblit: "I was not expecting users to create a repo and a subdirectory with the same name." [01:04:30] of course we did though :) [01:04:47] and that breaks things, heh [01:04:48] ^5 [01:04:55] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1798176 (10Ottomata) @ejegg wants to be in the `analytics-privatedata-users`group. [01:05:41] bd808: https://gerrit.wikimedia.org/r/#/c/251133/ could prob use another set of eyes with JonR out btw if you have time to review [01:06:04] bd808: it's just the testwiki enable for QS we talked about yesterday [01:06:13] Reedy: proposed solution: "could be fixed today in that installation by renaming the mediawiki/extensions repo" hehehe [01:08:10] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [01:08:42] 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1798180 (10Dzahn) >>! In T118156#1793885, @Paladox wrote: > And could be fixed with ! in the config I think. Unfortunat... 
[01:09:48] 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1798183 (10Dzahn) It's interesting though.. the "I was not expecting users to create a repo and a subdirectory with the s... [01:10:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [01:12:12] jhobs: the google form that points at says it is closed. other than that it looks ok (as far as I understand QS) [01:12:30] bd808: that is by intention. It will be opened on Thursday before it deploys to enwiki [01:15:58] (03CR) 10Krinkle: Switch www.wikimedia.org to source control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [01:16:43] (03CR) 10MaxSem: Switch www.wikimedia.org to source control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [01:17:55] (03CR) 10Krinkle: Switch www.wikimedia.org to source control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [01:18:59] (03CR) 10MaxSem: Switch www.wikimedia.org to source control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [01:19:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [01:21:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [01:22:54] (03PS4) 10MaxSem: Switch www.wikimedia.org to source control [puppet] - 
10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) [01:24:15] 6operations, 6Labs, 10Tool-Labs, 7Mail: Offer a solution to manage @toolserver.org mail redirections - https://phabricator.wikimedia.org/T116373#1798202 (10Dzahn) [01:25:27] RoanKattouw greg-g twentyafterfour: sorry for the mass ping, but it's almost 8:30pm here and I still haven't gotten a solid answer. Is the Evening SWAT still happening and, if not, is there anything you need me to do to reschedule my patch for tomorrow morning, or will that happen automatically? [01:26:30] Someone would usually have to move it [01:26:33] Or you can do it yourself [01:26:38] I would assume that it's canceled, and that you should manually move your patch to tomorrow's "morning" window [01:26:45] Ok, thank you. [01:27:02] have a good evening folks o/ [01:29:03] 6operations, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1798210 (10Dzahn) >>! In T106937#1485653, @chasemp wrote: > @mark is this worthy of a catchpoint alert? It seems like it may be a good extern... [01:30:42] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798211 (10Reedy) 3NEW [01:30:49] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [01:31:52] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798218 (10Reedy) p:5Triage>3High [01:34:55] MaxSem: Would be nice to see those apache configs in beta first. Doesn't feel tested right now. E.g. the pattern match being different etc. 
[01:36:06] Krinkle, was thoroughly tested on http://www.wikipedia.beta.wmflabs.org/ (before rate limiting broke the hell outta it) [01:38:05] MaxSem: Not for www.wikimedia.org though [01:38:10] and the apache configs don't match [01:38:18] Between what's on beta and in your commit [01:39:34] https://gerrit.wikimedia.org/r/#/c/248374/7/modules/mediawiki/files/apache/beta/sites/wikipedia.conf vs. https://gerrit.wikimedia.org/r/#/c/252364/2/modules/mediawiki/files/apache/sites/wwwportals.conf [01:40:50] MaxSem: OK. That matches now. Then I only have the Cache-control issue remaining [01:41:05] it's 3600 now in prod [01:41:10] not that it matters [01:41:14] it seems a regression for the html. But I see now a more severe issue which is that it enforces it for all of portal/* assets as well [01:41:23] That should not be needed at all. [01:41:32] Does E-Tag/304 not work? [01:41:39] no [01:42:12] etag/last-modified is for browsers pulling changes from varnish [01:42:44] varnish will however cache everything for 30 days if you don't set a smaxage explicitly [01:43:39] this header is pulled verbatim from extract2.php, and for a reason [01:44:10] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1798227 (10Dzahn) looks like it has been reinstalled but is not pooled yet [01:44:41] MaxSem: for the html yeah, [01:44:45] but not for the assets [01:45:20] Giving browsers max-age=0 for assets will likely degrade performance [01:46:02] (03CR) 10GWicke: "It looks like this broke a couple of things:" [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [01:48:38] !log repooled mw1083 (T116184) [01:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:48:49] Krinkle, unlike the wikis, portals are just one single page each, and users aren't supposed to return to them [01:49:09] I'm not going to explain this now. 
I can file a task later with more details if you need it. [01:49:16] Closing up for the day. [01:49:26] sweet dreas Krinkle [01:49:29] *dreams [01:49:45] It's not a huge issue. [01:49:59] 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1798231 (10Dzahn) [01:50:00] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1798228 (10Dzahn) 5Open>3Resolved a:3Dzahn repooled in pybal. getting traffic again. [01:50:32] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798233 (10GWicke) Other victims are RESTBase integration tests, from either the office or travis. About 40 API requests distributed across a ~50 second test run. [01:50:35] 6operations, 10ops-eqiad: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1798234 (10Dzahn) [01:51:51] (03CR) 10GWicke: "See https://phabricator.wikimedia.org/T118362." [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [01:52:49] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1798235 (10Dzahn) a:3Dzahn [02:02:13] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798242 (10Dzahn) Maybe oauth from labs is affected. Getting reports that login screens like: https://tools.wmflabs.org/wikidata-game/ using the "widar" tool... [02:03:28] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798243 (10GWicke) In case it helps with debugging, there seems to have been a multi-hour delay between the deploy & blocking starting in the office. 
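Editor's note on the Cache-control exchange at 01:40-01:49 above: a minimal Python sketch of the distinction Krinkle draws between the portal HTML and the portal/* assets. The 3600-second shared-cache lifetime comes from the log; the exact header strings, the asset lifetime of 86400 seconds, and the function name `cache_control_for` are assumptions made for illustration and are not taken from extract2.php or from the Apache config under review.

```python
# Assumed policies, for illustration only.
# HTML: shared caches (Varnish) keep it for an hour, browsers always revalidate.
HTML_POLICY = "s-maxage=3600, must-revalidate, max-age=0"
# Assets: hypothetical value that lets browsers reuse the file for a day.
ASSET_POLICY = "s-maxage=3600, max-age=86400"


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value based on whether the path is a page or an asset."""
    if path.endswith("/") or path.endswith(".html"):
        return HTML_POLICY
    return ASSET_POLICY


for example in ("/", "/index.html", "/portal/assets/logo.png"):
    print(example, "->", cache_control_for(example))
```

Setting s-maxage explicitly in both policies reflects MaxSem's point that Varnish otherwise falls back to its own default lifetime; keeping max-age above zero only for assets reflects Krinkle's concern about browser performance.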
[02:04:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [02:08:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:08:36] (03PS1) 10Faidon Liambotis: Revert "varnish: misspass limiter" [puppet] - 10https://gerrit.wikimedia.org/r/252385 (https://phabricator.wikimedia.org/T118362) [02:08:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [02:13:27] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798250 (10Dzahn) This works again http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page after Faidon restarted varnish in beta. [02:15:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [02:26:24] 6operations, 10ops-codfw: return pollux to spares - https://phabricator.wikimedia.org/T117423#1798286 (10Dzahn) [02:28:14] 6operations: build newer tor packages - https://phabricator.wikimedia.org/T116964#1798289 (10Dzahn) a:5Dzahn>3faidon already done by Faidon. 
http://apt.wikimedia.org/wikimedia/pool/thirdparty/t/tor/ ii tor 0.2.6.10-1~d80.jessie+1 [02:28:20] 6operations: build newer tor packages - https://phabricator.wikimedia.org/T116964#1798292 (10Dzahn) 5Open>3Resolved [02:28:21] 6operations: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1798293 (10Dzahn) [02:36:14] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 07m 01s) [02:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:40] 6operations, 10Traffic: Increase request limits for /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1798302 (10GWicke) 3NEW [02:41:56] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1798310 (10GWicke) [02:43:25] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1798302 (10GWicke) [02:48:55] (03PS1) 10Dzahn: zookeeper,dynamicproxy: double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/252389 [02:48:57] (03PS1) 10Dzahn: role/lvs: double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/252390 [02:48:59] (03PS1) 10Dzahn: ircbalance: double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/252391 [02:49:01] (03PS1) 10Dzahn: role/lvs/balancer: double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/252392 [02:51:22] !log krinkle@tin Synchronized php-1.27.0-wmf.5/includes/Title.php: Deploy I47517021471 which was merged earlier today but forgotten (duration: 01m 58s) [02:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [02:52:30] (03PS3) 10Dzahn: Re add some regex.global. 
code back for gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250444 (owner: 10Paladox) [02:52:40] (03CR) 10Dzahn: [C: 032] Re add some regex.global. code back for gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250444 (owner: 10Paladox) [02:53:14] (03CR) 10Dzahn: tune gitblit settings to improve performance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [02:56:02] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Varnish rate limiting has broken beta - https://phabricator.wikimedia.org/T118362#1798315 (10GWicke) 5Open>3Resolved a:3GWicke Confirmed resolved for RESTBase tests as well. Most of those are actually hitting labs... [03:14:45] (03CR) 10Chad: [C: 031] "Can be merged whenever, cleanup happens after merge." [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [03:17:29] (03CR) 10Chad: "Tags are not disabled, this just controls how many branches/tags are shown on the summary page. Are they really needed there?" 
[puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [03:19:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [03:21:50] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [03:23:07] !log krinkle@tin Synchronized php-1.27.0-wmf.6/includes/cache/MessageBlobStore.php: Logging for T93800 (duration: 00m 31s) [03:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:33:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [03:35:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [03:42:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [03:55:29] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [04:17:20] (03PS2) 10Chad: Remove 3 old wmgMonologChannels related to closed bugs/tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 (owner: 10Reedy) [04:17:42] Reedy: Going forth ^ [04:18:11] (03CR) 10Chad: [C: 032] Remove 3 old wmgMonologChannels related to closed bugs/tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 (owner: 10Reedy) [04:18:31] (03Merged) 10jenkins-bot: Remove 3 old wmgMonologChannels related to closed bugs/tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 (owner: 10Reedy) [04:19:24] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: rm some old log channels 
(duration: 00m 31s) [04:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:22:00] ori: I wonder how many settings we have that are the same as mw core defaults and we're needlessly overriding in mw-config. [04:22:09] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [04:22:13] (on the subject of trying to optimize our code there since it's hot) [04:22:31] ostriches: hah, it's funny that you ask. I don't think removing those would be a perf win, but it would certainly be a boon for sanity and maintainability [04:22:35] we could check that programmatically [04:23:13] in a loop, do the following: [04:23:28] * remove a line from config [04:23:31] * see if it broke [04:23:35] * revert [04:23:38] heh [04:23:41] not quite [04:23:49] match $wg[\w+] = [^;]+; [04:24:15] run mwscript eval.php foowiki <<<"var_dump $wgVariableName;" [04:24:17] remove the line [04:24:19] run mwscript eval.php foowiki <<<"var_dump $wgVariableName;" [04:24:21] compare outputs [04:24:32] etc. [04:24:46] That would still require lots of manual checking because there are plenty of if ( $wgDBname = ... ) blocks [04:24:54] but that should at least produce a shortlist [04:25:20] I've actually got an idea for a script that could do it. [04:25:55] Grab the list of $GLOBALS starting with $wg, then load up DefaultSettings.php and see what's different. [04:26:07] Some false positives because of Setup.php, but easily ruled out. [04:27:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [04:44:29] ostriches: that's a far better idea [04:44:32] i like that [04:44:38] much simpler and more effective [04:44:55] Completely unrelated, but headline made me lol. 
http://www.theguardian.com/technology/2015/nov/10/betamax-dead-long-live-vhs-sony-end-prodution [05:17:34] (03PS1) 10Ori.livneh: Add a paging alert for Redis memory utilization [puppet] - 10https://gerrit.wikimedia.org/r/252396 (https://phabricator.wikimedia.org/T118331) [05:17:36] ostriches: I heard about the betamax thing after an NPR Marketplace segment on a Springfield, MO (cc twentyafterfour ) cassette tape duplicator/printer that sold more tapes this past year than ever before (10 million, I think). [05:18:29] Get them before they're gone? [05:19:44] apparently the company bought up all the machines from all the others that went out of business. No one makes the machines anymore so they have their own machine shop to service them. He sounded really legitimately confident. [05:22:46] to duplicate audio cassettes? [05:22:49] or video? [05:22:51] 6operations, 5Patch-For-Review: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1798403 (10ori) The patch above implements what you asked for, but I'm not sure it's a good idea, at least not by itself. Paging ops won't do much good if they don't know how what t... [05:23:03] twentyafterfour: audio [05:23:18] and print new ones (that part was slightly unclear, but pretty sure) [05:23:34] s/print/press/ # I guess? not sure what the lingo is :) [05:23:43] dub? 
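Editor's note on the config audit discussed at 04:22-04:44 above: a minimal Python sketch of the comparison ostriches proposes (take the effective $wg settings, compare them with the core defaults, and list the overrides that change nothing). The real check would run in PHP against $GLOBALS and DefaultSettings.php; here plain dictionaries stand in for both, and every setting name and the function `redundant_overrides` are hypothetical.

```python
def redundant_overrides(defaults, overrides):
    """Return the names of overrides whose value equals the core default."""
    return sorted(
        name
        for name, value in overrides.items()
        if name in defaults and defaults[name] == value
    )


# Hypothetical stand-ins for DefaultSettings.php and the site configuration.
core_defaults = {"wgUseFoo": False, "wgBarLimit": 50, "wgBazMode": "strict"}
site_overrides = {"wgUseFoo": False, "wgBarLimit": 100, "wgOnlyInConfig": True}

print(redundant_overrides(core_defaults, site_overrides))
```

As noted in the conversation, the result is only a shortlist: values adjusted during setup or set inside per-wiki conditionals would still need a manual look.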
[05:23:48] heh [06:17:09] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [06:18:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [06:28:10] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail [06:29:41] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [06:30:30] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:39] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:40] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:00] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:01] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail [06:31:59] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:00] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:10] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:10] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 4 failures [06:51:00] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [06:52:50] RECOVERY - mobileapps 
endpoints health on scb1001 is OK: All endpoints are healthy [06:56:41] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:56:49] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:58:00] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:11] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:48] (03PS1) 10Legoktm: [Planet Wikimedia] Use HTTPS for Legoktm's blog [puppet] - 10https://gerrit.wikimedia.org/r/252398 [07:11:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [07:13:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:27:00] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:11] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:20] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:27:29] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:27:30] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:27:30] RECOVERY - 
puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:28:09] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:20] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:30] (03PS1) 10Florianschmidtwelzow: REL1_26 knocks on the door [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252399 [07:42:04] (03PS3) 10Giuseppe Lavagetto: maintenance: reduce the lightprocess count from cli [puppet] - 10https://gerrit.wikimedia.org/r/252203 [07:43:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [07:45:02] (03CR) 10Legoktm: [C: 031] REL1_26 knocks on the door [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252399 (owner: 10Florianschmidtwelzow) [07:45:39] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:56:52] (03PS2) 10Yuvipanda: tools: make sure UseDNS is set for HBA [puppet] - 10https://gerrit.wikimedia.org/r/252258 (https://phabricator.wikimedia.org/T116687) (owner: 10Merlijn van Deen) [07:57:06] (03PS3) 10Yuvipanda: tools: make sure UseDNS is set for HBA [puppet] - 10https://gerrit.wikimedia.org/r/252258 (https://phabricator.wikimedia.org/T116687) (owner: 10Merlijn van Deen) [07:58:54] (03CR) 10Yuvipanda: [C: 032] tools: make sure UseDNS is set for HBA [puppet] - 10https://gerrit.wikimedia.org/r/252258 (https://phabricator.wikimedia.org/T116687) (owner: 10Merlijn van Deen) [08:02:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead 
responds with malformed body: NoneType object has no attribute __getitem__ [08:04:30] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:08:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [08:11:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [08:12:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [08:21:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0] [08:24:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [08:25:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [08:27:22] (03PS2) 10Muehlenhoff: Fix Hiera path for analytics::spark::standalone::worker role [puppet] - 10https://gerrit.wikimedia.org/r/250685 [08:27:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix Hiera path for analytics::spark::standalone::worker role [puppet] - 10https://gerrit.wikimedia.org/r/250685 (owner: 10Muehlenhoff) [08:31:33] 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1798539 (10saper) @aklapper added operations just in case it is related to some kind of URL rewriting/server-side configu... 
[08:39:09] (03PS1) 10Muehlenhoff: Tweak some server groups / grains [puppet] - 10https://gerrit.wikimedia.org/r/252403 [08:39:12] (03PS1) 10Muehlenhoff: Assign salt grains for ipsec test hosts [puppet] - 10https://gerrit.wikimedia.org/r/252404 [08:43:33] (03PS2) 10Muehlenhoff: Tweak some server groups / grains [puppet] - 10https://gerrit.wikimedia.org/r/252403 [08:43:41] (03CR) 10Muehlenhoff: [C: 032 V: 032] Tweak some server groups / grains [puppet] - 10https://gerrit.wikimedia.org/r/252403 (owner: 10Muehlenhoff) [08:43:57] (03PS2) 10Muehlenhoff: Assign salt grains for ipsec test hosts [puppet] - 10https://gerrit.wikimedia.org/r/252404 [08:44:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for ipsec test hosts [puppet] - 10https://gerrit.wikimedia.org/r/252404 (owner: 10Muehlenhoff) [08:51:39] !log importing wmf-mariadb10_10.0.22-1_amd64.deb wmf-mysql57_5.7.9-1_amd64.deb on carbon [08:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:03] (03PS2) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [09:01:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [09:09:54] robh: https://phabricator.wikimedia.org/T118372 // OAuth is broken for all apps [09:10:07] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798586 (10valhallasw) [09:10:15] (03PS1) 10Giuseppe Lavagetto: hiera: double-quote interpolating tokens [puppet] - 10https://gerrit.wikimedia.org/r/252405 [09:10:24] <_joe_> valhallasw`cloud: what has rob to do with it? [09:10:46] from the topic I gather he's the contact person for when stuff breaks? [09:10:57] 'ops clinic duty'/ [09:10:59] <_joe_> oh yes, but he's not here atm I guess [09:11:05] <_joe_> since it's like 3 am there [09:11:21] <_joe_> anyways, oauth is broken? 
let me check [09:11:32] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798540 (10valhallasw) [09:11:58] <_joe_> valhallasw`cloud: I doubt it's an ops issue though [09:12:01] yeah, see the error message at https://tools.wmflabs.org/widar/index.php?action=authorize, which is what widar receives from mw.org [09:12:06] Request from 10.64.0.102 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2018594656 [09:12:06] Forwarded for: 10.68.20.248, 10.64.0.102, 10.64.0.102 [09:12:06] Error: 503, Service Unavailable at Wed, 11 Nov 2015 09:11:43 GMT [09:12:23] <_joe_> yeah just tested oauth, it's broken [09:12:31] <_joe_> now let me look at error logs [09:13:25] <_joe_> Fatal error: $this is null in /srv/mediawiki/php-1.27.0-wmf.6/extensions/OAuth/backend/MWOAuthRequest.php on line 71 [09:13:34] <_joe_> this seems related, doesn't it valhallasw`cloud ? [09:13:35] <_joe_> :P [09:13:48] <_joe_> ok let me find out what has changed there [09:14:55] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798591 (10Joe) a:3Joe [09:15:56] the 1:50 scap seems to be the origin point [09:16:09] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798540 (10Joe) I confirmed OAuth is broken and the reason is the following: ``` Fatal error: $this is null in /srv/mediawiki/php-1.27.0-wmf.6/extensions/OAuth/backend/MWOAuthRequest.php on line 71 ``` Investi... [09:16:12] <_joe_> jynus: yup [09:16:13] o no [09:16:17] <_joe_> no? [09:16:21] it has happened since 1:05 [09:16:49] so around 1AM GMT [09:17:30] <_joe_> 1 AM GMT? [09:17:33] <_joe_> sure it's GMT? 
[09:17:39] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Puppet has 1 failures [09:17:51] maybe kibana is translating to my local timezone [09:18:18] <_joe_> I think it is [09:19:10] no, it is GMT and it happened as early as 00:53 GMT / 1:53 my timezone [09:19:19] <_joe_> jynus: no you're actually right, it started around 0:53 [09:19:20] <_joe_> eheh [09:19:24] :-) [09:19:34] <_joe_> 00:50 logmsgbot: twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.6 [09:19:41] <_joe_> this seems the smoking gun [09:19:45] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798600 (10Glaisher) https://github.com/wikimedia/mediawiki-extensions-OAuth/blob/master/backend/MWOAuthRequest.php#L37 MWOAuthRequest::fromRequest() is a static function [09:19:57] <_joe_> I have no idea how to revert that [09:20:19] ... [09:20:36] _joe_: want me to take a look? [09:20:49] <_joe_> twentyafterfour: I think Glaisher found the problem [09:20:52] <_joe_> see his comment [09:21:13] <_joe_> that is clearly wrong :) [09:21:42] <_joe_> I am working on a patch [09:22:04] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798602 (10Joe) @Glaisher you're right, that is the problem indeed. [09:22:14] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798603 (10Glaisher) Caused by https://gerrit.wikimedia.org/r/#/c/248488/ [09:22:21] 6operations, 10MediaWiki-extensions-OAuth: OAuth broken - https://phabricator.wikimedia.org/T118372#1798606 (10mmodell) [09:22:24] <_joe_> twentyafterfour: what I don't know is how to push a fix to an extension to prod [09:23:09] <_joe_> twentyafterfour: so, should I just patch the master of the extension and then cherry-pick that to the branch? 
[09:23:18] _joe_: just patch then git pull and sync-file on tin [09:23:23] I can deploy it [09:23:32] <_joe_> patch on tin? [09:23:32] _joe_: yeah [09:24:25] cc me on the patch and I can deploy it. cherry picking from master to the branch is the current practice, although I'd like to change that in the future [09:28:03] <_joe_> twentyafterfour: ok I am still preparing the patch [09:28:57] 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1798622 (10zhuyifei1999) [09:33:45] <_joe_> twentyafterfour: https://gerrit.wikimedia.org/r/252407 [09:35:32] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798631 (10Joe) 5Open>3Resolved [09:35:38] <_joe_> Glaisher: you too :) [09:35:49] <_joe_> who resolved that bug? [09:36:01] heh [09:36:19] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798540 (10Joe) [09:36:41] <_joe_> Glaisher: I would love you to review the patch I wrote :) [09:37:11] <_joe_> oh, twentyafterfour already merged it :) ok [09:37:21] _joe_: I was just testing something in phabricator settings, it's my fault that bug got resolved [09:37:25] :D [09:37:41] * twentyafterfour will fix that right after deploying this [09:37:50] <_joe_> twentyafterfour: no problems, I was just saying I didn't think it was :P [09:38:00] <_joe_> and thanks for working on the deployment this late :) [09:38:24] https://gerrit.wikimedia.org/r/#/c/252408/ is the cherry-pick to wmf.6 [09:39:05] <_joe_> yup just gave my +2 [09:39:06] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798637 (10mmodell) [09:39:11] and it's no problem at all. 
I'm usually up late :) [09:39:27] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798639 (10Joe) [09:40:36] <_joe_> Glaisher: thanks for pointing out the bug btw, I was uselessly running around classes... damn my php is rusty [09:40:52] np [09:41:09] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [09:42:51] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [09:43:27] <_joe_> twentyafterfour: are you going to deploy the change, or should I? [09:43:34] _joe_: it's syncing now [09:43:38] <_joe_> oh ok [09:43:44] <_joe_> the sal comes at the end [09:43:48] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.6/extensions/OAuth/: oauth login broken by php error. This fixes T118372 (duration: 00m 31s) [09:43:51] scap doesn't !log until the end when it's sync-dir... [09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:06] * twentyafterfour should fix that. all scap syncs should log at beginning and end I think [09:44:19] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:44:21] <_joe_> uhm doesn't seem to work [09:45:27] (03PS1) 10Jcrespo: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252411 [09:45:39] <_joe_> so, what will the error be now? [09:45:54] hmm, another error? /me reads through the history on that extension [09:46:27] <_joe_> twentyafterfour: no seems like some delay in getting up with the change? [09:46:34] <_joe_> can you touch the file and sync it again? 
[09:46:57] Fatal error: Class undefined: MediaWiki\Extensions\OAuth\LoggerFactory [09:47:57] <_joe_> jynus: ah, crap [09:48:08] <_joe_> yes of course [09:48:11] <_joe_> let me fix that [09:50:30] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798652 (10jcrespo) 5Resolved>3Open [09:50:33] * twentyafterfour thinks we need more unit tests for oauth [09:50:59] (...hmm checks to see if we have any unit tests for oauth) [09:51:33] <_joe_> twentyafterfour: yeah I am just writing hotfixes [09:51:45] <_joe_> so let's cherry-pick this first, and merge it later :) [09:51:58] <_joe_> and next time I'll just revert the relevant change, I swear :P [09:52:08] <_joe_> twentyafterfour: https://gerrit.wikimedia.org/r/252412 [09:53:32] cherry pick and merge later? so just ignore the patch on master? [09:53:41] <_joe_> not ignore [09:53:47] <_joe_> just wait to be sure it works :P [09:53:51] right [09:54:07] <_joe_> cherry-picking [09:54:25] <_joe_> twentyafterfour: https://gerrit.wikimedia.org/r/#/c/252413/ [09:56:09] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798670 (10Joe) 5Open>3Resolved [09:56:12] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798540 (10Joe) [09:56:56] syncing [09:57:19] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.6/extensions/OAuth/: really fix T118372 this time (duration: 00m 29s) [09:57:21] <_joe_> kk [09:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:28] <_joe_> let me test this again [09:57:39] <_joe_> works! 
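(Editor's note on the [09:44:06] remark that all scap syncs should log at the beginning and at the end: a minimal sketch of that idea, assuming a context manager wrapped around the sync step. This is not scap's actual code; `logged_sync`, the `announce` callback and the message wording are hypothetical names chosen for illustration.)

```python
import contextlib
import time


@contextlib.contextmanager
def logged_sync(description, announce=print):
    """Announce a sync when it starts and again when it finishes or fails.

    `announce` is a hypothetical callback; a real deployment tool would
    send the message to its own logging channel instead of printing it.
    """
    start = time.time()
    announce("Started sync: %s" % description)
    try:
        yield
    except Exception:
        # Report the failure with its duration, then let the error propagate.
        announce("Failed sync: %s (duration: %.0fs)"
                 % (description, time.time() - start))
        raise
    else:
        announce("Finished sync: %s (duration: %.0fs)"
                 % (description, time.time() - start))


if __name__ == "__main__":
    with logged_sync("php-1.27.0-wmf.6/extensions/OAuth/"):
        pass  # the actual sync work would run here
```

With this shape the start message is emitted before any work begins, so people watching the channel know a sync is in flight rather than learning about it only after it completes.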
[09:58:05] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798674 (10mmodell) 5Resolved>3Open [09:59:07] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review: OAuth broken - https://phabricator.wikimedia.org/T118372#1798676 (10Joe) @Magnus my tests show that this is now resolved, can you confirm? [09:59:08] woot! [09:59:41] (03PS1) 10Muehlenhoff: Fix typo in regexp [puppet] - 10https://gerrit.wikimedia.org/r/252415 [10:00:24] * twentyafterfour accidentally reopened the task [10:02:01] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review, 5WMF-deploy-2015-11-10_(1.27.0-wmf.6), 5WMF-deploy-2015-11-17_(1.27.0-wmf.7): OAuth broken - https://phabricator.wikimedia.org/T118372#1798681 (10jcrespo) [10:02:31] (03PS3) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) [10:02:47] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review, 5WMF-deploy-2015-11-10_(1.27.0-wmf.6), 5WMF-deploy-2015-11-17_(1.27.0-wmf.7): OAuth broken - https://phabricator.wikimedia.org/T118372#1798683 (10Joe) I tested a few applications and OAuth seems to be back. 
[10:02:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi) [10:02:55] 6operations, 10MediaWiki-extensions-OAuth, 5Patch-For-Review, 5WMF-deploy-2015-11-10_(1.27.0-wmf.6), 5WMF-deploy-2015-11-17_(1.27.0-wmf.7): OAuth broken - https://phabricator.wikimedia.org/T118372#1798685 (10Joe) 5Open>3Resolved [10:03:20] (03CR) 10Jcrespo: [C: 032] Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252411 (owner: 10Jcrespo) [10:03:34] 6operations, 5Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1798688 (10fgiunchedi) 5Open>3Resolved [10:04:05] <_joe_> godog: sweet [10:04:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:05:07] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 for maintenance (duration: 00m 30s) [10:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:22] (03PS2) 10Muehlenhoff: Fix typo in regexp [puppet] - 10https://gerrit.wikimedia.org/r/252415 [10:07:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix typo in regexp [puppet] - 10https://gerrit.wikimedia.org/r/252415 (owner: 10Muehlenhoff) [10:07:37] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2067 for maintenance (not db1067) (duration: 00m 30s) [10:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:51] _joe_: hehe baby steps, getting there tho [10:08:13] 6operations, 6Labs, 10Tool-Labs, 7Mail: Offer a solution to manage @toolserver.org mail redirections - https://phabricator.wikimedia.org/T116373#1798694 (10valhallasw) Yes: create a ticket with the requested change, in the #tool-labs project, ccing @coren. /etc/toolserver.aliases is currently unpuppetized... 
[10:09:27] (03CR) 10Filippo Giunchedi: "I agree with the general idea, though it seems we could piggy back on existing check_graphite checks? in other words I can't think of a ca" [puppet] - 10https://gerrit.wikimedia.org/r/251675 (owner: 10Ori.livneh) [10:12:27] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1798710 (10hashar) Danke! ``` $ ssh gallium.wikimedia.org apt-cache policy jenkins jenkins: Installed: 1.625.1... [10:13:21] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1798712 (10MoritzMuehlenhoff) 5Open>3declined In the end none of the two options was used (since this is effectively a one-time transition). One very usable workaround used on the Kafka brokers was to: - run "systemctl... [10:15:43] 6operations, 10MediaWiki-extensions-OAuth, 5WMF-deploy-2015-11-10_(1.27.0-wmf.6), 5WMF-deploy-2015-11-17_(1.27.0-wmf.7): OAuth broken - https://phabricator.wikimedia.org/T118372#1798719 (10revi) [10:16:12] !log reimage restbase2001.codfw.wmnet [10:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:16:21] !log disabling puppet and restarting mysql on db2067 [10:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:00] PROBLEM - Host restbase2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:00] PROBLEM - mysqld processes on db2067 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [10:23:38] ^that is me [10:24:21] ACKNOWLEDGEMENT - mysqld processes on db2067 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo regular maintenance, ignore [10:24:35] (03PS1) 10Muehlenhoff: Assign salt grains for es database servers [puppet] - 10https://gerrit.wikimedia.org/r/252417 [10:24:49] RECOVERY - Host restbase2001 is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms 
[10:27:41] (03PS1) 10Filippo Giunchedi: fix stdout/stderr shell redirection syntax - take #2 [puppet] - 10https://gerrit.wikimedia.org/r/252418 [10:28:50] PROBLEM - Restbase root url on restbase2001 is CRITICAL: Connection refused [10:28:51] PROBLEM - configured eth on restbase2001 is CRITICAL: Connection refused by host [10:28:51] PROBLEM - Check size of conntrack table on restbase2001 is CRITICAL: Connection refused by host [10:29:09] PROBLEM - puppet last run on restbase2001 is CRITICAL: Connection refused by host [10:29:20] PROBLEM - dhclient process on restbase2001 is CRITICAL: Connection refused by host [10:29:32] (03PS2) 10Filippo Giunchedi: fix stdout/stderr shell redirection syntax - take #2 [puppet] - 10https://gerrit.wikimedia.org/r/252418 [10:29:39] PROBLEM - salt-minion processes on restbase2001 is CRITICAL: Connection refused by host [10:29:40] PROBLEM - Disk space on restbase2001 is CRITICAL: Connection refused by host [10:29:47] (03PS2) 10Muehlenhoff: Assign salt grains for es database servers [puppet] - 10https://gerrit.wikimedia.org/r/252417 [10:29:50] PROBLEM - DPKG on restbase2001 is CRITICAL: Connection refused by host [10:29:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for es database servers [puppet] - 10https://gerrit.wikimedia.org/r/252417 (owner: 10Muehlenhoff) [10:30:02] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: Connection refused by host [10:30:09] PROBLEM - service on restbase2001 is CRITICAL: Connection refused by host [10:30:12] silenced ^ [10:31:17] (03PS3) 10Filippo Giunchedi: fix stdout/stderr shell redirection syntax - take #2 [puppet] - 10https://gerrit.wikimedia.org/r/252418 [10:31:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fix stdout/stderr shell redirection syntax - take #2 [puppet] - 10https://gerrit.wikimedia.org/r/252418 (owner: 10Filippo Giunchedi) [10:37:18] RECOVERY - mysqld processes on db2067 is OK: PROCS OK: 1 process with command name mysqld [10:41:24] 6operations, 
10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1798739 (10jcrespo) > and forget these systems will likely eventually need allocation (in some shape or form.) I think it is better to close this for now, as I will not request proxys at any time soon. The... [10:43:29] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [10:44:29] (03CR) 10Paladox: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/250444 (owner: 10Paladox) [10:46:14] (03PS3) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [10:49:28] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:54:26] (03PS2) 10Giuseppe Lavagetto: hiera: double-quote interpolating tokens [puppet] - 10https://gerrit.wikimedia.org/r/252405 [10:56:02] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: double-quote interpolating tokens [puppet] - 10https://gerrit.wikimedia.org/r/252405 (owner: 10Giuseppe Lavagetto) [10:59:28] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1798743 (10IKhitron) Hello, @Krenair. Still does not work, two run days after. 
[11:04:49] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: Connection refused by host [11:05:19] PROBLEM - Restbase root url on restbase2001 is CRITICAL: Connection refused [11:05:58] PROBLEM - cassandra CQL 10.192.16.152:9042 on restbase2001 is CRITICAL: Connection refused [11:06:05] (03PS2) 10Florianschmidtwelzow: REL1_26 knocks on the door [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252399 [11:06:07] PROBLEM - configured eth on restbase2001 is CRITICAL: Connection refused by host [11:06:28] PROBLEM - dhclient process on restbase2001 is CRITICAL: Connection refused by host [11:06:57] PROBLEM - puppet last run on restbase2001 is CRITICAL: Connection refused by host [11:07:18] PROBLEM - salt-minion processes on restbase2001 is CRITICAL: Connection refused by host [11:07:37] PROBLEM - service on restbase2001 is CRITICAL: Connection refused by host [11:07:48] PROBLEM - Check size of conntrack table on restbase2001 is CRITICAL: Connection refused by host [11:07:58] (03PS1) 10Muehlenhoff: Assign salt grains for dbproxy systems [puppet] - 10https://gerrit.wikimedia.org/r/252419 [11:08:08] PROBLEM - DPKG on restbase2001 is CRITICAL: Connection refused by host [11:08:28] PROBLEM - Disk space on restbase2001 is CRITICAL: Connection refused by host [11:08:57] (03PS2) 10Muehlenhoff: Assign salt grains for dbproxy systems [puppet] - 10https://gerrit.wikimedia.org/r/252419 [11:09:07] PROBLEM - RAID on restbase2001 is CRITICAL: Connection refused by host [11:10:29] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [11:11:21] (03PS4) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [11:12:07] RECOVERY - configured eth on restbase2001 is OK: OK - interfaces 
up [11:12:08] RECOVERY - DPKG on restbase2001 is OK: All packages OK [11:12:19] RECOVERY - Disk space on restbase2001 is OK: DISK OK [11:12:20] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1798768 (10fgiunchedi) 3NEW a:3ArielGlenn [11:12:27] RECOVERY - dhclient process on restbase2001 is OK: PROCS OK: 0 processes with command name dhclient [11:12:58] RECOVERY - RAID on restbase2001 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 [11:13:08] RECOVERY - salt-minion processes on restbase2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:13:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for dbproxy systems [puppet] - 10https://gerrit.wikimedia.org/r/252419 (owner: 10Muehlenhoff) [11:13:47] RECOVERY - Check size of conntrack table on restbase2001 is OK: OK: nf_conntrack is 0 % full [11:13:55] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1798776 (10fgiunchedi) more context ``` root@restbase2001:~# ps fwaux | grep -i salt-call root 2317 1.3 0.0 334692 50056 ? Ssl 11:11 0:00 \_ /usr/bin/python /usr/bin/salt-call --log-leve... 
[11:14:27] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:15:00] (03PS5) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [11:15:25] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1798780 (10fgiunchedi) [11:16:38] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:22:33] (03PS1) 10Filippo Giunchedi: install_server: fix /srv restbase provisioning in codfw [puppet] - 10https://gerrit.wikimedia.org/r/252420 [11:26:02] (03PS2) 10Filippo Giunchedi: install_server: fix /srv restbase provisioning in codfw [puppet] - 10https://gerrit.wikimedia.org/r/252420 [11:26:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: fix /srv restbase provisioning in codfw [puppet] - 10https://gerrit.wikimedia.org/r/252420 (owner: 10Filippo Giunchedi) [11:26:51] moritzm: I accidentally your grains change [11:27:24] godog: ok, thanks [11:28:03] !log reimage restbase2001.codfw.wmnet - take #2 [11:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:36] (03CR) 10Phuedx: [C: 031] Use CirrusSearch API in RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) (owner: 10Bmansurov) [11:33:07] PROBLEM - Host restbase2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:22] that's me ^ [11:36:16] RECOVERY - Host restbase2001 is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms [11:39:05] 6operations: mobileapps service_checker flapping on scb1002 - https://phabricator.wikimedia.org/T118383#1798830 (10fgiunchedi) 3NEW [11:39:19] _joe_ mobrovac thoughts ^ ? 
[11:39:39] <_joe_> godog: I didn't look inside mobileapps [11:39:39] 6operations: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798837 (10fgiunchedi) [11:39:50] <_joe_> it's flapping for a good reason [11:40:03] i'm trying to figure it out, but ... [11:40:16] PROBLEM - Check size of conntrack table on restbase2001 is CRITICAL: Connection refused by host [11:40:21] <_joe_> ok I will take a look later [11:40:56] fixed one bug in https://gerrit.wikimedia.org/r/#/c/252410/ [11:41:04] but it's unlikely that will change anything [11:41:06] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1798842 (10jcrespo) So, I am continuing with the operational tests (please understand that I have not yet solidified the ciphers, it was a test of configuration changes), and that I am us... [11:41:46] _joe_: yeah but 'malformed body' followed by a python error, expected? [11:43:42] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1798843 (10ArielGlenn) p:5Triage>3Normal [11:45:23] <_joe_> godog: it is, yes, if the body is expected to contain something and its check fail it will print out the python exception - I might want to catch this specific exception though [11:48:43] ^moritz, you may want to be aware of openssl/yassl-related comments about mysql [11:50:05] 6operations: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798866 (10fgiunchedi) looks like the offending response looks like this ``` w.. w...HTTP/1.1 200 OK access-control-allow-origin: * access-control-allow-headers: accept, x-requested-with, content-type access-cont... 
[11:53:21] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1798876 (10MoritzMuehlenhoff) 3NEW [11:54:24] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1798884 (10MoritzMuehlenhoff) [11:54:25] 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1798883 (10MoritzMuehlenhoff) [12:00:43] jynus: thanks, I've subscribed to the bug [12:01:19] I think the sensible option is to link dynamically to the openssl we are using at each time [12:02:27] 6operations: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798889 (10Joe) Ok this starts to make sense. The expected response is what follows: ``` "response": { "body": { "description":... [12:02:46] <_joe_> mobrovac: ^^ the problem the service checker exposes is real anyways [12:03:11] although debian disagrees: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=787118 [12:05:01] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1798891 (10jcrespo) @JanZerebecki how bad would it be to use one of the ciphers provided by yassl? Debian doesn't like linking it to OpenSSL, it seems: https://bugs.debian.org/cgi-bin/b... [12:05:09] Debian is wrong here, we should really use OpenSSL. I'll follow up on the Phab task later [12:05:57] or rather "some annoying, yet vocal minority of Debian is wrong" :-) [12:08:01] ok [12:08:04] :-) [12:18:42] 6operations: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798897 (10mobrovac) >>! In T118383#1798866, @fgiunchedi wrote: > looks like the offending response looks like this > > ``` > w.. w...HTTP/1.1 200 OK > access-control-allow-origin: * > access-control-allow-header... 
[12:19:11] 6operations, 6Services, 3Mobile-Content-Service: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798908 (10mobrovac) [12:19:29] godog: _joe_: ^^ [12:19:42] <_joe_> uhm so maybe a red herring from godog? [12:20:43] yeah could be a red herring from me [12:21:10] <_joe_> anyways, do we all agree we need to print repr(body) in the error message? [12:21:21] <_joe_> that would help us [12:21:45] (03PS1) 10ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - 10https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) [12:22:05] definitely would, not sure if nagios/nrpe enforces length limits on messages [12:22:34] (03CR) 10jenkins-bot: [V: 04-1] puppetize salt master pub/priv keys for labs and prod masters [puppet] - 10https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: 10ArielGlenn) [12:22:59] (03CR) 10ArielGlenn: "secret not yet added, don't know if this is the right approach even." [puppet] - 10https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: 10ArielGlenn) [12:25:29] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1798916 (10MoritzMuehlenhoff) The OpenSSL license is incompatible to the GPL due to some poor phrasing back in the days when OpenSSL/SSLeay was founded: https://en.wikipedia.org/wiki/Open... [12:25:50] 6operations, 6Services, 3Mobile-Content-Service: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1798918 (10mobrovac) Ok, so I managed to get the error on `scb1002`, cf P2301 . The strange part is that the response is there, but we still get the complaint from the check... [12:26:07] godog: _joe_: ^^ [12:27:58] mobrovac: sweet! [12:28:15] yeah in this case clearly having the body printed wouldn't have helped much heh [12:28:32] euh godog? 
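The suggestion above (print repr(body) in the error, and catch the case behind "NoneType object has no attribute __getitem__") could look roughly like the sketch below. It is not the real service_checker code: `check_body`, `describe_body`, and the 120-character cap are hypothetical names and values chosen for illustration, and the cap only reflects the open question of whether nagios/nrpe limits message length.

```python
def describe_body(body, max_len=120):
    """Return repr(body), truncated so it can fit in a short alert message."""
    text = repr(body)
    if len(text) > max_len:
        text = text[:max_len - 3] + "..."
    return text


def check_body(body, expected_keys, max_len=120):
    """Return (ok, message) for a decoded response body.

    A body that is not a mapping (for example None) is reported explicitly
    instead of surfacing as a raw Python exception.
    """
    if not isinstance(body, dict):
        return False, "malformed body (not a mapping): " + describe_body(body, max_len)
    missing = sorted(k for k in expected_keys if k not in body)
    if missing:
        return False, "body is missing keys %s: %s" % (missing, describe_body(body, max_len))
    return True, "OK"
```

With this shape, a None body yields a readable message that includes the repr, and a long body is cut down before it reaches the alert text.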
[12:28:32] (03PS2) 10ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - 10https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) [12:28:34] (03CR) 10Alexandros Kosiaris: [C: 031] restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [12:28:46] that's what i did - print the body of each result godog [12:28:58] and the paste there clearly shows the response [12:29:08] but the checker doesn't like it for some reason [12:29:28] (03CR) 10jenkins-bot: [V: 04-1] puppetize salt master pub/priv keys for labs and prod masters [puppet] - 10https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: 10ArielGlenn) [12:30:02] mobrovac: my first suggestion to improve service_checker in this case was to print the body, which would have ended up in the alert but obviously can't fit in there anyways [12:30:18] ah that kk godog :) [12:33:40] (03PS1) 10ArielGlenn: disbale git deploy/debdeploy activity on neodymium for now [puppet] - 10https://gerrit.wikimedia.org/r/252425 [12:34:22] (03CR) 10Alexandros Kosiaris: "looks ok technically, not sure however how often that would trigger. 
I am also wondering what exactly happens internally to redis and how " [puppet] - 10https://gerrit.wikimedia.org/r/252396 (https://phabricator.wikimedia.org/T118331) (owner: 10Ori.livneh) [12:34:57] (03CR) 10ArielGlenn: [C: 032] disbale git deploy/debdeploy activity on neodymium for now [puppet] - 10https://gerrit.wikimedia.org/r/252425 (owner: 10ArielGlenn) [12:35:42] godog: _joe_: the pprinted response is here - https://phabricator.wikimedia.org/P2301 [12:37:08] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: NRPE: Command check_endpoints_restbase not defined [12:37:18] PROBLEM - Restbase root url on restbase2001 is CRITICAL: Connection refused [12:37:58] PROBLEM - cassandra CQL 10.192.16.152:9042 on restbase2001 is CRITICAL: Connection refused [12:37:59] godog: stop restbase on restbase2001 please, we'll do a deploy and that'll fix it [12:38:52] mobrovac: yeah it isn't running [12:39:32] kk [12:40:41] mobrovac: are you going to update the copy on tin? [12:40:50] godog: did it already [12:41:06] ok! [12:41:33] godog: but let's keep it stopped for the time being [12:41:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "I agree with the general idea, though it seems we could piggy back on existing check_graphite checks? 
in other words I can't think of a ca" [puppet] - 10https://gerrit.wikimedia.org/r/251675 (owner: 10Ori.livneh) [12:42:28] mobrovac: kk [12:52:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [12:54:09] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:58:26] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2001 instance [puppet] - 10https://gerrit.wikimedia.org/r/252426 (https://phabricator.wikimedia.org/T95253) [12:59:59] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1798945 (10ArielGlenn) minion and master keys copied over from palladium, test of one minion completed. [13:03:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [13:09:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:10:16] (03PS1) 10BBlack: add X-Client-IP to as clientip= in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/252427 [13:11:44] (03PS2) 10BBlack: add X-Client-IP to as clientip= in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/252427 [13:15:39] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: puppet fail [13:15:53] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1798957 (10jcrespo) [Offtopic] BTW, that makes no sense, MySQL is not GPL, it is GPL + free software exception https://github.com/mysql/mysql-server/blob/5.7/README . 
[13:21:21] 6operations, 7Tracking: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1798958 (10Aklapper) meta = #tracking [13:22:48] commons site outage? [13:24:01] and back! [13:25:09] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1798963 (10Krenair) There just keeps being more and more things broken... ```krenair@mw1152:/var/log/mediawi... [13:27:32] hmm, but I am still getting 404s for images / upload.wm.o [13:27:55] <_joe_> addshore: can you give me an example? [13:28:07] bah, that also just started working :p [13:28:59] well, to start with I just couldn't even get the commons main page, then that fixed but all images and thumbs gave me 404 [13:29:31] Request from 10.20.0.182 via cp3047 frontend ([10.20.0.182]:80), Varnish XID 562457667 [13:29:31] Forwarded for: 62.252.189.65, 10.20.0.182 [13:29:31] Error: 403, Requested target domain not allowed. at Wed, 11 Nov 2015 13:29:10 GMT [13:29:38] _joe_: ^^ its back [13:30:31] <_joe_> addshore: uh that is bad [13:30:47] <_joe_> so it happens from time to time requesting the main page? [13:30:51] yup [13:31:03] all images are back to 404s for me now [13:31:11] <_joe_> addshore: ok 1 sec [13:32:49] <_joe_> I need to test it someway, as I am always getting a good response atm [13:33:30] mainpage is back for me now, images still 404ing [13:36:28] _joe_: apperaring again [13:36:54] <_joe_> addshore: yeah I can't reproduce your problem [13:37:01] same general error but Varnish XID 563592773 [13:37:17] <_joe_> addshore: are you using some socks proxy maybe? [13:37:29] nope :/ [13:38:04] <_joe_> are you on a unix-like system? if so, what happens if you request the page via curl? 
[13:38:14] <_joe_> sorry, trying to understand what could be going on [13:38:53] <_joe_> addshore: the error you pasted me before, was it for the main page or a single image? [13:39:10] the error is from the main page [13:39:15] individual image files give me a 404 [13:40:03] <_joe_> ok, the main page should /not/ be served by the host which is answering you [13:40:15] <_joe_> addshore: what OS are you using? [13:40:30] Windows 7 with Chrome browser [13:40:55] <_joe_> did you ever touch your hosts file? [13:41:05] <_joe_> I don't remember where it is on a windows system [13:41:46] oh of course ;) C:\Windows\System32\drivers\etc [13:42:00] but I doubt anything in there would be causing an issue :/ [13:42:04] <_joe_> if you have a line about commons, remove it [13:42:29] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:33] <_joe_> if not, please open cmd.exe and paste me the result of running nslookup commons.wikimedia.org [13:42:36] nothing related to wikimedia things in there, just localhost domains [13:42:57] !log restbase deploying 0d961a2 [13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:14] *** UnKnown can't find commons.wikimedia.org: No response from server [13:43:21] <_joe_> addshore: my working hypothesis is that you're getting bad name resolution [13:43:35] from the nslookup, that seems likely :/ [13:43:54] <_joe_> well, don't trust me with instructions to run on windows [13:43:58] <_joe_> let me check it :P [13:44:05] addshore: are you using a vpn for any purpose? [13:44:06] Pinging commons.wikimedia.org [62.252.172.241] [13:44:12] no vpn running [13:44:34] that is not a WMF ip [13:44:36] 241.172.252.62.in-addr.arpa domain name pointer know-sspiprxy-vip.network.virginmedia.net. 
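The manual check done here (resolve the name, then notice the answer is not a WMF address) can be sketched as a small helper. The network ranges below are illustrative assumptions and are not taken from this log; `looks_hijacked` is a hypothetical name. A real check would use whatever ranges the site operator publishes.

```python
import ipaddress

# Illustrative ranges only (an assumption, not from the log above).
EXPECTED_NETWORKS = ["208.80.152.0/22", "91.198.174.0/24", "198.35.26.0/23"]


def looks_hijacked(resolved_ip, expected_networks=EXPECTED_NETWORKS):
    """Return True if resolved_ip falls outside every expected network."""
    addr = ipaddress.ip_address(resolved_ip)
    return not any(addr in ipaddress.ip_network(net) for net in expected_networks)
```

On a live machine one might feed it the resolver's answer, for example `looks_hijacked(socket.gethostbyname("commons.wikimedia.org"))` after `import socket`. The address pinged in the log, 62.252.172.241, is outside all of the assumed ranges.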
[13:44:44] <_joe_> addshore: yeah I was about to say that [13:44:45] your ISP is probably trying to hijack you into their proxy via DNS [13:44:48] <_joe_> what mark said [13:44:55] bah [13:45:06] and they are doing it badly... [13:46:15] addshore: are you using Virgin's "WebSafe"? [13:46:41] bblack: I'm not actually in my own house currently, so I am not entirely sure. [13:46:49] Virgin Media now appears to be doing DNS hijacking. This practice means that they are redirecting DNS traffic to their own servers regardless of the fact that you as a customer have set different DNS servers on your equipment. Instead DNS request are sent to the Virgin DNS servers and Unlocator is bypassed. [13:46:49] Virgin calls this “service” Advanced Error Search. You can opt out of this “service” here https://my.virginmedia.com/advancederrorsearch/settings [13:47:00] apparently they have some kind of "save the children" web proxy filter thing called WebSafe, which is known to cause these issues [13:47:06] <_joe_> chasemp: that is very common [13:47:13] <_joe_> and it's not what we're seeing here [13:47:22] of couse this doesn't explain why you ended up on upload-lb for a request to text-lb [13:47:25] buggy proxy possibly [13:47:25] <_joe_> chasemp: it's done by vodafone in Italy as well [13:47:32] <_joe_> paravoid: buggy proxy, yes [13:47:41] <_joe_> paravoid: also that happens intermittently [13:47:48] yeah buggy proxy I assume. 
or they tried to manually configure their proxy for the special case of wikimedia and failed [13:48:10] <_joe_> bblack: or one proxy was brought back into the pool with some old config [13:48:20] <_joe_> and addshore is ending up on it intermittently [13:48:27] :( [13:48:38] as long as I've been here, and I'm sure for some time before that, we've had upload on a separate IP from other things though [13:48:47] it's not like they could've picked that up last week/month/year [13:48:49] yes [13:48:51] (legitimately) [13:48:55] <_joe_> bblack: that is true, right [13:49:05] <_joe_> so yeah just crappy, buggy proxy [13:49:09] bblack: I think we separated out upload. in... 2004 [13:49:11] maybe 2005 [13:49:13] maybe the proxy is getting confused by the same https certificate? [13:49:20] and thinks it's the same site [13:49:24] <_joe_> addshore: also, I assume they are decrypting all of your traffic [13:49:36] <_joe_> if you didn't configure the proxy in your browser [13:49:40] ...how? they can't [13:49:49] unless they owned a CA or something [13:49:51] paravoid: likely scenario, yeah. they might see the same wildcard-matching unified cert on both IPs and assume they can alias them [13:49:52] :) [13:49:57] bblack: yeah, that [13:49:57] if they have a CA cert installed [13:49:59] RECOVERY - Restbase endpoints health on restbase2001 is OK: All endpoints are healthy [13:49:59] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.112 second response time [13:50:18] paravoid: or unless the user installed Virgin's WebSafe software, which adds new root certs to the local machine... 
[13:50:20] <_joe_> paravoid: they installed some CA on the user's computer when the software got installed [13:50:35] <_joe_> that is the most likely scenario from what I read [13:50:35] doubt it's _that_ evil [13:50:41] there is no virgin software on my laptop :P [13:50:42] it's common practice [13:50:47] I think it's common for browser-filter software [13:50:49] <_joe_> addshore: oh it's your laptop [13:50:59] for corporate networks, not for end user ISPs [13:51:02] yes, my laptop, but not my house / internet connection [13:51:30] <_joe_> ok so this websafe thing is activated via a control panel, no software installed [13:51:51] <_joe_> how can it work with SSL? [13:51:58] http://store.virginmedia.com/discover/broadband/security/web-safe-notice.html [13:52:14] it doesn't filter encrypted requests, just passes them along I'm guessing? [13:52:20] and also blocks on an IP level [13:52:25] okay, basically I have decided I can either load image files or commons pages, but not bother at the same time... [13:52:54] <_joe_> addshore: ha. I am pretty sure they are doing something really evil [13:53:06] * addshore has never had anything like this before [13:53:11] yeah, it sounds like they're assuming aliasing when they see both sites close together or something [13:53:27] yup [13:53:28] but why would I still be getting a wikimedia foundation error? 
[13:53:43] so the problem here of course is that we're emitting a 403 which might make users think it's our problem [13:53:49] <_joe_> addshore: they are forwarding your request to the wrong wmf server [13:53:49] because it's sending a request for one of our domains to a server that actually serves a different one of our domains [13:54:07] bah [13:54:39] !log restbase deployed 0d961a2 (deploy end) [13:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:08] I guess we could put some custom varnish errorpage hacks on text-lb and upload-lb (for req.http.host == upload or != upload, respectively) to say "Hey your ISP / DNS service is doing something awful" [13:55:22] <_joe_> bblack: probably yes [13:55:24] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1798992 (10jcrespo) @krenair The first too lines are normal warnings. The last one does not provide any usef... [13:55:29] heh, and now commons.* is back but upload.* is gone ;) [13:55:32] love it... [13:55:37] <_joe_> addshore: oh, my [13:55:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [13:55:46] addshore: does it affect other wiki sites? commons should be similar to e.g. 
enwiki for all these purposes [13:55:49] <_joe_> ok I need to fix mobileapps [13:55:52] we could also set an error page for client.ip == 62.252.172.241 that says "your isp is crap, you should change ISPs" *g* [13:55:59] only half-kidding :P [13:56:01] <_joe_> paravoid: ahah [13:56:08] bblack: as far as I can tell it is only happening with commons and upload [13:56:19] <_joe_> paravoid: that would be a bit controversial I guess [13:56:22] paravoid: haha [13:56:31] addshore: are you starting out with https:// in the URL in both cases? [13:56:49] bblack: yes [13:57:16] so they can't be seeing the URL itself, to e.g. notice that commons+upload have some common URL patterns between them or whatever (if they even do) [13:57:42] they don't, other than filenames themselves [13:57:59] I wonder why commons is special [13:58:44] so if he changes his dns servers to say 4.2.2.2 4.2.2.4 does that circumvent this entirely? [13:59:00] oh, surely not... but maybe they were exploiting the InstantCommons HTTPS exception in their proxy, using it to filter traffic via HTTP proxied? [13:59:08] it was just removed a couple days ago... [13:59:19] and was allowing commons.wm.o + a settable header to bypass the HTTPS redirect... [14:00:49] it still wouldn't have worked on the upload IP though [14:01:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:02:00] but maybe they had some crazy logic running where upload requests got hijacked and proxied/rewritten to http://commons with user-agent:mediawiki, and now that config is broken and it's causing other broken things [14:02:22] it seems insane they'd bother to figure out how to do that, though [14:05:25] insane indeed [14:05:31] bblack: so their dns servers just always answer everything is at their proxies IP and that is their version of transparent proxy? 
seems ignoring their dns would avoid [14:05:48] just ocurred to me that's probably all www.unblock-us.com/ type stuff does to avoid region blocking [14:05:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2001 instance [puppet] - 10https://gerrit.wikimedia.org/r/252426 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [14:06:02] chasemp: except they probably hijack DNS requests to other servers anyways. But you could try [14:06:09] ill give it a shot [14:06:19] addshore: if you want to try that, reconfigure your laptop to explicitly use 8.8.8.8 for DNS instead of whatever the DHCP/ISP offers you [14:06:38] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [14:06:39] yeah if it's aggressive like that no dice but my bet is they aren't doing that as you can opt-out of this feature [14:06:45] but idk [14:07:42] yup, switching to google dns servers fixes it (or seems to) [14:07:52] hmmm maybe the commons distinction is just that they're both in mediawiki.org, in addition to the unified cert aliasing [14:08:01] so their feature opt-in is literally just "give me stupid dns" [14:08:11] chasemp: it seems that way :P [14:08:23] they might have some logic that says "if these IPs have the same cert and both request hostnames are in the same 2nd-level domain, consider them aliases" [14:08:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:08:52] mobrovac: restbase is running on restbase2001 btw [14:08:53] in which case the commons+upload problem might manifest for e.g. 
meta.wikimedia.org + upload [14:09:13] (but not enwiki and such) [14:09:34] bblack: if it really is some kind of meant-for-content-filtering proxy they do tend to equate a ip/hostname combo as a way to block people just putting in IP's to route around proxies iirc [14:10:06] godog: yup, did a deploy [14:10:09] route around the filtering I mean [14:10:21] mobrovac: kk [14:11:20] trying to recall the stupid tricks we used to get around the proxy whitelist, putting hex in the url bar type of stuff firefox would honor but the literal proxy whitelist didn't look for [14:12:00] there was some list years ago for alternate domain representations ff would honor we tried them all so proxies got wise at some point I imagine and by wise I mean stupid in this way maybe [14:12:16] addshore: if you have time to mess with it, can you try going back to their DNS and seeing if meta.wikimedia.org + upload.wikimedia.org exhibit the same issue as commons+upload? [14:13:05] *goes to try it* [14:13:41] heh, switch it abck and it imediatly goes back to a broken state! [14:14:32] so browsing to meta and no images load (same on commons) *waits for the brokenness to switch to images loading but pages not loading* [14:14:51] ok, well that confirms why it's commons then [14:15:19] they're assuming all sites that match *.wikimedia.org and share the same cert are identical regardless of the IP we serve (as in, assuming the different IPs are all part of the same pool) [14:15:24] interesting [14:16:10] or they're even dumber and not even looking at TLS, and just assuming *.wikimedia.org IPs are all interchangeable (probably for *.anything.whatever) [14:16:20] but every now and again they switch the IP they are using, hence sometimes pages load and sometimes images load? 
:P [14:16:29] yeah probably [14:16:44] maybe they're doing their own balancing over what they think is our pool of IPs for *.wikimedia.org [14:16:45] cache expires and it's nondeterministic I guess [14:16:49] or that [14:17:09] unrelated blast from the past on obscuring url's that used to work for most proxies back in the day http://www.pc-help.org/obscure.htm [14:17:21] just bringing back memories :) [14:17:38] probably someone started assuming that every site on the web is foo.com + www.foo.com aliased to the same IP, and wrote some logic basic on that :P [14:18:03] good god I bet your right [14:18:06] with a wildcard match thrown in just in case [14:18:32] !log start cassandra instance a on restbase2001 [14:18:33] and yeh, so when commons pages die and I get images back meta also stays up [14:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:48] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [14:18:56] So, how do we move forward from here? ;P tweet them telling them they make dumb assumptions? [14:20:29] apergos: https://phabricator.wikimedia.org/T115291 silly easy task to fix :) [14:21:08] hilarious! [14:23:43] well we can tell them somehow that their DNS is doing Bad Things. On our end, we can try to cope with this better by giving a better error message on server hostname mismatches. [14:23:58] we probably don't even need to get all fancy with the varnish hacks specific to upload [14:23:58] and gone. 
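The aliasing rule hypothesized in this discussion (same certificate plus same second-level domain means "same site") can be written down as a toy model. This is only a guess at the proxy's behaviour as speculated in the channel, not known Virgin Media logic; both function names and the certificate labels are hypothetical, and taking the last two labels is a deliberate simplification that ignores public-suffix rules.

```python
def second_level_domain(hostname):
    """Naively take the last two DNS labels, e.g. 'wikimedia.org'."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def proxy_would_alias(host_a, host_b, cert_a, cert_b):
    """Model of the suspected flawed heuristic: treat two hosts as
    interchangeable when they present the same certificate and share
    a second-level domain."""
    return cert_a == cert_b and second_level_domain(host_a) == second_level_domain(host_b)
```

Under this model commons and meta would both be confused with upload, while en.wikipedia.org would not, which matches what was observed from the affected connection.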
[14:24:23] there is no http response code for 'get a better isp' unfortunately [14:24:28] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:26:05] 6operations, 6Commons, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1799006 (10chasemp) >>! In T106937#1798210, @Dzahn wrote: >>>! In T106937#1485653, @chasemp wrote: >> @mark is this worthy of a ca... [14:26:06] chasemp: 666 should give them an indication, your ISP is the devil... [14:26:33] 6xx ISP Failure [14:27:16] *switches back to 8.8.8.8 & 8.8.4.4* [14:27:32] so this case is handled very different on text and upload in general [14:27:52] text-lb requests for a bad hostname like "upload.wikimedia.org" get handled by MediaWiki, which throws a generic 404 (could be improved?) [14:28:26] upload-lb requests for hostnames other than upload.wikimedia.org give a 403 directly from varnish (and there is no mediawiki behind them anyways, just swift and such) [14:28:47] I wonder what would happen if we didn't 403 there? [14:29:18] (maybe swift ignores the request hostname and it would treat it as if it were upload?) [14:29:48] 421 Misdirected Request ? [14:30:47] 10Ops-Access-Requests, 6operations: Update SSH key - https://phabricator.wikimedia.org/T118392#1799008 (10Mholloway) 3NEW [14:31:49] yeah, ms-fe.svc ignores the Host: header [14:32:04] if we weren't blocking non-upload hostnames in varnish, people could map random ones there and pollute the cache I guess [14:32:19] unless we also ignored req.http.host in vcl_hash for the upload cluster as well [14:32:45] it would still fail with a 404 for e.g. 
a commons main page request though [14:34:29] probably (a) MediaWiki should do a special/better 404 on invalid domainnames, and (b) upload-cache VCL should do similar instead of the simple 403 it gives now [14:38:39] will file a couple tasks [14:38:56] =] [14:39:11] bblack: feeel free to CC me so that I can give them a read! [14:40:13] (03PS1) 10Mobrovac: Mathoid: Enable texvcinfo generation [puppet] - 10https://gerrit.wikimedia.org/r/252429 [14:43:10] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1799029 (10Krenair) I tried it again manually and DeadendPages succeeded on arwiki but still fails on enwiki,... [14:44:26] 10Ops-Access-Requests, 6operations: Update mholloway's SSH key - https://phabricator.wikimedia.org/T118392#1799031 (10Dzahn) [14:47:08] (03PS5) 10Faidon Liambotis: Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [14:47:10] (03PS2) 10Faidon Liambotis: snapshot: create a proper role::snapshot [puppet] - 10https://gerrit.wikimedia.org/r/246828 [14:47:12] (03PS2) 10Faidon Liambotis: dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 [14:47:14] (03PS2) 10Faidon Liambotis: dataset: inline the non-role role classes [puppet] - 10https://gerrit.wikimedia.org/r/246826 [14:47:32] 6operations, 10Traffic: cache_upload should give an informative 404 rather than 403 on req.http.host != upload.wikimedia.org - https://phabricator.wikimedia.org/T118394#1799045 (10BBlack) 3NEW [14:48:11] https://phabricator.wikimedia.org/T118393 + https://phabricator.wikimedia.org/T118394 for the above DNS stuff... 
[14:51:15] (03PS1) 10Dzahn: admin: remove mholloway's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/252431 (https://phabricator.wikimedia.org/T118392) [14:52:01] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1799066 (10jcrespo) > I tried it again manually [...] but still fails on enwiki, and probably others. Does i... [14:52:11] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1799067 (10JanZerebecki) >>! In T111654#1798891, @jcrespo wrote: > @JanZerebecki @Dzahn how bad would it be to use one of the ciphers provided by yassl? The low DHE key size is probably... [14:56:54] (03PS1) 10DCausse: Rename timestamp to ts for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) [14:57:32] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1799076 (10jcrespo) I see it now on the logs. I think it is being killed automatically by mediawiki (because... [14:57:56] (03CR) 10DCausse: [C: 04-1] "We should deploy Ie575f471 first." [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) (owner: 10DCausse) [14:57:58] (03CR) 10Ottomata: "Awesome." [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [14:59:18] (03PS1) 10Muehlenhoff: Enable ferm on the remaining kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252433 [15:00:38] (03PS2) 10Physikerwelt: Mathoid: Enable texvcinfo generation [puppet] - 10https://gerrit.wikimedia.org/r/252429 (owner: 10Mobrovac) [15:02:35] (03CR) 10Jcrespo: [C: 031] "Can be applied at any time." 
[puppet] - 10https://gerrit.wikimedia.org/r/240055 (owner: 10Muehlenhoff) [15:02:37] (03CR) 10Physikerwelt: [C: 031] Mathoid: Enable texvcinfo generation [puppet] - 10https://gerrit.wikimedia.org/r/252429 (owner: 10Mobrovac) [15:04:45] !log disabled puppet on kafka1018, kafka1020, kafka1022 (for enabling ferm) [15:04:47] <_joe_> mobrovac: ok I give up :P [15:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:24] _joe_: on the checker? [15:08:32] <_joe_> mobrovac: yes, as soon as I made it better [15:08:38] hehehe [15:08:40] <_joe_> it never got an error again [15:08:49] super strange [15:10:03] <_joe_> well the alarm is not going off since some time [15:10:12] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1799088 (10chasemp) [15:10:13] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 3 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1799086 (10chasemp) 5Open>3Resolved >>! In T115491#1795538, @faidon wrote: > @chasemp, is this done? yes, I believe we can call this done. The hosts in this new... 
[15:12:44] 6operations, 6Labs, 10wikitech.wikimedia.org: wikitech regularly looses session directly after login - https://phabricator.wikimedia.org/T118395#1799090 (10JanZerebecki) 3NEW [15:13:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on the remaining kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252433 (owner: 10Muehlenhoff) [15:14:43] (03PS2) 10Dzahn: admin: remove mholloway's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/252431 (https://phabricator.wikimedia.org/T118392) [15:14:45] (03PS1) 10Jcrespo: Adding ferm to db1035, activating performance_schema [puppet] - 10https://gerrit.wikimedia.org/r/252435 [15:15:08] (03PS1) 10Rush: elastic: timeout on wmfelastic collector [puppet] - 10https://gerrit.wikimedia.org/r/252436 (https://phabricator.wikimedia.org/T117461) [15:15:11] (03CR) 10Dzahn: [C: 032] admin: remove mholloway's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/252431 (https://phabricator.wikimedia.org/T118392) (owner: 10Dzahn) [15:16:23] (03PS2) 10Rush: elastic: timeout on wmfelastic collector [puppet] - 10https://gerrit.wikimedia.org/r/252436 (https://phabricator.wikimedia.org/T117461) [15:18:19] (03CR) 10Rush: [C: 032] elastic: timeout on wmfelastic collector [puppet] - 10https://gerrit.wikimedia.org/r/252436 (https://phabricator.wikimedia.org/T117461) (owner: 10Rush) [15:19:38] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Update mholloway's SSH key - https://phabricator.wikimedia.org/T118392#1799116 (10Dzahn) @mholloway the current key has been removed on bast1001 and will be removed on other servers as soon as puppet runs again. we will follow-up with the new key. 
[15:19:50] !log stopping kafka1018 to enable ferm [15:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [10.0] [15:20:25] shhhh, ijust scheduled downtime in icinga [15:20:29] alert beat me to it [15:21:39] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Update mholloway's SSH key - https://phabricator.wikimedia.org/T118392#1799118 (10Mholloway) @Dzahn sounds great, thank you! [15:22:21] !log restarted kafka1018 to enable ferm [15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:32] (03CR) 10BBlack: "@ottomata - can we bring X-Analytics fields back into e.g. the log files on oxygen easily, already broken back out? If not, we might be b" [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [15:23:09] 6operations, 5Patch-For-Review: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1799121 (10chasemp) 5Open>3Resolved So I poked at this a bit and the failure mode was indeed weird on nobelium. I //think// we should have had a time on the urllib2 call to... [15:24:04] (03CR) 10Dzahn: "@Legoktm i had actually checked https before merging the last change but i saw a cert error" [puppet] - 10https://gerrit.wikimedia.org/r/252398 (owner: 10Legoktm) [15:24:08] (03CR) 10Gilles: [C: 031] redis: prohibit commands CONFIG, SLAVEOF and DEBUG by default [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [15:24:20] bblack, x-analytics is already on the logs on oxygen, no? [15:24:26] are you saying you want it as its own field there? [15:25:03] (03CR) 10Dzahn: [C: 032] "woot.. 
verified by Letsencrypt indeed" [puppet] - 10https://gerrit.wikimedia.org/r/252398 (owner: 10Legoktm) [15:25:05] (03PS6) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [15:25:15] (03PS2) 10Dzahn: [Planet Wikimedia] Use HTTPS for Legoktm's blog [puppet] - 10https://gerrit.wikimedia.org/r/252398 (owner: 10Legoktm) [15:25:27] !log stopping kafka1020 to enable ferm [15:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:31] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1799126 (10chasemp) 5Open>3Resolved >>! In T117097#1798052, @Andrew wrote: > All the boxes now have an OS installed and puppet and salt signed and run... [15:26:32] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 (owner: 10Giuseppe Lavagetto) [15:30:07] ottomata: oh you're right, it is. can we break out the field though? [15:30:46] this seems sort of like storing a set of JSON k=v inside a giant TEXT field in a relational database (having one of the json keys in the logs be X-analytics with its own format for subfields) [15:31:26] or is it too early in the pipeline for that? 
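A minimal sketch of the "break out the field" idea above, in Python. It assumes the X-Analytics value is a semicolon-separated list of `key=value` pairs and that each log record is a JSON object carrying it under an `x_analytics` key. Both the exact format and the key name are assumptions for illustration, not something this log confirms.

```python
import json


def parse_x_analytics(value):
    """Split an assumed 'k1=v1;k2=v2' string into a dict of subfields."""
    fields = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, val = pair.partition("=")
        # A bare key without '=' is kept with an empty value.
        fields[key] = val if sep else ""
    return fields


def expand_record(line, key="x_analytics"):
    """Replace the flat subfield string in one JSON log line with a nested object."""
    record = json.loads(line)
    raw = record.get(key)
    if isinstance(raw, str):
        record[key] = parse_x_analytics(raw)
    return json.dumps(record, sort_keys=True)
```

Something like `expand_record` could be applied line by line to a stream of JSON records, so the subfields become individually addressable without changing what the producer writes.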
[15:31:52] !log restarted kafka broker on kafka1020 [15:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:38] (03PS3) 10Dzahn: [Planet Wikimedia] Use HTTPS for Legoktm's blog [puppet] - 10https://gerrit.wikimedia.org/r/252398 (owner: 10Legoktm) [15:33:58] (03CR) 10Dzahn: [C: 031] dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 (owner: 10Faidon Liambotis) [15:34:39] (03PS1) 10Giuseppe Lavagetto: confd: don't exclude inclusion of confd in confd::file [puppet] - 10https://gerrit.wikimedia.org/r/252437 [15:34:51] !log stopping kafka broker on kafka1022 (to enable ferm) [15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:34] (03PS2) 10Giuseppe Lavagetto: confd: don't exclude inclusion of confd in confd::file [puppet] - 10https://gerrit.wikimedia.org/r/252437 [15:35:47] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1799137 (10jcrespo) Please note than I am aiming for OpenSSL integration and the recommend cipher (I've just recompiled and about to test the new package now). I wanted to know if we had... 
[15:37:26] (03CR) 10Giuseppe Lavagetto: [C: 032] confd: don't exclude inclusion of confd in confd::file [puppet] - 10https://gerrit.wikimedia.org/r/252437 (owner: 10Giuseppe Lavagetto) [15:38:28] (03CR) 10Dzahn: snapshot: create a proper role::snapshot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246828 (owner: 10Faidon Liambotis) [15:38:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0] [15:39:56] 6operations: recommended ssh ciphers/kexalgorithms combination doesn't work for ilo - https://phabricator.wikimedia.org/T111698#1799140 (10JanZerebecki) The ones that have no match on the client as requested from the server side: KexAlgorithms diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 Ciphers aes256... [15:41:25] (03CR) 10Dzahn: "compiler reports an error http://puppet-compiler.wmflabs.org/1245/" [puppet] - 10https://gerrit.wikimedia.org/r/246828 (owner: 10Faidon Liambotis) [15:43:51] (03PS5) 10Dzahn: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [15:44:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [15:44:55] (03CR) 10Jforrester: "Won't adding this present it as "latest stable MediaWiki"?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/252399 (owner: 10Florianschmidtwelzow) [15:47:29] !log restarted kafka broker on kafka1022 [15:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:45] bblack, we break x-analytlics fields into a map in the webrequest refinined table, so they are individually selectable in hive [15:49:51] but, that is post processed [15:50:09] it does seem like it would be cooler if we could make varnishkafka write nested json with those key,value pairs [15:50:20] buuut, i think it would require a lot of change in analytics stuff to deal with that [15:50:29] bblack, but, those oxygen logs are generated by kafkatee [15:50:38] which can pipe through an arbitrary script [15:50:52] so, if you want to do some transformation of the logs before they are written to a file, should be fairly easy [15:51:23] 6operations, 10Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1799156 (10Krenair) a:5Krenair>3None [15:52:15] (03PS1) 10BBlack: Exclude WMF Office from ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/252439 [15:52:26] bblack https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L344 [15:52:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [15:52:33] ok [15:55:06] (03PS3) 10BBlack: add X-Client-IP to as client_ip= in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/252427 [15:56:40] PROBLEM - service on lvs4002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [15:59:31] PROBLEM - service on lvs3001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [15:59:40] _joe_: ^ ? 
[15:59:45] (03CR) 10Ottomata: "+1," [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [15:59:51] <_joe_> bblack: damn race condition [15:59:54] ok [15:59:58] (03CR) 10Ottomata: [C: 031] add X-Client-IP to as client_ip= in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [16:00:02] 6operations, 7Graphite: http 500 errors from check_graphite on rate of media upload checks - https://phabricator.wikimedia.org/T118398#1799168 (10fgiunchedi) 3NEW a:3fgiunchedi [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151111T1600). Please do the needful. [16:00:16] <_joe_> bblack: or better, I still need to remove the service, and without templates it won't start [16:01:02] PROBLEM - service on lvs2002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:01:18] (03PS1) 10Giuseppe Lavagetto: pybal: remove most confd-related references [puppet] - 10https://gerrit.wikimedia.org/r/252440 [16:01:23] <_joe_> bblack: meh, this is going to happen everywhere I guess [16:02:25] (03PS2) 10Giuseppe Lavagetto: pybal: remove most confd-related references [puppet] - 10https://gerrit.wikimedia.org/r/252440 [16:02:43] <_joe_> bblack: so I'm applying this change ^^ instead and running apt-get remove --purge confd by hand [16:02:45] !log uploaded jenkins 1.625.2 to apt.wm.org, upgrading on gallium [16:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:56] stopping jenkins for upgrade [16:03:01] who wants to do swat this morning? 
I can but woke up ~10 min ago...so i'd rather not :) [16:03:04] started again [16:03:18] but i can in ~20 min which is still in the window [16:03:32] (03CR) 10BBlack: "@joal - for context "X-Client-IP" is a request header we're setting internally, which is the result of VCL code that accurately decodes th" [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [16:03:57] <_joe_> jenkins is down [16:04:33] PROBLEM - service on lvs2006 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:04:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 6 below the confidence bounds [16:05:17] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1799184 (10Dzahn) @hashar i did the same for 1.625.2 right now because of further fixes they released [16:05:42] <_joe_> bblack: I'm just going to ack these alarms for now [16:08:00] ebernhardson: I can SWAT this morning as long as you're around to babysit your patches. jhobs ping for SWAT. [16:08:21] _joe_: it's back, was restarting after upgrade [16:08:22] thcipriani: I'm here [16:08:34] ACKNOWLEDGEMENT - service on lvs2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:35] ACKNOWLEDGEMENT - service on lvs2002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:35] ACKNOWLEDGEMENT - service on lvs2006 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. 
[16:08:35] ACKNOWLEDGEMENT - service on lvs3001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:35] ACKNOWLEDGEMENT - service on lvs4002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:35] ACKNOWLEDGEMENT - service on lvs4004 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:41] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [16:08:42] ACKNOWLEDGEMENT - service on lvs2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:42] ACKNOWLEDGEMENT - service on lvs2002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:42] ACKNOWLEDGEMENT - service on lvs2006 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:42] ACKNOWLEDGEMENT - service on lvs3001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:42] ACKNOWLEDGEMENT - service on lvs4002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:43] ACKNOWLEDGEMENT - service on lvs4004 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto confd doesnt start because of having no templates. [16:08:53] <_joe_> uhg [16:09:54] kk, jhobs since yours looks small and easy, let's do that first. [16:10:35] thcipriani: sounds good. 
This is my first SWAT deploy btw. As I understand it, you probably won't need much from me other than testing it when complete, but just lemme know if you need anything :) [16:11:33] thcipriani: thanks [16:11:54] jhobs: that is correct. Is there anything that goes along with this deploy other than pushing out the config? [16:12:15] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1799188 (10jcrespo) This the result after linking to openssl, reinstalling and configuring the server with `ssl-cipher=TLSv1.2`. I tnink it has a more reasonable list of available ciphers... [16:12:16] thcipriani: no, we already deployed the preparation patch with the train yesterday [16:12:21] PROBLEM - service on lvs3002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:12:22] nice, ok. [16:12:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [16:12:53] PROBLEM - service on lvs2005 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:13:24] (03Merged) 10jenkins-bot: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [16:14:08] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 / 1.625.2 - https://phabricator.wikimedia.org/T118158#1799197 (10Dzahn) [16:14:11] PROBLEM - service on lvs4003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:14:41] 6operations, 5Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1799199 (10fgiunchedi) 5Resolved>3Open reopening, blocked by T118398 [16:14:50] (03PS1) 10BBlack: geoip.inc: use X-Client-IP 
[puppet] - 10https://gerrit.wikimedia.org/r/252442 [16:14:52] 6operations, 7Graphite: http 500 errors from check_graphite on rate of media upload checks - https://phabricator.wikimedia.org/T118398#1799168 (10fgiunchedi) [16:14:53] 6operations, 5Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1799203 (10fgiunchedi) [16:15:51] (03PS2) 10BBlack: geoip.inc: use X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/252442 [16:16:36] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: remove most confd-related references [puppet] - 10https://gerrit.wikimedia.org/r/252440 (owner: 10Giuseppe Lavagetto) [16:16:58] (03PS3) 10BBlack: geoip.inc: use X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/252442 (https://phabricator.wikimedia.org/T89688) [16:17:31] bblack, curious q, why X-Client-Ip vs X-Real-IP? [16:17:35] seems like X-Real-IP is a header that already is known by the world? [16:17:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: First QuickSurvey for reader segmentation research - external survey [[gerrit:251133]] (duration: 00m 30s) [16:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:50] ^ jhobs check please [16:18:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [16:18:34] (03CR) 10Ottomata: "We'd have to talk about this change, but I think in ideal world we'd make this a top level field in the webrequest data that varnishkafka " [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [16:18:40] ottomata: what do you mean by "known by the world"? [16:18:53] uhhh, meaning i've seen it in some code and if you google it it comes up :) [16:18:57] sure [16:19:23] maybe it is equally as made up [16:19:26] in our case, they both have internal meaning. 
We already had X-Real-IP in our headers for quite some time though, and something or other might have come to rely on what it means already, so I left it alone [16:19:33] ah ok [16:19:34] cool [16:19:35] makes sense [16:19:42] PROBLEM - service on lvs3003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:20:11] for us, X-Real-IP is "the real outside-world IP address that contacted our edge caches", and X-Client-IP is "The actual client IP, which is often the same as X-Real-IP, but might not be if there was a trusted external proxy's address at X-Real-IP" [16:20:33] (such as OperaMini) [16:21:01] PROBLEM - service on lvs2004 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [16:21:49] (in other words, unlike XRIP, XCIP "sees through" 3rd-party proxies we've chosen to trust) [16:22:04] ok cool [16:22:09] makes sense [16:22:18] i like X-Client-IP [16:25:24] thcipriani: looks good [16:25:36] jhobs: cool, thanks for checking! [16:25:50] currently the third-party proxy dataset is managed by the Zero team in their zero metadata, but there's a ticket out there about moving it somewhere more shared/public so more eyes are updating it [16:25:58] thcipriani: thanks for doing it! [16:26:45] ebernhardson: looking over your patches, does wpevents.php go before the js? vice-versa? [16:26:49] jhobs: np :) [16:26:54] https://phabricator.wikimedia.org/T89838 for the proxy thing above [16:27:01] thcipriani: php file first [16:27:07] kk [16:27:46] currently all that's really in the data is OperaMini's proxies and a similar set of proxies from Nokia for their older/smaller phones. 
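The X-Real-IP / X-Client-IP distinction described above can be sketched roughly as follows. This is an illustration in Python, not the actual VCL: the trusted network is a placeholder from the documentation address ranges, and using the right-most X-Forwarded-For entry to "see through" a trusted proxy is an assumption about how such a lookup might work.

```python
import ipaddress

# Placeholder network (documentation range); the real trusted-proxy
# dataset is maintained elsewhere and is not reproduced here.
TRUSTED_PROXIES = [ipaddress.ip_network("192.0.2.0/24")]


def resolve_client_ip(real_ip, forwarded_for, trusted=TRUSTED_PROXIES):
    """Return (x_real_ip, x_client_ip) for one request.

    real_ip is the address that connected to the edge cache.
    forwarded_for is the raw X-Forwarded-For header value, or None.
    """
    addr = ipaddress.ip_address(real_ip)
    if forwarded_for and any(addr in net for net in trusted):
        # Assumed rule: the right-most entry is the hop the trusted proxy saw.
        candidate = forwarded_for.split(",")[-1].strip()
        try:
            ipaddress.ip_address(candidate)
            return real_ip, candidate
        except ValueError:
            pass  # Malformed header: fall back to the connecting address.
    # Untrusted or direct connection: both headers carry the same address.
    return real_ip, real_ip
```

With these placeholders, a request arriving from `192.0.2.10` with `X-Forwarded-For: 198.51.100.9` would get X-Real-IP `192.0.2.10` and X-Client-IP `198.51.100.9`, while a request from any address outside the trusted list keeps the same value in both.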
[16:28:36] (03PS6) 10Dzahn: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [16:28:38] (03PS1) 10Jcrespo: Allow enabling ssl (tls, in MariaDB's terminilogy) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/252445 [16:28:41] (03PS1) 10Giuseppe Lavagetto: nrpe: fix description for monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/252446 [16:29:36] 6operations, 6Labs, 3Labs-Sprint-101: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1799216 (10yuvipanda) 5Open>3Resolved These are paging from icinga now. [16:29:54] (03PS2) 10Jcrespo: Allow enabling ssl (tls, in MariaDB's terminology) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/252445 [16:30:42] (03CR) 10Giuseppe Lavagetto: [C: 032] nrpe: fix description for monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/252446 (owner: 10Giuseppe Lavagetto) [16:30:48] (03PS7) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [16:31:10] (03PS8) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [16:31:46] (03PS4) 10BBlack: geoip.inc: use X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/252442 (https://phabricator.wikimedia.org/T89688) [16:31:58] 6operations, 10Incident-20150825-Redis, 5Patch-For-Review: Enable memory overcommit for all redis hosts with persistance - https://phabricator.wikimedia.org/T91498#1799223 (10yuvipanda) a:5yuvipanda>3None [16:32:28] (03PS2) 10Giuseppe Lavagetto: pybal: allow turning on using etcd for configuration [puppet] - 10https://gerrit.wikimedia.org/r/252243 [16:32:37] 6operations, 7discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1799228 (10yuvipanda) a:5yuvipanda>3None 
[16:32:42] <_joe_> bblack: ^^ looks good to you? [16:33:04] 6operations, 6Labs, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1799230 (10yuvipanda) 5Open>3Resolved Verified. [16:33:42] !log thcipriani@tin Synchronized php-1.27.0-wmf.6/extensions/WikimediaEvents/WikimediaEvents.php: SWAT: Restore satisfaction schema and fix the performance issue that it had part I [[gerrit:252347]] (duration: 00m 40s) [16:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:59] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1799234 (10mobrovac) 3NEW [16:34:04] (03CR) 10Jcrespo: [C: 032] Allow enabling ssl (tls, in MariaDB's terminology) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/252445 (owner: 10Jcrespo) [16:34:20] !log thcipriani@tin Synchronized php-1.27.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: Restore satisfaction schema and fix the performance issue that it had part II [[gerrit:252347]] (duration: 00m 30s) [16:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:26] ^ ebernhardson check please [16:35:03] looking [16:35:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [16:37:06] (03CR) 10BBlack: [C: 031] pybal: allow turning on using etcd for configuration [puppet] - 10https://gerrit.wikimedia.org/r/252243 (owner: 10Giuseppe Lavagetto) [16:37:09] thcipriani: looks good [16:37:26] _joe_: that template is awful to read, but seems reasonable! 
:) [16:37:26] ebernhardson: kk, continuing with .5 [16:37:44] <_joe_> bblack: it's actually wrong, but well, fixing it [16:38:32] (03PS1) 10Jcrespo: Enable SSL configuration on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/252447 [16:38:55] (03PS3) 10Giuseppe Lavagetto: pybal: allow turning on using etcd for configuration [puppet] - 10https://gerrit.wikimedia.org/r/252243 [16:39:47] (03Abandoned) 10BBlack: Turn on instrumentation for pybal 1.12+ [puppet] - 10https://gerrit.wikimedia.org/r/252216 (owner: 10BBlack) [16:41:59] (03CR) 10BBlack: [C: 04-1] "Beta was fixed with varnish restarts. This shouldn't be an issue, but leaving this hanging around for now in case we need it for some oth" [puppet] - 10https://gerrit.wikimedia.org/r/252385 (https://phabricator.wikimedia.org/T118362) (owner: 10Faidon Liambotis) [16:43:14] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: allow turning on using etcd for configuration [puppet] - 10https://gerrit.wikimedia.org/r/252243 (owner: 10Giuseppe Lavagetto) [16:45:57] !log thcipriani@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaEvents/WikimediaEvents.php: SWAT: Restore satisfaction schema and fix the performance issue that it had part I [[gerrit:252349]] (duration: 00m 30s) [16:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:35] !log thcipriani@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: Restore satisfaction schema and fix the performance issue that it had part II [[gerrit:252349]] (duration: 00m 30s) [16:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:39] ^ ebernhardson check please [16:46:51] (03PS2) 10Jcrespo: Enable SSL configuration on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/252447 [16:47:18] looking [16:49:15] (03PS3) 10Jcrespo: Enable SSL configuration on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/252447 [16:50:17] thcipriani: 
had an event come in with the wrong revision id, will need to wait a few minutes to see if it's a recurring problem or just a one-off deploy thing [16:50:27] ebernhardson: kk [16:52:49] thcipriani: other events flowing through kafka look correct, i think that was just a one off. should be all set [16:53:00] ebernhardson: kk, thanks for checking! [16:53:49] (03PS4) 10Jcrespo: Enable SSL configuration on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/252447 [16:55:01] (03CR) 10Jcrespo: [C: 032] Enable SSL configuration on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/252447 (owner: 10Jcrespo) [16:55:44] 6operations, 6Services: Switch RESTBase to use service::node - https://phabricator.wikimedia.org/T118401#1799268 (10GWicke) 3NEW [17:02:50] those "102 pending alerts" are scary [17:03:18] (not alerts, checks) [17:04:35] 6operations, 7Monitoring: improve reqstats error alerts - https://phabricator.wikimedia.org/T98450#1799310 (10Nuria) [17:04:37] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1799309 (10Nuria) [17:04:40] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1799308 (10Nuria) 5Open>3Resolved [17:06:32] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:06:57] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1799331 (10bd808) >>! In T103886#1671438, @mmodell wrote: >>>! In T103886#1402775, @faidon wrote: >>... [17:09:29] godog: it looks like rollup aggregation for daily to weekly metrics in graphite wouldn't actually be that bad, as min, max and last would all be useful. just avg and sum would be pointless!
[17:13:32] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1799344 (10jcrespo) This patch allows the configuration: https://gerrit.wikimedia.org/r/#/c/252447 Please let me continue using SSL terminology, as it is the one used by mysql server for... [17:14:47] addshore: mhh you'd get to pick one aggregation though? unless going through statsd then yes you get derived metrics [17:16:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:20:43] godog: I started trying to look at alternatives I could just set up on labs, any suggestions? I already have blueflood, influxdb and opentsdb [17:21:54] dcausse: let me know when you are done testing and i will cr your latest patch [17:22:35] nuria: dunno if I'll have time to finish this today :( [17:22:45] dcausse: np at all [17:22:50] I'll add a comment to the patch when I'm done [17:22:56] dcausse: you let me know, [17:23:21] 6operations, 10Continuous-Integration-Infrastructure: puppet compiler wrongly indicates errors when dealing with subrepositories - https://phabricator.wikimedia.org/T118406#1799356 (10jcrespo) 3NEW [17:25:33] for some reason strontium failed to get updated from palladium on my commit [17:27:03] oh, I know what it is [17:27:14] if you manually merge a subrepository [17:27:30] and then merge a change that includes a subrepository update [17:27:49] that has no effect, the hook fails to update the subrepository on the slave [17:28:02] if that makes sense to you [17:28:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[17:28:33] and this is something that joe already warned me about [17:29:00] subrepos are evil [17:29:31] (03PS3) 10Faidon Liambotis: dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 [17:29:36] (03CR) 10Faidon Liambotis: [C: 032 V: 032] dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 (owner: 10Faidon Liambotis) [17:29:52] addshore: no specific suggestion sadly, influxdb seems promising but young [17:30:22] (03PS6) 10Faidon Liambotis: Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [17:30:28] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 (owner: 10Faidon Liambotis) [17:32:08] (03PS3) 10Faidon Liambotis: snapshot: create a proper role::snapshot [puppet] - 10https://gerrit.wikimedia.org/r/246828 [17:32:10] (03CR) 10Faidon Liambotis: [C: 032 V: 032] snapshot: create a proper role::snapshot [puppet] - 10https://gerrit.wikimedia.org/r/246828 (owner: 10Faidon Liambotis) [17:32:40] (03PS3) 10Faidon Liambotis: dataset: inline the non-role role classes [puppet] - 10https://gerrit.wikimedia.org/r/246826 [17:32:49] (03CR) 10Faidon Liambotis: [C: 032 V: 032] dataset: inline the non-role role classes [puppet] - 10https://gerrit.wikimedia.org/r/246826 (owner: 10Faidon Liambotis) [17:34:45] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1799370 (10fgiunchedi) yup, works for me, both instances are up but can be deleted/recreated at will for tests too [17:39:13] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: Connection refused [17:42:01] that's me ^ still bootstrapping [17:45:47] 6operations, 5Patch-For-Review: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1799375 
(10fgiunchedi) what would be the action on this page? also generally I think the higher level the pages are the better, in other words could we alert on the effect as percei... [17:49:15] (03PS1) 10Jhobs: Copy wmgQuickSurveysConfig to wgQuickSurveysConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) [17:57:33] (03CR) 10BryanDavis: [C: 031] Copy wmgQuickSurveysConfig to wgQuickSurveysConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: 10Jhobs) [17:58:14] (03CR) 10Florianschmidtwelzow: [C: 04-1] Copy wmgQuickSurveysConfig to wgQuickSurveysConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: 10Jhobs) [17:59:25] (03CR) 10Florianschmidtwelzow: First QuickSurvey for reader segmentation research - external survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [17:59:55] (03CR) 10Jhobs: Copy wmgQuickSurveysConfig to wgQuickSurveysConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: 10Jhobs) [18:04:18] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1799413 (10jcrespo) a:3jcrespo [18:04:31] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1799414 (10chasemp) [18:07:02] 6operations, 10Wikimedia-General-or-Unknown, 7HHVM: HHVM gives incorrect results for certain PCRE patterns - https://phabricator.wikimedia.org/T73922#1799417 (10Anomie) After some further testing, it //is// a PCRE bug, and it looks like it was fixed in 8.32 (we're using 8.31). The reason it only occurs in HH... 
[18:07:44] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [18:08:22] 6operations, 10Wikimedia-General-or-Unknown, 7HHVM: HHVM and PCRE v8.31 gives incorrect results for certain PCRE patterns - https://phabricator.wikimedia.org/T73922#1799423 (10Anomie) [18:08:41] (03PS1) 10Jcrespo: Repool db2067 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252459 [18:09:21] (03PS2) 10Faidon Liambotis: sentry: use $::mail_smarthost for SMTP_HOST [puppet] - 10https://gerrit.wikimedia.org/r/250373 (https://phabricator.wikimedia.org/T116709) (owner: 10Gergő Tisza) [18:09:36] (03CR) 10Faidon Liambotis: [C: 032] sentry: use $::mail_smarthost for SMTP_HOST [puppet] - 10https://gerrit.wikimedia.org/r/250373 (https://phabricator.wikimedia.org/T116709) (owner: 10Gergő Tisza) [18:13:30] (03CR) 10Faidon Liambotis: [C: 031] geoip.inc: use X-Client-IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252442 (https://phabricator.wikimedia.org/T89688) (owner: 10BBlack) [18:16:23] (03CR) 10Faidon Liambotis: [C: 04-1] "I think this should be a top-level field, not an X-Analytics subfield. This would allow us using it in e.g. the JSON logs. Once we have th" [puppet] - 10https://gerrit.wikimedia.org/r/252427 (owner: 10BBlack) [18:16:40] (03CR) 10Jcrespo: [C: 032] Repool db2067 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252459 (owner: 10Jcrespo) [18:18:30] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2067 after maintenance (duration: 00m 29s) [18:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:34] there was a db1045 spike of failed connections some minutes ago [18:20:45] (before the deploy) [18:21:09] 6operations, 10RESTBase: API portal works on domains without RESTBase, but lacks styling - https://phabricator.wikimedia.org/T118410#1799471 (10Krenair) 3NEW [18:21:33] (03CR) 10Faidon Liambotis: [C: 04-1] "I'm on the fence about it. 
The office isn't doing anything very special with NAT -- NAT sucks, but is also pretty common." [puppet] - https://gerrit.wikimedia.org/r/252439 (owner: BBlack) [18:21:36] operations, RESTBase: API portal loads on domains without RESTBase, but lacks styling - https://phabricator.wikimedia.org/T118410#1799478 (Krenair) [18:22:20] not very large, I must say [18:26:04] nothing to worry about, as far as I can see, just some queries pending to be optimized [18:35:34] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [18:37:04] (PS2) Jhobs: Refactor wmgQuickSurveysConfig to wgQuickSurveysConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) [18:50:08] (PS1) Muehlenhoff: Assign salt grains for some additional hosts [puppet] - https://gerrit.wikimedia.org/r/252487 [18:50:10] (PS1) Muehlenhoff: Assign salt grains for labstore [puppet] - https://gerrit.wikimedia.org/r/252488 [18:50:12] (PS1) Muehlenhoff: Assign salt grains for puppetmaster frontend [puppet] - https://gerrit.wikimedia.org/r/252489 [18:50:14] (PS1) Muehlenhoff: Assign salt grains for nodepool [puppet] - https://gerrit.wikimedia.org/r/252490 [18:53:36] (PS2) Muehlenhoff: Assign salt grains for some additional hosts [puppet] - https://gerrit.wikimedia.org/r/252487 [18:53:43] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for some additional hosts [puppet] - https://gerrit.wikimedia.org/r/252487 (owner: Muehlenhoff) [18:54:46] (PS2) Muehlenhoff: Assign salt grains for labstore [puppet] - https://gerrit.wikimedia.org/r/252488 [18:54:59] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for labstore [puppet] - https://gerrit.wikimedia.org/r/252488 (owner: Muehlenhoff) [18:55:31] (PS2) Muehlenhoff: Assign salt grains for puppetmaster frontend [puppet] -
https://gerrit.wikimedia.org/r/252489 [18:55:40] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for puppetmaster frontend [puppet] - https://gerrit.wikimedia.org/r/252489 (owner: Muehlenhoff) [18:56:14] (PS2) Muehlenhoff: Assign salt grains for nodepool [puppet] - https://gerrit.wikimedia.org/r/252490 [18:59:27] (PS1) Andrew Bogott: Move wikitech to the keystone v3 api. [mediawiki-config] - https://gerrit.wikimedia.org/r/252491 [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151111T1900). Please do the needful. [19:05:18] !log started nfs-exports service on labstore1001 [19:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:55] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [19:09:02] !log nfs-exports on labstore1001 failed because of http failure of wikitech [19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:44] operations, Continuous-Integration-Infrastructure, Jenkins, WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 / 1.625.2 - https://phabricator.wikimedia.org/T118158#1799543 (hashar) The pre announce of the security release was the reason to bump the LTS version. Th... [19:22:27] Raymond_, hi [19:22:43] Krenair: hi [19:22:54] do you know who runs wikilovesmonuments.org? [19:23:20] I ask for two reasons [19:23:28] operations, Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1799556 (yuvipanda) a:yuvipanda>EBernhardson @Ebernhardson I guess this is kindof 'done' now? Can you update ticket?
[19:23:44] 1) They didn't update their MX records when the WMF mail servers changed, and mail going into OTRS is probably broken now [19:23:47] Depends on the definition of "runs". [19:23:52] Krenair: asking my other half. just a second [19:23:54] 2) https://phabricator.wikimedia.org/T118388 [19:25:12] Krenair: maybe jzerebecki can fix it :p [19:25:18] because the domain is with WMDE [19:26:15] Krenair: maybe Lodewijk knows more [19:27:40] Krenair: my wife was in the international WLM orga 2 years ago but she does not know anything about the current situation [19:27:58] mutante|away: is ns.namespace4you.de. domainfactory? [19:28:26] jzerebecki: yea [19:29:21] eh Organisation: FOURTYSIX Rechenzentrum GmbH [19:29:38] for namespace4you.de itself [19:30:08] but at the same street address :) [19:30:12] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1799567 (Kelson) I guess mwoffliner instances have already reached sporadically this limit because under high load it seems time to time the API simply starts refusing to answer. [19:30:13] mutante|away, Krenair: ok then I can fix the MX record, but instead of doing that in the webinterface we should point it to the wmf ns servers [19:30:23] mutante|away: yea also matches with wikimedia.de [19:30:55] i meant perhaps we should point it to the wmf ns servers [19:31:50] should be ok-ed with a few other people before actually doing it [19:33:48] yea..on the "ok-ed with a few other people".. *nod*, that sounds right [19:33:57] paravoid: ^ does it?
[19:37:28] !log deploying 1.27.0-wmf.6 to group1 [19:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:18] (PS1) 20after4: group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252494 [19:39:49] (CR) 20after4: [C: 2] group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252494 (owner: 20after4) [19:40:10] (PS1) Muehlenhoff: Assign salt grains for db servers [puppet] - https://gerrit.wikimedia.org/r/252495 [19:40:15] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252494 (owner: 20after4) [19:43:25] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for nodepool [puppet] - https://gerrit.wikimedia.org/r/252490 (owner: Muehlenhoff) [19:43:59] (PS2) Muehlenhoff: Assign salt grains for db servers [puppet] - https://gerrit.wikimedia.org/r/252495 [19:48:39] (PS3) Muehlenhoff: Assign salt grains for db servers [puppet] - https://gerrit.wikimedia.org/r/252495 [19:48:48] (PS4) Muehlenhoff: Assign salt grains for db servers [puppet] - https://gerrit.wikimedia.org/r/252495 [19:48:57] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.6 [19:49:02] fwrite(): send of 61 bytes failed with errno=32 Broken pipe in /srv/mediawiki/php-1.27.0-wmf.5/vendor/nmred/kafka-php/src/Kafka/Socket.php on line 330 [19:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:29] ^ that log message is in the hhvm.log 10356289 times over [19:50:56] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1799612 (BBlack) I'm hesitant about this. 50/s is considered fairly high - we intend to eventually **lower** that number as we improve the ratelimiter to avoid special cases in more-natural...
[19:51:42] !log Started rebuildItemsPerSite for wikidatawiki on mw1152. Feel free to kill, should it cause troubles. [19:51:48] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for db servers [puppet] - https://gerrit.wikimedia.org/r/252495 (owner: Muehlenhoff) [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:20] operations, Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1799634 (Nemo_bis) Why is https://lists.wikimedia.org/mailman/listinfo/usergroups still available? At least the description should link its actual location. [19:57:46] operations, Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1799637 (JohnLewis) because it was re-created in T99443. [19:58:54] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:04:14] (CR) BBlack: "My counterpoints would be:" [puppet] - https://gerrit.wikimedia.org/r/252439 (owner: BBlack) [20:10:01] twentyafterfour: that would be the search logging to Kafka by ebernhardson and the discovery folks [20:11:56] twentyafterfour: enabled by -- https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L4356-L4361 [20:23:57] Redis exception on server "rdb1001.eqiad.wmnet" 66243x in the last 15 minutes [20:24:04] Trending on https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [20:26:12] AaronSchulz: They all come from job runners [20:26:36] Started 19:48 after wikiversions was switched [20:27:15] twentyafterfour: [20:33:55] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:34:31] this is me checking something ^^ [20:37:03] testing in production again, gwicke?
:) [20:37:33] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [20:37:42] (PS1) Merlijn van Deen: toollabs: make sure /tmp and swap are large for all exec hosts [puppet] - https://gerrit.wikimedia.org/r/252506 [20:37:44] YuviPanda: ^. Probably will get -1'ed by jenkins, but please check if this makes sense anyway [20:38:23] YuviPanda: debugging an interaction with Parsoid that requires specific job queue requests that we don't have in staging [20:38:34] see -parsoid [20:38:52] valhallasw`cloud: hmm I think the inheritance will fuck things up, I remember trying to do that and nope-ing out at some point [20:40:25] it might [20:40:27] it's puppet [20:40:27] :( [20:40:39] Krinkle: ... [20:41:21] twentyafterfour: It started the same minute as the version switch, it overtakes all other error channels. [20:42:50] so, roll back? anyone have a clue what changed? [20:44:31] I don't know. I just noticed it. [20:44:42] I figured someone else would've noticed it by now, but apparently not. [20:44:49] Not sure what the impact is. [20:45:17] yeah it didn't even show up on the fatalmonitor-group1 dashboard [20:46:00] they're not page view fatals though [20:46:09] they're caught exceptions from job runners [20:46:14] yeah [20:46:22] I'm gonna roll it back [20:47:40] valhallasw`cloud: yeah, should make it minimal and just include it in the web role [20:48:09] (CR) Tim Landscheidt: toollabs: make sure /tmp and swap are large for all exec hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/252506 (owner: Merlijn van Deen) [20:48:50] Hm.. interesting fatalmonitor in logstash also includes notices and warnings? [20:49:11] yes, YuviPanda, what /is/ the problem for the precise instances =p [20:49:28] hmm? [20:49:30] oh [20:49:32] (PS1) 20after4: group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/252509 [20:49:33] what is the problem you mean?
[20:49:37] ya [20:49:44] and why can't we apply and reboot later [20:49:50] we probably can sure [20:49:56] but that's a ticking time bomb! [20:50:09] hmm [20:50:11] webnodes are easier to restart [20:50:13] than exec nodes [20:50:15] so maybe not [20:51:00] but why is it? [20:51:07] it doesn't change stuff now and fixes stuff later =p [20:51:35] (CR) 20after4: [C: 2] group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/252509 (owner: 20after4) [20:51:35] it does change stuff now, no? it sets up swap [20:52:39] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/252509 (owner: 20after4) [20:52:46] valhallasw`cloud: but that's cool too. I just am afraid that the inheritance is a bigger change than it looks [20:52:54] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.5 [20:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:00] puppet compiler! oh, wait :p [20:53:46] valhallasw`cloud: I'm also not fully sure if Coren specifically asked to not do swap for webgrid nodes for some reason - i know he said something about 'we need not have memory limits in gridengine for it because X' [20:53:53] not fully sure what X is [20:58:35] valhallasw`cloud: what do you think we should do? I'm happy to just go with what you think is good, and be your +2ing meat puppet. I personally think that for today we should just put the swap definition in the webnode [20:59:06] YuviPanda: or we can leave it for now [20:59:15] kmlexport is alive again. for a bit =p [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151111T2100). Please do the needful.
[21:01:03] valhallasw`cloud: :P [21:01:06] valhallasw`cloud: that's an option [21:01:11] valhallasw`cloud: can you open a bug about missing swap? [21:01:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [21:02:43] !log starting parsoid deploy [21:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:00] (Abandoned) Muehlenhoff: Move base::firewall include in the kibana and logstash roles [puppet] - https://gerrit.wikimedia.org/r/251019 (owner: Muehlenhoff) [21:05:11] valhallasw`cloud: cool that works :D [21:05:25] ? [21:05:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [21:05:43] you mean kmlexport? :P [21:05:55] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/252435 (owner: Jcrespo) [21:06:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [21:06:09] at least now it will be killed by SGE rather than just not being able to allocate memory [21:06:47] !log synced new code + restarted parsoid on wtp1001 as canary; monitoring graphs for a little bit [21:06:50] valhallasw`cloud: the ticket I mean [21:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:58] aaah [21:07:19] * YuviPanda is still at the k8s hackathon, writing go [21:07:33] good to see you've dumped perl again ;-D [21:07:54] perl6 you mean [21:07:57] no relation to perl5 [21:08:04] :> [21:08:25] valhallasw`cloud: http://www.eclipse.org/che/ is super fascinating [21:09:05] YuviPanda: ooooh [21:09:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [21:10:28] valhallasw`cloud: so they're working on setting up
templates with openshift, so you can basically be like 'gimme a flask app' and then write it and deploy it from there [21:11:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [21:12:04] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1799843 (Ottomata) Is it time to consider creating a standalone repo for these schemas? If so, then that means it is time for repo name bikeshed,... [21:12:09] (CR) Merlijn van Deen: [C: -1] toollabs: make sure /tmp and swap are large for all exec hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/252506 (owner: Merlijn van Deen) [21:13:15] !log finished deploying parsoid sha 7ca999c1 [21:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:17:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [21:20:10] (CR) Andrew Bogott: [C: -2] "Hm, nope, this breaks things."
[mediawiki-config] - https://gerrit.wikimedia.org/r/252491 (owner: Andrew Bogott) [21:20:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [21:22:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [21:22:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [21:26:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [21:30:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [21:36:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [21:44:59] * YuviPanda is excited and pushing things to gerrit [21:59:29] operations, Patch-For-Review: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1799944 (aaron) >>! In T118331#1799375, @fgiunchedi wrote: > what would be the action on this page? > also generally I think the higher level the pages are the better, in other wo... [22:00:14] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [22:00:19] YuviPanda, why does that keep happening? [22:00:27] I thought that had been replaced?
[22:03:50] * YuviPanda is conferencing [22:07:20] (PS1) Krinkle: graphite: Clarify monitoring of graphite_threshold for reqstats.5xx [puppet] - https://gerrit.wikimedia.org/r/252584 [22:07:48] (PS2) Krinkle: graphite: Clarify description of graphite_threshold for reqstats.5xx [puppet] - https://gerrit.wikimedia.org/r/252584 [22:21:18] twentyafterfour: can you try the deploy again with https://gerrit.wikimedia.org/r/#/c/252588/ ? [22:21:39] (CR) Florianschmidtwelzow: Refactor wmgQuickSurveysConfig to wgQuickSurveysConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/252456 (https://phabricator.wikimedia.org/T113443) (owner: Jhobs) [22:40:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [22:44:14] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [23:04:00] (PS1) Yuvipanda: k8s: Switch to using new cmdline for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/252594 [23:04:31] (PS2) Yuvipanda: k8s: Switch to using new cmdline for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/252594 [23:04:41] (CR) Yuvipanda: [C: 2 V: 2] k8s: Switch to using new cmdline for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/252594 (owner: Yuvipanda) [23:06:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 7 below the confidence bounds [23:07:40] (PS1) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/252595 [23:08:31] (PS2) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/252595 [23:08:40] (CR) Yuvipanda: [C: 2 V: 2] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/252595 (owner: Yuvipanda) [23:11:56] bblack, hey. have you seen https://phabricator.wikimedia.org/T118428 ?
There's not supposed to be some sort of magic varnish rewriting of purges going on here, is there? [23:16:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 7 below the confidence bounds [23:28:25] AaronSchulz: ok [23:31:10] no idea why the change would be a problem, but it's suspiciously recent [23:34:20] (PS1) Yuvipanda: dynamicproxy: Do not include redis collector manually [puppet] - https://gerrit.wikimedia.org/r/252599 [23:34:43] (CR) jenkins-bot: [V: -1] dynamicproxy: Do not include redis collector manually [puppet] - https://gerrit.wikimedia.org/r/252599 (owner: Yuvipanda) [23:37:18] (PS2) Yuvipanda: dynamicproxy: Do not include redis collector manually [puppet] - https://gerrit.wikimedia.org/r/252599 [23:37:38] (CR) Yuvipanda: [C: 2 V: 2] dynamicproxy: Do not include redis collector manually [puppet] - https://gerrit.wikimedia.org/r/252599 (owner: Yuvipanda) [23:38:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 2 below the confidence bounds [23:44:57] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.6/includes/jobqueue/aggregator/: sync https://gerrit.wikimedia.org/r/#/c/252588/ (duration: 00m 32s) [23:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:44] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[23:46:30] (PS1) 20after4: group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252600 [23:46:43] (CR) 20after4: [C: 2] group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252600 (owner: 20after4) [23:47:03] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/252600 (owner: 20after4) [23:47:08] (PS1) Yuvipanda: k8s: Add UidEnforcer admission controller [puppet] - https://gerrit.wikimedia.org/r/252601 [23:47:31] (PS2) Yuvipanda: k8s: Add UidEnforcer admission controller [puppet] - https://gerrit.wikimedia.org/r/252601 [23:47:33] (CR) jenkins-bot: [V: -1] k8s: Add UidEnforcer admission controller [puppet] - https://gerrit.wikimedia.org/r/252601 (owner: Yuvipanda) [23:47:44] (CR) Yuvipanda: [C: 2 V: 2] k8s: Add UidEnforcer admission controller [puppet] - https://gerrit.wikimedia.org/r/252601 (owner: Yuvipanda) [23:48:37] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1800106 (GWicke) @bblack: The basic issue is that we are using a blanket limit across different APIs with vastly different costs. Some batch APIs let you submit 500 expensive logical requests... [23:48:50] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.6 [23:49:37] AaronSchulz: looks like it worked?
[23:49:49] I'm not seeing a spike in kibana [23:52:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [23:56:15] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/252581/1 is also a simple optimization related to that area [23:56:31] * AaronSchulz is working on something more long term as well [23:57:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds