[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T0000). Please do the needful. [00:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:38] (03CR) 10Thcipriani: [C: 031] "Probably also ought to remove 'scap/scap' in hieradata/common/role/deployment.yaml doesn't matter too much for this patch though." [puppet] - 10https://gerrit.wikimedia.org/r/271442 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [00:00:51] twentyafterfour: chasemp lgtm [00:00:59] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail [00:01:11] thcipriani: that's a blood oath so good enough [00:01:28] (03CR) 10Rush: [C: 032] "please test this some post but seems in order :)" [puppet] - 10https://gerrit.wikimedia.org/r/271442 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [00:01:29] :D [00:02:20] twentyafterfour: gtg? [00:03:04] https://gerrit.wikimedia.org/r/#/c/269561/ sits today for rebase and it's own window [00:03:38] chasemp: thanks! [00:03:59] I have more improvements to do on 269561 anyway, thanks to what I learned today with my labs testing [00:04:17] I should be able to get everything back to stable at least, plus deploy the mega cool updates [00:04:28] forms are coming? [00:04:28] in 1 hour [00:04:29] twentyafterfour: Is the coast clear for me to do the SWAT? [00:04:36] yep forms and a lot of other cool stuff [00:04:42] RoanKattouw: swat away [00:04:42] yay! [00:04:51] thanks chasemp ! (and bblack !) [00:05:02] OK cool [00:05:02] I saw a lot of activity from a bunch of releng people so I wasn't sure [00:05:16] :) we shouldn't be that scary [00:05:23] matt_flaschen, bmansurov: You guys here for your SWAT entries? [00:05:33] (03PS2) 10Bmansurov: Labs: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271443 (https://phabricator.wikimedia.org/T123980) [00:05:35] yes [00:06:30] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:00] RoanKattouw, yep. [00:07:19] bmansurov: The header says Dependency: https://gerrit.wikimedia.org/r/#/c/271263/ but that change was only merged this morning and is not deployed. Is the config change still OK to go out? [00:07:38] Oh, wait, it's -labs, nvm [00:07:47] (03CR) 10Catrope: [C: 032] Labs: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271443 (https://phabricator.wikimedia.org/T123980) (owner: 10Bmansurov) [00:08:00] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:08:15] (03Merged) 10jenkins-bot: Labs: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271443 (https://phabricator.wikimedia.org/T123980) (owner: 10Bmansurov) [00:09:41] RECOVERY - RAID on restbase1008 is OK: OK: Active: 7, Working: 7, Failed: 0, Spare: 0 [00:09:51] so, do we care about restbase1008? [00:09:57] oh, look at that, it's back [00:11:29] RoanKattouw: thanks, presumably it takes some time before I see the change? 
[00:11:45] bmansurov: Yeah lemme check on the status of beta-scap-eqiad [00:11:52] ok [00:12:17] It's supposed to be quick, but it's been breaking a lot recentl [00:12:47] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [00:12:55] in-progress, and has been green most of the day :P [00:13:09] labs is currently updating its code based on a timer that went off earlier [00:13:11] Once that finishes, it will update again with your config change [00:13:22] greg-g: 1008 is being reimaged [00:13:32] gwicke: gotcha [00:13:48] I can see if I can schedule a maintenance window in icinga [00:13:58] Also holy crap https://integration.wikimedia.org/ci/job/beta-scap-eqiad/90427/consoleFull shows a LOT of HHVM crashes [00:14:16] RoanKattouw: got it, thanks [00:14:18] yeah, stupid lightprocess [00:14:36] bmansurov: You might have to wait another 10 mins or so [00:14:43] np [00:15:15] 6Operations: Rise in "parent, LightProcess exiting" fatals - https://phabricator.wikimedia.org/T124956#2037335 (10greg) [00:15:31] (03PS1) 10BBlack: Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/271446 (https://phabricator.wikimedia.org/T127094) [00:16:35] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2037339 (10BBlack) Probably should've put this here instead of in the parent ticket: ``` I've executed bans for date matching anything up through (and including) Feb 5th now, w... [00:17:07] 6Operations, 7HHVM: Rise in "parent, LightProcess exiting" fatals - https://phabricator.wikimedia.org/T124956#1970973 (10greg) Soooooo, this is everywhere all the time. I've been told repeatedly that it's "known". This is the only task in Phabricator I can find about it. Can someone in the know tell us what's... [00:17:49] ACKNOWLEDGEMENT - Restbase root url on restbase1008 is CRITICAL: Connection refused gwicke This host is currently being reimaged - The acknowledgement expires at: 2016-02-18 05:16:43. [00:17:49] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused gwicke This host is currently being reimaged - The acknowledgement expires at: 2016-02-18 05:16:43. [00:17:50] ACKNOWLEDGEMENT - cassandra-a service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed gwicke This host is currently being reimaged - The acknowledgement expires at: 2016-02-18 05:16:43. [00:17:50] ACKNOWLEDGEMENT - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) gwicke This host is currently being reimaged - The acknowledgement expires at: 2016-02-18 05:16:43. [00:18:21] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: puppet fail [00:18:30] !log restart cassandra-a on restbase1008 after extending /srv [00:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:40] mobrovac urandom ^ [00:18:54] <-- off [00:19:01] wooow [00:19:03] thnx godog! [00:20:03] godog: thanks, indeed [00:20:10] and goodnight [00:25:19] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[00:26:04] matt_flaschen: Going to deploy your Flow change now; I missed the fact that it'd merged 8 mins ago because grrrit-wm restarted right then [00:27:00] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:27:05] Thanks [00:27:34] Whee and my screen just got flooded with LightProcess errors [00:27:37] RoanKattouw, this is the page move fix, so after it's deployed I'll sync up with quiddity to see if we can do the move tonight. [00:27:57] OK [00:28:10] 00:27:56 3 proxies had sync errors [00:28:12] wat [00:28:21] Ahm, has someone been messing with puppet? [00:28:25] !log catrope@tin Synchronized php-1.27.0-wmf.13/extensions/Flow: SWAT (duration: 02m 06s) [00:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:33] /usr/bin/sync-common is missing on 3 proxies and a whole lot of apaches [00:28:44] !log 00:28:25 64 apaches had sync errors , /usr/bin/sync-common missing [00:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious [00:29:12] matt_flaschen: Don't do anything yet, your change is only on about 2/3 of the cluster [00:29:45] RoanKattouw, I like those odds! j/k [00:29:59] RoanKattouw: yeah, there was a move to move to a packaged scap [00:30:21] what's your local path to sync-dir? is it in /usr/bin or /srv/deployment/scap/scap/bin ? [00:30:33] /usr/bin [00:30:54] $ which sync-dir [00:30:56] /usr/bin/sync-dir [00:31:15] My guess is that puppet hasn't yet run on the servers that failed. [00:31:23] Right [00:31:31] thcipriani: that's probably right [00:31:44] so if you run the sync-dir from /srv/deployment/scap/scap/bin/sync-common it'll likely work [00:31:46] If I rerun sync-dir from the old path, would that work? [00:31:50] OK, will try [00:31:51] I didn't intend to have that patch merge right before swat ;) [00:34:09] !log catrope@tin Synchronized php-1.27.0-wmf.13/extensions/Flow: Trying again (duration: 01m 50s) [00:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:14] (03PS2) 10Mobrovac: RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 [00:34:17] gah, my bad for poking people :/ [00:34:32] OK that succeeded [00:34:32] Thanks twentyafterfour [00:34:36] matt_flaschen: All set [00:34:40] thanks thcipriani :) [00:35:10] Oh right, my bad, thcipriani is the one who helped me :) [00:35:42] all us relengers blur together :D [00:36:03] thcipriani: is just faster...and thus more blurry [00:36:17] (03CR) 10BBlack: [C: 032] Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/271446 (https://phabricator.wikimedia.org/T127094) (owner: 10BBlack) [00:39:53] 6Operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2037466 (10BBlack) 5stalled>3Resolved resolving for now, unless something new pops up [00:45:57] twentyafterfour: shall we also do the "port 25" one? 
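[Editor's note: a minimal sketch of the workaround thcipriani describes above, assuming the pre-packaging trebuchet checkout of scap is still present on the deploy host. The directory and message mirror this SWAT, but this is a reconstruction, not the exact command that was run.]
```
# /usr/bin/sync-common is missing on targets where puppet has not yet installed
# the packaged scap, so invoke sync-dir from the old checkout path instead:
/srv/deployment/scap/scap/bin/sync-dir php-1.27.0-wmf.13/extensions/Flow 'SWAT: trying again'
```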
i would, if puppet is enabled again [00:46:21] or was there a reason not to that i didnt see [00:52:43] (03CR) 10Smalyshev: [C: 031] Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [00:54:03] mutante: yeah [00:54:27] I'm going to enable puppet in a few minutes once the database dump is finished (in case I need to revert ) [00:54:55] (03PS1) 10BBlack: lvs: remove rt_cache_rebuild_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/271453 [00:54:58] twentyafterfour: works for me:) [00:56:23] 6Operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2037572 (10BBlack) 5Open>3Resolved This experiment is done [00:56:51] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2037576 (10BBlack) 5Open>3Resolved [00:57:38] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2037579 (10BBlack) 5Open>3Resolved a:3BBlack [00:59:24] twentyafterfour: I'm going to head out for a bit, but godspeed and I'm looking forward to checking it out tonight :) [00:59:44] greg-g: thanks [01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T0100). Please do the needful. [01:08:14] !log stopped phd and started dumping phabricator's database to /srv/dumps/20160218.phabricator.sql.gz (just in case I need to roll back the update) [01:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:24:37] (03CR) 10Ori.livneh: [C: 04-1] navtiming: Improve parse_ua and add unit tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: 10Krinkle) [01:29:03] (03CR) 10Mattflaschen: "Thanks, James. Let me know when that's done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen) [01:45:30] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 78 failures [01:47:20] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet last ran 2 days ago [01:47:57] iridium is known, 1147 looking [01:49:01] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [01:49:19] !log ran puppet on iridium for testing [01:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:49:28] :) [01:49:28] !log about to bring down phabricator to do the upgrade [01:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:50:03] prepares for users joining and asking if phab is down :) [01:51:03] !log phab pre-upgrade: http://pastebin.com/RTmXfDhp [01:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:51:20] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:37] 6Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#2037655 (10Mattflaschen) 5Open>3Resolved a:3Mattflaschen It hasn't happened recently that I know of. We can a... 
[01:52:29] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:40] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:52:49] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:00] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:09] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:19] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:30] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:34] that one is frozen [01:54:09] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:55:26] (03CR) 10Mobrovac: "The change has been verified to work in deployment-prep. The puppet compiler is happy as well - https://puppet-compiler.wmflabs.org/1796/" [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [01:56:17] who broke phabricator? :) [01:56:20] FilesystemException [01:56:20] File system entity '/srv/phab/phabricator/resources/sprite/manifest/main-header.json' does not exist. [01:56:25] https://phabricator.wikimedia.org/T92127#2037658 [01:56:41] 17:54 < mutante> prepares for users joining and asking if phab is down :) [01:56:48] lol [01:56:50] yurik: see topic :) [01:56:59] now you tell me :-P [01:57:00] PROBLEM - nutcracker port on mw1147 is CRITICAL: Timeout while attempting connection [01:57:19] PROBLEM - HHVM processes on mw1147 is CRITICAL: Timeout while attempting connection [01:57:20] PROBLEM - nutcracker process on mw1147 is CRITICAL: Timeout while attempting connection [01:57:26] !log powercycled frozen mw1147 [01:57:27] shouldn't we show something nicer than a missing file system entity? [01:57:29] PROBLEM - Disk space on mw1147 is CRITICAL: Timeout while attempting connection [01:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:57:46] like what google shows for 404,with a broken robot :) [01:58:01] i'm sure we could point people to some lolcats [01:59:00] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 1 process with command name hhvm [01:59:01] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:59:01] RECOVERY - Disk space on mw1147 is OK: DISK OK [01:59:09] sad wikipe-tan ? 
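[Editor's note: the "powercycled frozen mw1147" entry above was done out-of-band through the host's management controller; below is a generic sketch of how such a power cycle is typically issued with ipmitool. The management address is a placeholder, not taken from the log, and the reset may equally have been done from the DRAC console.]
```
# power-cycle an unresponsive host via its BMC; -a prompts for the management password
ipmitool -I lanplus -H <mw1147 management address> -U root -a chassis power cycle
```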
[01:59:22] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:59:40] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [01:59:50] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [02:00:09] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [02:00:10] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 2 % full [02:00:10] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 69177 bytes in 0.916 second response time [02:00:21] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [02:00:39] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [02:00:39] RECOVERY - DPKG on mw1147 is OK: All packages OK [02:01:21] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 499 bytes in 0.425 second response time [02:01:51] twentyafterfour, if possible, it would be good to stop Phabricator from accessing its emails during upgrades, so we could still do actions by email, then after it went up again, it could go through and process them. [02:01:52] I got an error just now from it trying unsuccessfully to process an email during the upgrade. [02:03:29] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:03:33] hmm, stopping exim should do that [02:04:13] mutante: wanna apply the port 25 patch for phab? [02:04:18] twentyafterfour: ok, yes [02:04:48] (03PS5) 10Dzahn: phabricator: allow port 25 for ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [02:06:21] (03CR) 10Dzahn: [C: 032] phabricator: allow port 25 for ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [02:07:50] the ferm rule has been created, service refresh now [02:08:26] @iridium:~# ip6tables -L | grep smtp [02:08:26] ACCEPT tcp mx1001.wikimedia.org anywhere tcp dpt:smtp [02:08:30] there we go [02:08:39] email to phab now also via v6 [02:08:59] which should remove that strange delay that happened before when it tried until fallback to v4 [02:09:25] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 378 seconds [02:09:45] (03CR) 10Dzahn: "@iridium:~# ip6tables -L | grep smtp" [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [02:10:36] eh, there is no "comment" anymore in phab ? [02:10:40] but "add action" [02:11:37] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2037668 (10Dzahn) @iridium:~# ip6tables -L | grep smtp ACCEPT tcp mx1001.wikimedia.org anywhere tcp dpt:smtp ACCEPT tcp mx2001.wikimedia... [02:12:31] works fine though :) [02:16:27] twentyafterfour: Is the Phab update done? Phab seems to work for viewing tasks, but https://phabricator.wikimedia.org/project/sprint/board/1384/query/open/ fatals [02:17:14] RoanKattouw: still working on it ... 
but it's mostly usable [02:18:20] !log phabricator is back online, sprint extension is broken, I'm investigating [02:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:19:54] twentyafterfour: It's OK, https://phabricator.wikimedia.org/project/board/1384/query/open/ works and it has most of the same functionality [02:20:09] In the previous version, avatars of assignees weren't shown in the normal board view, only in the sprint view [02:20:20] That's the main reason I used it [02:20:25] The new board view has every feature that I care about from the sprint view, it looks like [02:20:37] and more [02:20:37] (I don't think it shows points but I don't care) [02:23:13] 6Operations, 10Wikimedia-Etherpad, 7Icinga: etherpad http monitoring, false positive - https://phabricator.wikimedia.org/T127269#2037704 (10Dzahn) [02:23:43] ACKNOWLEDGEMENT - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection refused daniel_zahn https://phabricator.wikimedia.org/T127269 [02:25:39] twentyafterfour: Is the notification server enabled because when i go to and click on the Notification button. At the bottom of it says this [02:25:39] Notification Server not enabled. [02:25:55] When you click on it, It redirects to https://secure.phabricator.com/book/phabricator/article/notifications/ [02:26:06] paladox: that's not new [02:26:12] it's been like that forevery [02:26:14] forever [02:26:21] twentyafterfour: Oh ok. [02:29:17] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 13m 55s) [02:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:42] 6Operations, 6Phabricator: T127053 - https://phabricator.wikimedia.org/T127270#2037729 (10Dzahn) [02:34:00] 6Operations, 6Phabricator: test ticket (T127053) - https://phabricator.wikimedia.org/T127270#2037730 (10Dzahn) [02:34:11] 6Operations, 6Phabricator: test ticket (T127053) - https://phabricator.wikimedia.org/T127270#2037720 (10Dzahn) 5Open>3Resolved [02:34:26] 6Operations, 6Phabricator: test ticket (T127053) - https://phabricator.wikimedia.org/T127270#2037720 (10Dzahn) [02:34:29] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2037732 (10Dzahn) [02:34:39] 6Operations, 6Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030881 (10Dzahn) [02:40:32] 6Operations, 6Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2037739 (10Dzahn) 5Open>3Resolved a:3Dzahn i created the linked test ticket by mailing task@ and it happen pretty much immediately. also: ``` root@mx1001:/etc/exim4# telnet -6 iridi... [02:41:00] nice little differences, like when pasting code blocks in ticket comments, it's now blue [02:41:10] which looks better [02:41:43] 7Blocked-on-Operations, 6Phabricator, 10scap, 3Scap3, 7WorkType-Maintenance: scap::target should use scap's debian package instead of trebuchet - https://phabricator.wikimedia.org/T127215#2037742 (10mmodell) 5Open>3Resolved [02:43:58] twentyafterfour: Is Phabricator still undergoing maintenance? I can't file a task. [02:44:42] Leah: i already created new tasks after the upgrade [02:44:56] what error are you getting? [02:46:01] Access Denied: Maniphest [02:46:01] You do not have permission to edit task policies. 
[02:46:01] Users with the "Can Edit Task Policies" capability: [02:46:01] This object has a custom policy controlling who can take this action. [02:46:16] This is when using the form at https://phabricator.wikimedia.org/maniphest/task/edit/form/1/ [02:46:48] whoahhowow, i can now perform non-trivial task changes while also leaving a comment :O yay [02:47:01] MatmaRex: Yeah, it's nice. [02:49:14] Leah: hmm, did it involve changing the defaults for "visible to" / "editable by" or all default public? [02:49:25] I didn't touch those fields. [02:49:49] the same worked for me and i'm also not an admin .. [02:50:02] let me make another one [02:50:44] MatmaRex: Does filing a task work for you? [02:50:55] Leah: it worked a few minutes ago. [02:50:56] (03PS1) 1020after4: Forward maniphest security extension to release/2016-02-18/2 [puppet] - 10https://gerrit.wikimedia.org/r/271461 [02:51:11] i just made https://phabricator.wikimedia.org/T127271 [02:51:21] last one i filed was https://phabricator.wikimedia.org/T127267 [02:51:22] and we can turn that into a bug report for the issue you describe [02:51:26] mutante: would you mind +2 on this? https://gerrit.wikimedia.org/r/#/c/271461/ [02:51:27] * Leah looks. [02:51:36] it's just to get things consistent with the state of the repos on iridium [02:51:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 11m 20s) [02:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:59] twentyafterfour: ok [02:54:12] (03CR) 10Dzahn: [C: 032] Forward maniphest security extension to release/2016-02-18/2 [puppet] - 10https://gerrit.wikimedia.org/r/271461 (owner: 1020after4) [02:55:24] mutante: Thanks, I updated https://phabricator.wikimedia.org/T127271 [02:55:30] Maybe my account is just weird? [02:56:35] Leah: oh, it's you :) [02:56:39] :-) [02:57:07] Another user report in #mediawiki about what seems like the same issue. [02:57:23] Leah: the ticket looks great now, the issue is being investigated already [02:57:28] Cool. [03:01:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Feb 18 03:01:01 UTC 2016 (duration 9m 24s) [03:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:23] (03CR) 10Alex Monk: "Shouldn't these be in /static/images instead of upload.wikimedia.org (where they'd all need to be protected and would be subject to differ" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen) [03:02:42] Leah: Can you make a task now? [03:04:00] Trying. [03:05:43] twentyafterfour: Works now. [03:08:58] cool [03:11:40] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: puppet fail [03:13:14] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0 seconds [03:13:40] !log running puppet one last time on iridium. Phabricator upgrade successful with just a few minor issues now resolved. [03:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:18:11] twentyafterfour: FYI I filed https://phabricator.wikimedia.org/T127276 ; should I tag it with something to indicate it's an upgrade regression? [03:26:34] RoanKattouw: just badly worded. [03:26:58] the warning is sort-of correct. 
It's just warning that the policy controls are fully functional and shouldn't be changed unless you know what you are doing [03:27:24] I'll solicit feedback on a better way to present that [03:27:50] twentyafterfour: Maybe if the dangerous form fields themselves could be marked as such, that'd be better? [03:28:08] The thing is the warning is correct, but it also appears on every single edit form I ever see, because I have the right to change those fields [03:28:58] So it'll be annoying for me (and others with these rights) to constantly see "BEEP BEEP BEEP WATCH OUT WARNING DANGER" for every edit, and we'll just get desensitized to it eventually [03:31:40] RoanKattouw: I toned it down a bit for now. Will try to come up with a better solution [03:40:09] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [04:03:14] ahem, i try to create a new page on metawiki, and it looses my session cookie on save [04:03:23] is that just me? [04:04:42] hmm, weird, works from privacy mode after login [04:15:48] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2037984 (10Deskana) [04:35:21] (03CR) 10Tim Landscheidt: [C: 04-1] ci: split and move role classes to modules/role/ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [04:37:01] (03CR) 10Andrew Bogott: [C: 031] "Looks good to me -- if you've tested it and it works, I will merge tonight so we're ready for reboots tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [04:37:12] 6Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 3 others: Migrate CXServer to Node 4.2 and Jessie - https://phabricator.wikimedia.org/T107307#2038016 (10mobrovac) Thank you guys! [04:40:37] (03CR) 10Tim Landscheidt: [C: 031] postgres: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260611 (owner: 10Dzahn) [04:43:36] (03CR) 10Tim Landscheidt: [C: 031] swift: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260610 (owner: 10Dzahn) [04:53:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [5000000.0] [05:00:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [05:21:10] RECOVERY - RAID on db1063 is OK: OK: optimal, 1 logical, 2 physical [05:44:32] (03CR) 10BryanDavis: "I haven't tested it as applied by puppet, but I developed and tested the upstart script itself on shaved-yak.mediawiki-core-team.eqiad.wmf" [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [05:48:57] (03PS1) 10Andrew Bogott: Move the labs pdns db server into hiera. 
[puppet] - 10https://gerrit.wikimedia.org/r/271465 [05:49:49] (03CR) 10Andrew Bogott: "ok, I won't do a blind merge at bedtime then :)" [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [05:53:11] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [06:09:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [06:11:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [06:11:09] 6Operations, 10Monitoring, 10Wikimedia-Etherpad, 7Icinga: etherpad http monitoring, false positive - https://phabricator.wikimedia.org/T127269#2038039 (10Peachey88) [06:18:00] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:20:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:21:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:29:40] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:10] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:40] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:51] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:41] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:00] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:10] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:30] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:32] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:01] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [06:55:51] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:56:19] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:56:20] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:01] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:57:20] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:20] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:57:29] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:41] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is 
currently enabled, last run 44 seconds ago with 0 failures [07:20:40] (03PS2) 10Giuseppe Lavagetto: logrotate: Convert geoipupdate to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270930 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [07:21:14] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] logrotate: Convert geoipupdate to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270930 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [07:58:08] (03PS2) 10Giuseppe Lavagetto: role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 [07:58:11] (03PS2) 10Giuseppe Lavagetto: ipresolve: add PTR resolution, tests [puppet] - 10https://gerrit.wikimedia.org/r/271259 [07:58:12] (03PS3) 10Giuseppe Lavagetto: role::memcached: add cross-dc Ipsec for the various shards. [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [07:58:15] (03PS3) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [07:58:17] (03PS4) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 [08:07:06] (03PS3) 10Giuseppe Lavagetto: role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 [08:22:17] (03PS4) 10Giuseppe Lavagetto: role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 [08:25:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:26:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:26:51] <_joe_> uhm let's see [08:27:16] <_joe_> this is a big peak, looking [08:32:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:33:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:33:32] only thing I see is puppet disabled on cp3034 [08:34:26] wasn't a datacenter depooled recently? [08:35:13] ah, ulsfo, not esams [08:38:54] <_joe_> yes [08:39:05] <_joe_> good morning btw :) [08:39:10] * _joe_ brb [08:40:06] I've woken up today almost an hour before you :-) [08:40:09] good morning [08:40:24] jynus: which must be some kind of world record [08:43:37] (03PS1) 10Alexandros Kosiaris: Fix etherpad monitoring [puppet] - 10https://gerrit.wikimedia.org/r/271472 (https://phabricator.wikimedia.org/T127269) [08:44:08] lol [08:44:21] <_joe_> jynus: uhm, you woke up at 6 am? [08:44:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix etherpad monitoring [puppet] - 10https://gerrit.wikimedia.org/r/271472 (https://phabricator.wikimedia.org/T127269) (owner: 10Alexandros Kosiaris) [08:44:30] more or less [08:44:36] <_joe_> or, to be fair, before 6 am? [08:44:40] no [08:56:05] 6Operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1979046 (10Naveenpf) We are completing 15 years of Wikipedia. Maps in Wikipedia needs an update .The effort required for creating static maps for many projects is tiresome... 
[09:09:26] 6Operations, 5Patch-For-Review, 7audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030067 (10fgiunchedi) confirmed working for swift, thanks @volans @joe ! ``` ms-fe2001:~$ sudo ls -latr /var/log/swift-access.log*... [09:11:24] twentyafterfour: much nicer phabricator theme, looks good! [09:16:15] <_joe_> godog: uh? I didn't notice much changes [09:16:30] * _joe_ not a frontend guy [09:18:16] _joe_: yeah it is more blue, also background for ``` blocks is blue/grey not yellow [09:24:19] I guess the phabricator upgrade is completed? [09:26:52] godog: Yes it is all done as far as im aware. [09:28:22] thanks paladox ! [09:29:00] godog the upgrade was done early this morning. [09:29:32] paladox: indeed, looks done according to https://tools.wmflabs.org/sal/log/AVLyXmxoW8txF7J0uT7r [09:30:04] godog Ok. [09:31:16] (03PS16) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [09:33:21] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7907 bytes in 0.008 second response time [09:36:00] 6Operations, 10Monitoring, 10Wikimedia-Etherpad, 7Icinga, 5Patch-For-Review: etherpad http monitoring, false positive - https://phabricator.wikimedia.org/T127269#2038322 (10akosiaris) 5Open>3Resolved a:3akosiaris Ah, indeed. sorry about that. Fixed in the above commit. Resolving [09:48:25] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271276 (owner: 10Elukey) [09:49:41] !log remove old restbase metrics under restbase.* from graphite1001 and graphite2001 [09:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:49] (03CR) 10Giuseppe Lavagetto: "DTRT according to the compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/271258 (owner: 10Giuseppe Lavagetto) [09:55:39] (03PS4) 10Elukey: Add kafka1012 back to the pool of kafka brokers in wmf-config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271276 [10:01:24] 6Operations, 10RESTBase-Cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#1888407 (10fgiunchedi) I think that has to do on how we aggregated `.count` metrics ``` [count] aggregationMethod = sum pattern = \.count$ xFilesFactor = 0.01 `... [10:01:26] 7Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 3Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2038408 (10fgiunchedi) a:3ArielGlenn [10:05:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 669 [10:10:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 617 [10:10:26] !log restarting hhvm on mw1* to put glibc update into effect [10:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:01] (03CR) 10Giuseppe Lavagetto: [C: 031] "Looks good to me. I added a few nitpick/general code comments, but I think this can be merged now." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [10:14:13] (03PS5) 10Giuseppe Lavagetto: role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 [10:15:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 (owner: 10Giuseppe Lavagetto) [10:17:50] !log rebooted kafka1014 for maintenance [10:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:18:04] (03PS1) 10Filippo Giunchedi: admin: add nuria/mforns/milimetric to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/271479 (https://phabricator.wikimedia.org/T126752) [10:19:21] PROBLEM - Host kafka1014 is DOWN: PING CRITICAL - Packet loss = 100% [10:19:24] 6Operations, 5Patch-For-Review: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574#2038466 (10fgiunchedi) a:3fgiunchedi [10:20:09] ---^ me [10:20:11] RECOVERY - Host kafka1014 is UP: PING OK - Packet loss = 0%, RTA = 1.90 ms [10:20:34] 6Operations, 7Graphite, 5Patch-For-Review: provide aggregated cluster data with graphite, similar to ganglia - https://phabricator.wikimedia.org/T119520#2038471 (10fgiunchedi) a:3fgiunchedi [10:25:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 2573214 Threads: 2 Questions: 18625110 Slow queries: 17204 Opens: 5533 Flush tables: 2 Open tables: 404 Queries per second avg: 7.238 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 48 [10:25:41] 6Operations, 6Discovery, 7Elasticsearch: elastic - large slow query logs / server runs out of disk space - https://phabricator.wikimedia.org/T122832#2038480 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi there's compression now ``` -rw-r--r-- 1 elasticsearch elasticsearch 2.1M Feb 18 08:14 production-s... [10:28:10] (03CR) 10Gehel: Ship Elasticsearch logs to logstash (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [10:28:17] 6Operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2038487 (10fgiunchedi) p:5Triage>3Normal still pending an audit of what varnish backends might be affected, particularly apache [10:28:42] PROBLEM - HHVM rendering on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:11] <_joe_> interesting, that machine is depooled [10:29:26] <_joe_> so I can bet a beer I know what the problem is without looking [10:29:31] <_joe_> anyone wanna bet? [10:30:31] RECOVERY - HHVM rendering on mw1046 is OK: HTTP OK: HTTP/1.1 200 OK - 69170 bytes in 0.220 second response time [10:31:00] <_joe_> heh, already recovered, you just lost one beer :P [10:31:11] (03PS17) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [10:31:15] 6Operations: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2038492 (10MoritzMuehlenhoff) [10:31:33] <_joe_> gehel: merge it at will as far as I'm concerned :) [10:31:39] so what was it then? [10:31:42] <_joe_> did you ever merge a patch in prod? 
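[Editor's note: a small aside on the "restarting hhvm on mw1* to put glibc update into effect" entry above — one common way to see which processes still map the old, now-deleted libc and therefore still need a restart. This is not a command from the log; run it as root, it only relies on the usual /proc layout.]
```
# every /proc/<pid>/maps that still references a deleted libc belongs to a
# process running against the pre-upgrade library and needs a restart:
sudo sh -c 'grep -l "libc-.*(deleted)" /proc/[0-9]*/maps' 2>/dev/null
```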
[10:31:50] <_joe_> apergos: I didn't look [10:31:55] :-D [10:31:55] _joe_: not yet [10:32:26] as this one requires a restart of the elasticsearch cluster, I'm going to wait for the upgrade to 1.7.5 to be ready and restart the cluster only once [10:33:04] <_joe_> oh yeah for sure [10:33:23] 6Operations: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2038499 (10MoritzMuehlenhoff) Chris, can you comment whether these should be reclaimed to spares or decomissioned? berkelium and curium were both bought in Jan 2011. [10:34:01] gehel: there's also the glibc update that we could bundle with the elastic 1.7.5 rolling restart [10:34:23] dcausse: yep, gehel and myself sorted that out [10:34:35] cool, thanks [10:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 649 [10:35:25] dcausse: glibc already updated, but will actually be active only after restart [10:35:34] sure [10:36:10] moritzm: I remember some discussion about rebooting the machines and not just individual processes. What was the decision? [10:38:28] (03PS1) 10Alex Monk: phabricator: redirect old task creation URL to new one [puppet] - 10https://gerrit.wikimedia.org/r/271485 (https://phabricator.wikimedia.org/T127286) [10:40:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 2574115 Threads: 2 Questions: 18650892 Slow queries: 17222 Opens: 5533 Flush tables: 2 Open tables: 404 Queries per second avg: 7.245 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:40:35] 6Operations, 5Patch-For-Review: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574#2038525 (10fgiunchedi) p:5Triage>3Normal [10:41:23] gehel: restarting the libc-using processes has turned out to be simpler so far and we have already made a cluster-wide reboot three weeks ago, so our kernels are currently all up-to-date [10:43:18] moritzm: ok, so same strategy for elasticsearch ... [10:43:58] yep [10:48:54] !log rebooted kafka1018 for maintenance [10:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:51:54] 6Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2030914 (10fgiunchedi) I think we shouldn't have mutt either, as bsd-mailx is already installed anyways and both come from exim4-base via ``` neodymium:~$ aptitude why mutt i exim4-base Suggests m... [10:56:26] 6Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2038548 (10MoritzMuehlenhoff) I'm not sure even sure why mutt is installed, "mail-reader" is only a Suggests: of exim4-base after all? 
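[Editor's note: gehel, dcausse and moritzm agree above to fold the glibc restart into the Elasticsearch 1.7.5 rolling restart; this is a sketch of the usual per-node sequence using the stock 1.x cluster-settings API. Host, port and timeout are illustrative, and this shows the generic procedure rather than the exact commands used in production.]
```
# stop shard reallocation before bouncing the node:
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
     -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
sudo service elasticsearch restart
# once the node has rejoined, re-enable allocation and wait for green
# before moving on to the next node:
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
     -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'
```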
[10:59:20] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [10:59:20] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [10:59:21] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:30] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [10:59:30] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:31] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:31] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:31] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:31] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:31] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:32] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:41] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [10:59:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [10:59:41] PROBLEM - IPsec on cp3019 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [10:59:51] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:00] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:00] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:01] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:01] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:01] PROBLEM - IPsec on cp3021 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:01] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:01] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:02] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:10] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:10] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:10] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:12] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:21] PROBLEM - IPsec on cp3022 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: 
kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:22] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:23] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:23] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:24] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:30] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:30] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:30] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:31] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:32] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:41] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:41] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:41] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:41] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:42] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:42] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:42] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:42] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:43] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:44] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:44] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:44] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:45] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:45] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:51] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:51] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:51] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:00:51] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:00:51] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:00:52] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 26 
not-conn: kafka1018_v4,kafka1018_v6 [11:01:11] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:01:11] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:01:11] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [11:01:11] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:01:20] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 [11:01:20] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 [11:02:49] ---^ kafka1018 is not coming up FYI, I may have already guessed it [11:02:56] *you may [11:03:18] I can't connect to the mgmt console either [11:04:52] <_joe_> uhm what do you mean? [11:05:14] it hangs without responding [11:05:21] (the ssh connection) [11:05:38] I tried with kafka1014 and it works [11:06:18] <_joe_> it's dead to pings too [11:06:46] yep [11:07:00] <_joe_> one possibility is we have a bad config in the dns [11:07:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [10.0] [11:07:53] 6Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2038562 (10fgiunchedi) indeed, it looks like it gets dragged in on the first install (even before puppet) ``` neodymium:~$ zgrep -e mutt -e mailx -e exim /var/log/dpkg.log.3.gz | grep 'install ' 201... [11:08:13] v*olans rised a good point about replication lag- with the new system, we will be getting the absolute number compared to the original slave [11:08:13] a similar/same error has also occured for other reboots, I think that happens if the system startup is failing _really_ early [11:08:30] last case was one of the elastic* nodes which had flagged an internal CPU error [11:08:36] that is great for graphing an mediawiki (logic) [11:08:41] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [10.0] [11:08:46] not so sure about icinga alerts [11:09:05] I'm pretty sure Chris will need to look into it [11:09:11] <_joe_> elukey: yeah I guess the machine is fried [11:09:20] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [11:09:28] I am not lucky with kafka recently [11:09:33] <_joe_> elukey: you probably need to ack it's down :P [11:09:41] <_joe_> and depool it ofc in mediawiki-config [11:10:02] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [10.0] [11:10:05] yep I was about to say the same thing [11:11:00] when you create a Phab task about it and make sure to add "ops-eqiad" in addition to "operations" to the Projects [11:11:04] _joe_ what do you mean ack it's down? I put a temporary block on icinga, I guess that I'll add a more permanent one [11:11:17] moritzm: susre [11:11:19] sure [11:11:20] <_joe_> elukey: at the kafka level, you don't need to do anything? [11:12:00] !log Jenkins: reloading configuration from disk. 
Some metadata are corrupted T127294 [11:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:12:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 78.26% of data above the critical threshold [10.0] [11:12:56] elukey: you can use the "Acknowledge checked hosts(s) problem" option in Icinga for that [11:13:09] which leads me to philosophical questions like "what is a service alert" Should we alert if there is nothing wrong with the host, but there is something else creating problems for that host? [11:13:11] _joe_ theoretically no, I am going to launch a leader election but 1018 is part of some partition repliacation and icinga will complain [11:13:29] <_joe_> ok, that is what I meant as "ack it" [11:13:31] <_joe_> :) [11:13:31] moritzm: all right doing it [11:13:45] okok :) [11:13:59] elukey: I'm getting a console on kafka1018 now, BTW [11:14:35] or rather I can now connect via SSH, but the serial console is entirely blank [11:15:05] not really good I suppose [11:15:09] :D [11:15:09] 6Operations, 10netops: mr1-ulsfo booted from backup JunOS - https://phabricator.wikimedia.org/T127295#2038580 (10faidon) [11:15:14] <_joe_> moritzm: I'd powercycle the machine after ensuring it's rebooting via the HDD [11:20:53] moritzm: sorry for the basic question - I don't find anything to ack in icinga, the previous alerts shouldn't have generated a page. Am I missing something? (the answer is probably yes) [11:21:56] service > check > Select command > Acknowledfe Checked Service(s) Problem [11:21:57] the ipsec alerts are hard to silence down, since they affect the other hosts with which the link is made [11:22:33] or if you go to the "host detail" page, you can tick the affected system there [11:24:36] Or Service Information > Service Commands > Acknowledge this service problem [11:25:30] <_joe_> moritzm: we'd need servicedependencies [11:25:44] <_joe_> but it's tricky and not exactly always what we want [11:25:58] <_joe_> we could have those checks not shower down on us here, maybe [11:26:03] dificult to implement as it dependes on the message, not the service [11:27:05] !log Jenkins web UI busy with 'jenkins.model.RunIdMigrator doMigrate' while it migrate build records. I did a bunch of cleanup yesterday. Jenkins runs jobs in the background just fine though. T127294 [11:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:29:12] !log mr1-ulsfo: "request system snapshot media internal slice alternate" + reboot (T127295) [11:29:15] 6Operations, 10netops: mr1-ulsfo booted from backup JunOS - https://phabricator.wikimedia.org/T127295#2038580 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVL0JBqh-0X0Il_jxsnQ} [2016-02-18T11:29:12Z] mr1-ulsfo: "request system snapshot media internal sli... 
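[Editor's note: elukey mentions launching a leader election above, since kafka1018's partitions are under-replicated while it is down; a sketch with the stock Kafka tool that rebalances partition leadership onto the preferred (or surviving) brokers. The zookeeper connect string is a placeholder assumption, not taken from the log.]
```
# re-elect preferred partition leaders across the cluster (run from any broker host):
kafka-preferred-replica-election.sh --zookeeper <zookeeper-host:2181/kafka-chroot>
```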
[11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:29:31] ACKNOWLEDGEMENT - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 Elukey Kafka1018 down after reboot [11:29:31] ACKNOWLEDGEMENT - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 Elukey Kafka1018 down after reboot [11:29:31] ACKNOWLEDGEMENT - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 Elukey Kafka1018 down after reboot [11:29:31] ACKNOWLEDGEMENT - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1018_v4,kafka1018_v6 Elukey Kafka1018 down after reboot [11:29:31] ACKNOWLEDGEMENT - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1018_v4,kafka1018_v6 Elukey Kafka1018 down after reboot [11:29:50] jynus thanks! [11:30:15] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2038660 (10fgiunchedi) >>! In T119935#2035934, @fgiunchedi wrote: > plan: > * as soon as the rebuild has finished (~6h) remount `/srv` and restart cassa... [11:31:07] not everybody is familiar with icinga, raw nagios is still very much in use [11:31:40] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [11:32:30] !log logical import of db1021 starting for data consistency check and defragmenting purposes [11:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:10] 534% cpu usage for the import, good enough [11:34:31] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.13 ms [11:34:49] !log Hard restarting Jenkins T127294 [11:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:22] !log upgrading mr1-ulsfo to its pre-recovery version and rebooting (T127295) [11:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:26] 6Operations, 10netops: mr1-ulsfo booted from backup JunOS - https://phabricator.wikimedia.org/T127295#2038687 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVL0Kqi2W8txF7J0uUA4} [2016-02-18T11:36:22Z] upgrading mr1-ulsfo to its pre-recovery version and re... 
[11:43:49] (03PS5) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) [11:56:30] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:56:50] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused [11:58:47] ACKNOWLEDGEMENT - Restbase root url on restbase1008 is CRITICAL: Connection refused Filippo Giunchedi down for raid expansion [11:58:47] ACKNOWLEDGEMENT - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi down for raid expansion [12:01:42] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [12:02:16] 6Operations, 10DBA, 5Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#2038776 (10jcrespo) [12:04:27] 6Operations, 10DBA, 5Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#2038786 (10jcrespo) This is semi-working now. We have to decide what to show on icinga for multi-tier slaves and fix some issues on multi-sourc... [12:05:45] 6Operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2038791 (10faidon) 9300 is the internal transport port, right? As these are independent clusters, we're not using that across the datacenter barrier,... 
[12:06:01] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.17 ms [12:06:40] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 28 ESP OK [12:06:40] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK [12:06:41] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 20 ESP OK [12:06:50] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK [12:06:51] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [12:06:51] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 38 ESP OK [12:06:51] RECOVERY - IPsec on cp3021 is OK: Strongswan OK - 20 ESP OK [12:06:51] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 28 ESP OK [12:06:52] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK [12:06:52] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK [12:06:52] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [12:06:52] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK [12:07:01] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 28 ESP OK [12:07:01] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 28 ESP OK [12:07:01] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [12:07:01] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 20 ESP OK [12:07:01] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [12:07:02] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [12:07:03] 6Operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2038800 (10faidon) That reminds me: nginx has been working fairly well for our frontend use-case (which is a huge plus), but on the other hand perhap... [12:07:12] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 38 ESP OK [12:07:12] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK [12:07:12] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 20 ESP OK [12:07:12] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK [12:07:12] RECOVERY - IPsec on cp3022 is OK: Strongswan OK - 20 ESP OK [12:07:12] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [12:07:12] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 28 ESP OK [12:07:12] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 38 ESP OK [12:07:12] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK [12:07:21] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [12:07:21] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 28 ESP OK [12:07:21] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK [12:07:21] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 28 ESP OK [12:07:21] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [12:07:21] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 28 ESP OK [12:07:22] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 38 ESP OK [12:07:22] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 28 ESP OK [12:07:22] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 20 ESP OK [12:07:23] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 38 ESP OK [12:07:30] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 20 ESP OK [12:07:31] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 38 ESP OK [12:07:31] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 38 ESP OK [12:07:32] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [12:07:32] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 20 ESP OK [12:07:32] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK [12:07:32] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 28 ESP OK [12:07:32] RECOVERY - 
IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [12:07:32] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [12:07:33] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [12:07:33] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [12:07:34] RECOVERY - IPsec on cp3020 is OK: Strongswan OK - 20 ESP OK [12:07:34] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 28 ESP OK [12:07:51] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 28 ESP OK [12:07:53] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 20 ESP OK [12:07:53] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK [12:08:00] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 28 ESP OK [12:08:01] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 38 ESP OK [12:08:01] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 20 ESP OK [12:08:01] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK [12:08:02] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 28 ESP OK [12:08:02] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 28 ESP OK [12:08:02] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 28 ESP OK [12:08:10] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 28 ESP OK [12:08:10] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [12:08:10] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [12:08:10] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 28 ESP OK [12:08:21] RECOVERY - IPsec on cp3019 is OK: Strongswan OK - 20 ESP OK [12:08:21] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [12:08:22] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK [12:08:22] --^ wrong fstab on kafka1018, it works now.. really weird [12:09:49] !sal [12:09:49] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [12:10:26] 6Operations, 10netops: mr1-ulsfo booted from backup JunOS - https://phabricator.wikimedia.org/T127295#2038851 (10faidon) 5Open>3Resolved I did the following: - `request system snapshot media internal slice alternate` to reformat the primary partition in case it was corrupted - `request system reboot`, whic... [12:12:59] (03PS1) 10Jcrespo: Allow heartbeat table to replicate to dbstore* hosts [puppet] - 10https://gerrit.wikimedia.org/r/271495 [12:16:17] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#1998563 (10fgiunchedi) for the sake of normalization, at the end of the current expansion we'll have: eqiad: 9x machines / 64GB ram / 2x processors / 5x 1TB... [12:18:09] 6Operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1502780 (10hashar) Might cause T127294. Calls to https://integration.wikimedia.org/ci/api/json ends up being corrupted somehow and that started happening when commits for t... [12:21:05] (03CR) 10Jcrespo: [C: 031] "The open source command line utility can monitor, but not manage the controllers/disks." 
[puppet] - 10https://gerrit.wikimedia.org/r/267262 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [12:21:22] (03PS1) 10MarcoAurelio: Enabling translation notifications at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271496 (https://phabricator.wikimedia.org/T126901) [12:21:26] !log expand raid0 on restbase1008 to sdd and sde [12:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:22] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [12:24:31] !log increase stripe_cache_size to 32470 on restbase1008 [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:26:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [12:26:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [12:27:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [12:27:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [12:29:21] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:32] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:04] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic: ERROR nodepool.NodePool: Unable to check status of gallium.wikimedia.org - https://phabricator.wikimedia.org/T127294#2038934 (10hashar) I noticed that after some X time, the request pass just fine but subsequent ones all fail. I gave it a try... [12:30:15] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic: ERROR nodepool.NodePool: Unable to check status of gallium.wikimedia.org - https://phabricator.wikimedia.org/T127294#2038943 (10hashar) [12:30:50] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic: https://integration.wikimedia.org/ci/api/json is corrupted when required more than one time in a raw - https://phabricator.wikimedia.org/T127294#2038534 (10hashar) [12:34:05] https://phabricator.wikimedia.org/maniphest/task/create/ 404??? [12:35:06] Danny_B: You have to go to https://phabricator.wikimedia.org/maniphest/task/edit/form/1/ now. [12:35:33] Danny_B: Phabricator was updated earlier today with some big changes. [12:37:08] yeah, pretty much all my search queries got lost/broken [12:38:51] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:39:39] (03CR) 10Paladox: [C: 031] phabricator: redirect old task creation URL to new one [puppet] - 10https://gerrit.wikimedia.org/r/271485 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [12:40:27] !log decrease raid min_speed to 10000 on restbase1008 [12:40:30] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:08] !log rebooted kafka1020 for kernel upgrade. [12:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:55] any of you planning on attending DevopsDays Amsterdam (http://www.devopsdays.org/events/2016-amsterdam/)? [12:44:52] PROBLEM - DPKG on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
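Once kafka1018 is reachable again the under-replicated-partition alerts above clear on their own; had the broker stayed down, the usual follow-up is to check replication state and rebalance leadership with the stock Kafka tooling. A minimal sketch, assuming the upstream CLI scripts are on the path and using a placeholder ZooKeeper connect string rather than the cluster's real one:

    # placeholder connect string -- substitute the analytics cluster's real ZooKeeper hosts/chroot
    ZK=zk1001.eqiad.wmnet:2181/kafka
    # which partitions are still missing replicas?
    kafka-topics.sh --zookeeper "$ZK" --describe --under-replicated-partitions
    # once the broker is back, move partition leadership back to the preferred replicas
    kafka-preferred-replica-election.sh --zookeeper "$ZK"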
[12:45:10] PROBLEM - puppet last run on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:11] PROBLEM - salt-minion processes on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:11] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:21] PROBLEM - SSH on analytics1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:50] PROBLEM - configured eth on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:02] PROBLEM - RAID on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:10] PROBLEM - Check size of conntrack table on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:21] PROBLEM - dhclient process on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:30] joal --^ :( [12:46:49] (03PS1) 10Hoo man: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271501 [12:46:55] hmm elukey [12:47:08] elukey: What is that the symptom of ? [12:47:37] (03CR) 10Hoo man: [C: 032] Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271501 (owner: 10Hoo man) [12:47:38] I guess it is due to the fact that it is extremely overloaded [12:47:55] I think so too elukey u [12:48:02] I don't know what to do though :( [12:48:07] (03Merged) 10jenkins-bot: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271501 (owner: 10Hoo man) [12:48:35] huh phab returns "Our servers are currently experiencing a technical problem. " [12:49:22] joal: I'll finish rebooting the last kafka broker and then I'll try to check with your assistance, hue and oozie are having a party in there :) [12:49:45] elukey: yeah, I know that [12:50:24] !log hoo@tin Synchronized wmf-config/Wikibase.php: Bump $wgCacheEpoch for Wikidata (duration: 01m 54s) [12:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:04] elukey: Can't even ssh to an1027 [12:51:24] yeah I suspected that [12:51:29] I'll try with the console [12:51:59] !log decrease raid min_speed to 8000 on restbase1008 [12:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:53] <_joe_> Danny_B: worksforme? 
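The restbase1008 !log entries above (grow the array onto sdd/sde, raise stripe_cache_size, lower the sync speed limits) are ordinary md tuning. A rough sketch of what that looks like; device names and values are illustrative only, and --grow behaviour differs between mdadm versions:

    # add the new disks and reshape the existing array across them
    mdadm --grow /dev/md2 --raid-devices=5 --add /dev/sdd /dev/sde
    # a larger stripe cache speeds the reshape up, at the cost of some RAM
    echo 32768 > /sys/block/md2/md/stripe_cache_size
    # throttle the resync (values in KB/s) so the host stays responsive while it rebuilds
    echo 8000  > /sys/block/md2/md/sync_speed_min
    echo 20000 > /sys/block/md2/md/sync_speed_max
    cat /proc/mdstat   # watch progress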
[12:55:25] !log rebooted kafka1022.eqiad.wmnet for kernel upgrade [12:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:41] RECOVERY - DPKG on analytics1027 is OK: All packages OK [12:56:00] RECOVERY - salt-minion processes on analytics1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:56:00] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures [12:56:01] RECOVERY - Disk space on analytics1027 is OK: DISK OK [12:56:10] RECOVERY - SSH on analytics1027 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:56:32] RECOVERY - configured eth on analytics1027 is OK: OK - interfaces up [12:56:51] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:57:00] RECOVERY - Check size of conntrack table on analytics1027 is OK: OK: nf_conntrack is 0 % full [12:57:11] RECOVERY - dhclient process on analytics1027 is OK: PROCS OK: 0 processes with command name dhclient [12:59:10] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:13] (03PS1) 10Giuseppe Lavagetto: logrotate: explicit ownership and permissions [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/271505 (https://phabricator.wikimedia.org/T127025) [13:00:26] <_joe_> volans: ^^ [13:00:32] <_joe_> when you have time [13:00:52] _joe_: in the meantime... yes [13:01:01] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:01:32] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 4 failures [13:01:36] <_joe_> it was down for more than one request? [13:02:56] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2039156 (10Johan) [13:03:48] 6Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 6 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2039160 (10Amire80) Is there anything left to do in ContentTranslation or cxserv... [13:05:11] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:13:10] !log decreased raid md2 sync_speed_max to 6000 on restbase1008 [13:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:01] PROBLEM - puppet last run on cp3034 is CRITICAL: Timeout while attempting connection [13:17:27] 6Operations, 6Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10hashar) [13:17:44] 6Operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1979046 (10Gehel) To answer @Naveenpf, to the best of my knowledge, there is no budget issue in the sense of "we do not have enough money for this, let's kill the project"... [13:20:12] (03CR) 10Volans: [C: 031] "Change looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/271495 (owner: 10Jcrespo) [13:20:56] 6Operations, 6WMF-Communications, 10Wikimedia-Blog: Redirect blog.wikipedia.org to the Wikimedia Blog - https://phabricator.wikimedia.org/T119888#2039241 (10Peachey88) [13:21:41] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [13:22:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:22:51] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures [13:22:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:33:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:34:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:41:02] 6Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 6 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2039294 (10BBlack) >>! In T110474#2039160, @Amire80 wrote: > Is there anything l... [13:46:41] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:49:11] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:17] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578478 (10Amire80) [13:50:58] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2039324 (10Amire80) >>! In T110474#2039294, @BBlack wrote: >>>! In T110474#2039160, @Amire80 wrote: >> Is there anything left to do in ContentTranslat... [13:55:06] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2039345 (10akosiaris) 5Open>3Resolved a:3akosiaris >>! In T126379#2035594, @MBinder_WMF wrote: > It's an etherpad miracle. :) That was awesome, thanks folks. Thank htt... [13:59:15] 6Operations, 10CirrusSearch, 6Discovery, 7Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2039351 (10Gehel) As far as I can see (grep in operations/puppet) we are not using either stud nor hitch. I'd prefer not to add a new dependency if w... [14:05:50] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2039362 (10fgiunchedi) >>! In T119935#2035934, @fgiunchedi wrote: > plan: > * as soon as the rebuild has finished (~6h) remount `/srv` and restart cassa... 
[14:06:02] !log restarting apache on gallium (integration) [14:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:16] 6Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2039380 (10fgiunchedi) a:5fgiunchedi>3None not sure why this got self-assigned, anyways up for grabs [14:14:38] (03PS6) 10Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) [14:16:24] (03CR) 10jenkins-bot: [V: 04-1] navtiming: Improve parse_ua and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: 10Krinkle) [14:17:52] (03PS7) 10Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) [14:17:54] (03CR) 10Krinkle: navtiming: Improve parse_ua and add unit tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: 10Krinkle) [14:19:18] (03CR) 10jenkins-bot: [V: 04-1] navtiming: Improve parse_ua and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: 10Krinkle) [14:20:44] (03PS8) 10Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - 10https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) [14:22:53] 6Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2039407 (10mark) Perhaps we can use one of restbase1001-1006 for this? @RobH: can you see what's needed to move this forward? [14:26:04] (03PS3) 10Krinkle: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [14:26:17] (03PS4) 10Krinkle: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [14:26:32] (03CR) 10Krinkle: [C: 04-1] "See comments on PS2." [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [14:31:00] 6Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2039428 (10fgiunchedi) p:5Triage>3Low [14:31:38] 6Operations, 6WMF-Communications, 10Wikimedia-Blog: Redirect blog.wikipedia.org to the Wikimedia Blog - https://phabricator.wikimedia.org/T119888#2039435 (10jrbs) 5Open>3declined a:3jrbs Seems pretty obvious to me that this isn't a supported course of action. Closing as declined for historical reference. [14:34:10] error 503 [14:38:11] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [14:41:40] Vito: ? [14:41:43] 6Operations, 10Pybal, 10Traffic: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#2039471 (10mark) This was added recently in https://gerrit.wikimedia.org/r/#/c/267008/ [14:42:13] bblack: ^ dunno if you want to resolve that task now, and/or when it's actually deployed everywhere [14:42:24] deployed! :) [14:42:27] bblack: some random error 503 being serviced by Amsterdam [14:42:40] Vito: do you have more details? 
[14:42:53] ok then I'll resolve now :) [14:43:11] oh I already refreshed the relevant tab :| [14:43:23] 6Operations, 10Pybal, 10Traffic: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#2039473 (10mark) 5Open>3Resolved a:3mark And according to @bblack this is already deployed, so resolving. :) [14:43:26] works back but I no longer have details [14:43:49] mark: :P [14:46:12] 6Operations, 10Pybal, 10Traffic: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#2039498 (10BBlack) 5Resolved>3Open No, it's not :P [14:46:28] ... [14:47:03] deployed! == your option of "when it's actually deployed everywhere" [14:47:18] as opposed to "now!" [14:47:30] yay ambiguity [14:50:01] 6Operations, 10Swift, 5Patch-For-Review: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#2039518 (10fgiunchedi) 5Open>3Resolved ms-be2016 -> ms-be2021 now to weight 3500, resolving [14:53:24] 6Operations: Clean up some accidental restbase metrics - https://phabricator.wikimedia.org/T120870#2039546 (10fgiunchedi) 5Open>3Resolved done [14:58:21] 6Operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#2039557 (10fgiunchedi) 5Open>3Resolved resolving this as we have consensus with `systemctl mask` [14:59:04] <_joe_> mark: there was a problem with udp probes for ipv6 endpoints [14:59:14] <_joe_> and I have had no time to look into it more [15:03:22] 6Operations: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#2039575 (10fgiunchedi) p:5Triage>3Normal [15:03:52] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:08:56] 6Operations, 10Salt: salt cmd.run reports an empty dictionary instead of empty string sometimes - https://phabricator.wikimedia.org/T113217#2039585 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi from new salt master neodymium I don't seem to be able to reproduce anymore, resolving for now [15:09:48] _joe_: ok [15:10:31] !log restarting hadoop services on analytics103* hosts for security upgrades [15:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:50] (03CR) 10Hashar: [C: 031] "I have never used that script nor am I familiar with it. Nonetheless Kunal is already using it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/269328 (owner: 10Legoktm) [15:18:30] 6Operations, 10Monitoring: Add RAID monitoring for Cisco servers - https://phabricator.wikimedia.org/T85529#2039643 (10fgiunchedi) 5stalled>3Invalid we're going to decommission ciscos anyways [15:23:09] mark: I'm assuming https://phabricator.wikimedia.org/T84524 is confirmed and working and can be closed ? 
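The systemctl mask consensus referenced in the cassandra start/stop task above is simple enough to spell out: masking lets puppet and the package install and configure everything without anything being able to start the service until an operator unmasks it.

    systemctl mask cassandra      # links the unit to /dev/null; nothing can start it now
    systemctl status cassandra    # reports the unit as "masked"
    # later, when the node is actually ready to join the cluster:
    systemctl unmask cassandra
    systemctl start cassandra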
[15:25:16] 6Operations, 5Patch-For-Review: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992#2039656 (10fgiunchedi) [15:25:18] 6Operations, 7Graphite, 5Patch-For-Review: use graphite1002 to test dm-cache - https://phabricator.wikimedia.org/T88994#2039654 (10fgiunchedi) 5Open>3stalled I've since disabled dm-cache on graphite1002 after causing some kernel panics under load, stalling pending another test [15:29:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:29:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:29:40] (03CR) 10Ottomata: [C: 032 V: 032] Adding debian/ dir from Ubuntu source package and releasing 1.1~b6 [debs/python-gevent] (debian) - 10https://gerrit.wikimedia.org/r/271306 (https://phabricator.wikimedia.org/T126075) (owner: 10Ottomata) [15:30:01] (03CR) 10Ottomata: [C: 032 V: 032] Add python-gevent (>= 1.1b6) to Depends [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/271308 (https://phabricator.wikimedia.org/T126075) (owner: 10Ottomata) [15:31:56] godog: uh, not sure until I check what the situation is [15:32:01] i haven't had any issues anyway ;) [15:32:34] " Esams HTTP 5xx reqs/min" [15:32:56] i noticed that btw. had requests failing here in .nl [15:33:04] in case that is not known yet [15:33:57] !log restarting apache on silver/wikitech [15:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:34] <_joe_> thedj: it was a temporary issue; see https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [15:36:47] <_joe_> thedj: or, are you continuing to have issues? [15:37:08] 6Operations, 10ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2039735 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson No Problems! Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Onli... [15:37:11] nope no issues right now [15:37:11] !log restarting hadoop services on analytics102* nodes for security update [15:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:38:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:39:02] (03PS6) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) [15:42:02] 6Operations, 10ops-eqiad: Failed drive in labstore1001 array - https://phabricator.wikimedia.org/T127076#2039779 (10chasemp) hey @cmjohnson I don't see anything in dmesg, and I don't see the device in /dev. Can you reinsert, check drive LED's, etc? Maybe we can try another drive? Hopefully this isn't a backp... [15:42:08] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2039783 (10mark) Let's go ahead, considering these only just expired and we don't have budget for them otherwise. Although we try to avoid that, considering the use,... [15:43:06] (03PS1) 10Cmjohnson: Removing user akumar and mnoushad from bastion only access and penetration tests group until T126012 is resolved during operations meeting. 
[puppet] - 10https://gerrit.wikimedia.org/r/271534 [15:43:34] 6Operations: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#2039799 (10fgiunchedi) still some machines with spare `sdb`, though some might be slated for decommission/replacement ``` root@neodymium:~# salt -b 50 -t 20 --output=raw mw* cmd.run 'grep -q sdb /etc/fstab || grep... [15:43:37] (03PS7) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) [15:44:04] !log restarting hadoop services on analytics104* nodes for security updates [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:24] (03CR) 10Cmjohnson: [C: 032] Removing user akumar and mnoushad from bastion only access and penetration tests group until T126012 is resolved during operations meeting. [puppet] - 10https://gerrit.wikimedia.org/r/271534 (owner: 10Cmjohnson) [15:48:25] (03PS2) 10Andrew Bogott: Move the labs pdns db server into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/271465 [15:48:27] (03PS1) 10Andrew Bogott: Glance policy: Add the 'glanceadmin' role to allow image manipulation [puppet] - 10https://gerrit.wikimedia.org/r/271536 (https://phabricator.wikimedia.org/T127310) [15:50:17] (03CR) 10Nuria: [C: 031] admin: add nuria/mforns/milimetric to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/271479 (https://phabricator.wikimedia.org/T126752) (owner: 10Filippo Giunchedi) [15:51:34] (03PS2) 10Andrew Bogott: Glance policy: Add the 'glanceadmin' role to allow image manipulation [puppet] - 10https://gerrit.wikimedia.org/r/271536 (https://phabricator.wikimedia.org/T127310) [15:52:49] !log creating adywiki indices in codfw [15:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:41] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:17] (03PS3) 10Filippo Giunchedi: reprepro: add HP's MCP repository to updates [puppet] - 10https://gerrit.wikimedia.org/r/267262 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [15:56:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] reprepro: add HP's MCP repository to updates [puppet] - 10https://gerrit.wikimedia.org/r/267262 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T1600). 
[16:00:46] (03CR) 10Andrew Bogott: [C: 032] Glance policy: Add the 'glanceadmin' role to allow image manipulation [puppet] - 10https://gerrit.wikimedia.org/r/271536 (https://phabricator.wikimedia.org/T127310) (owner: 10Andrew Bogott) [16:01:12] (03PS3) 10Andrew Bogott: Glance policy: Add the 'glanceadmin' role to allow image manipulation [puppet] - 10https://gerrit.wikimedia.org/r/271536 (https://phabricator.wikimedia.org/T127310) [16:03:10] (03CR) 10Filippo Giunchedi: [C: 031] "FYI, there's a swiftrepl running already for global-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz) [16:04:48] (03PS8) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) [16:09:54] (03CR) 10Giuseppe Lavagetto: "seems safe, see https://puppet-compiler.wmflabs.org/1804/" [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [16:12:47] 6Operations, 10ops-eqiad, 10Incident-Labs-NFS-20151216, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#2039935 (10chasemp) [16:13:09] (03PS1) 10Alexandros Kosiaris: otrs: Provision mpm_prefork.conf [puppet] - 10https://gerrit.wikimedia.org/r/271543 [16:14:22] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2039952 (10mark) I'll review the new layout in more detail a bit, but overall I think we can proceed with getting hardware quotes in the mean time. @RobH: can you create a procurement ticket out of th... [16:14:24] We have an issue with a maintenance cron deployed on 2 machines (T127322). Quick look indicates that we manage those maintenance script as includes in role::mediawiki::maintenance, which means that the code is executed only if the machine has the maintenance role. So no cleanup possible for machine which are not in the role [16:14:24] T127322: Completion Suggester: make sure that the cronjob mediawiki::maintenance::cirrussearch is deployed to only one machine (terbium) - https://phabricator.wikimedia.org/T127322 [16:14:48] How would you usually go for this kind of cleanup? [16:16:03] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2039959 (10mark) Old RESTbase machines indeed seem good for these, but won't be available for use for realistically another 2 weeks at least. [16:16:06] gehel: I'd say either manual or if that is impractical/messy, ensure => absent [16:16:42] <_joe_> gehel: which machine has that cronjob? [16:16:45] ensure => absent would be nice, but as the class is NOT included, I don't really know where to ensure [16:16:56] mw1152 [16:17:13] <_joe_> gehel: what crons are still active there? [16:17:39] 2 cirrus search maintenance crons [16:17:49] <_joe_> gehel: btw, that is the "test reimage" mediawiki server, so anything that has to do with maintenance comes from puppet. 
[16:17:51] cirrus_build_completion_indices_eqiad & cirrus_build_completion_indices_codfw [16:18:08] <_joe_> if the role is not applied there anymore, we have removed it (and I don't remember doing it) [16:18:21] <_joe_> gehel: anyways, rm(1) is a good bet in these cases [16:18:28] (03PS2) 10Filippo Giunchedi: admin: add nuria/mforns/milimetric to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/271479 (https://phabricator.wikimedia.org/T126752) [16:18:32] (03PS4) 10Thcipriani: Beta: Move deployment server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) [16:18:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add nuria/mforns/milimetric to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/271479 (https://phabricator.wikimedia.org/T126752) (owner: 10Filippo Giunchedi) [16:19:20] removing the role will not remove the cron. So manual cleanup is the usual way to go (and I should not try to fix puppet code) as it happens not very often ? [16:19:44] 6Operations, 10Ops-Access-Requests, 6Analytics-Kanban, 5Patch-For-Review: All members of analytics team need to have sudo -u hdfs on cluster {hawk} [2 pts] - https://phabricator.wikimedia.org/T126752#2039966 (10fgiunchedi) 5Open>3Resolved merged, access will be granted at the next puppet run, please re... [16:19:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/271505 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [16:19:50] <_joe_> gehel: I think that our general tendency is [16:19:57] <_joe_> if you change role to a server, reimage it [16:20:21] <_joe_> there is really no point in absenting resources if you are not evolving something [16:20:35] <_joe_> having said that, look at the hiera data for mw1152 [16:20:50] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2039974 (10GWicke) > widen hint window by a safe margin e.g. 30h With 2.1, this might push it a bit. There are optimizations around hints in [3.0](http... [16:20:53] ottomata: so what's the verdict with pykafka? [16:20:59] the verdict? [16:21:00] <_joe_> hieradata/hosts/mw1152.yaml vs hieradata/hosts/terbium.yaml [16:21:01] oh [16:21:06] for debian? [16:21:12] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:21:14] _joe_: sounds reasonable ... [16:21:29] <_joe_> gehel: all the enable/disable of cronjobs is done there [16:22:12] _joe_: yep, just seen that. I'll fix it there... 
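Both halves of the fix being discussed here are small: the leftover crons can be removed once by hand (or via puppet's own resource abstraction), and the per-host toggle lives in the host hieradata. Everything below is illustrative; in particular the cron user and the hiera key name are guesses, not the real names in ops/puppet:

    # one-off cleanup on mw1152: find and drop the stale entries
    crontab -u www-data -l | grep -i cirrus
    puppet resource cron cirrus_build_completion_indices_eqiad user=www-data ensure=absent
    puppet resource cron cirrus_build_completion_indices_codfw user=www-data ensure=absent

    # and the host-level hiera override _joe_ points at, so puppet agrees with reality
    cat >> hieradata/hosts/mw1152.yaml <<'EOF'
    # keep the CirrusSearch completion-suggester crons off the test-reimage host
    mediawiki::maintenance::cirrussearch::enabled: false
    EOF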
[16:22:28] (03CR) 10Thcipriani: [C: 031] "This is a no-op in prod, and is already cherry-picked in beta: https://puppet-compiler.wmflabs.org/1807/" [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [16:22:28] <_joe_> gehel: also, it's most probably my fault :P [16:22:35] (03PS3) 10Mobrovac: RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 [16:23:09] _joe_: honestly, I don't really care who's fault it is, but I can git blame if you want to be sure ;-) [16:23:10] (03PS9) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) [16:23:50] paravoid: afaict, python-kafka's consumer does look like it would work for us, but it is built for kafka 0.9 [16:23:54] <_joe_> oh I'm pretty sure without the git blame, no one dares touching those servers with a 10-ft pole besides me [16:23:57] so, we'll be using pykafka at least until we upgrade [16:24:45] i've just built a new pykafka deb, but annoyingly they've increased their python-gevent dependency to a more recent (beta?) release that is not in debian or ubuntu [16:24:47] so I had to build that dep too [16:25:00] which would make actually getting a python-pykafka package in debian more difficult, i think [16:25:21] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [16:28:19] (03PS1) 10Gehel: Disable cirrus search crons on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/271548 (https://phabricator.wikimedia.org/T127322) [16:29:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Disable cirrus search crons on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/271548 (https://phabricator.wikimedia.org/T127322) (owner: 10Gehel) [16:29:36] <_joe_> gehel: commit and merge whenever you like, but puppet is disabled on mw1* hosts for a few minutes [16:29:47] (03CR) 10DCausse: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/271548 (https://phabricator.wikimedia.org/T127322) (owner: 10Gehel) [16:30:30] _joe_: so no need to wait for puppet swat? [16:30:32] (03PS5) 10Thcipriani: Beta: Move deployment server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) [16:30:51] <_joe_> gehel: not in your case, you have root right? [16:31:14] <_joe_> so the idea of puppetswat is for non-ops to submit patches for ops to consider and deploy at a fixed time [16:31:32] yep, I'm root. [16:31:49] 6Operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1979046 (10Deskana) >>! In T125126#2039229, @Gehel wrote: > To answer @Naveenpf, to the best of my knowledge, there is no budget issue in the sense of "we do not have enou... [16:31:51] <_joe_> since you don't qualify as "non-ops", you have +2 in gerrit (I hope), so you can merge and deploy changes [16:32:13] _joe_: first time, so can you check me? It is as follow: [16:32:21] PROBLEM - nutcracker process on mw2132 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [16:32:25] <_joe_> yes [16:32:31] <_joe_> gehel: query :) [16:32:31] PROBLEM - nutcracker port on mw2132 is CRITICAL: Connection refused [16:32:38] gehel got his ldap/ops access sorted out... yesterday, I think. 
So should be able to +2 in puppet now [16:32:44] <_joe_> uhm wait a sec [16:33:00] theoretically [16:34:19] so, (1) I'll +2, merge is automatic, code end up in "production" branch (2) I run "puppet [...] --noop" on mw1152 (3) if all is good I'll do the actual run (4) log all that [16:34:23] correct? [16:34:32] (03PS2) 10Andrew Bogott: vagrant: Add upstart script to start container on boot [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [16:34:41] <_joe_> wtf strontium [16:35:40] (03CR) 10Gehel: [C: 032] "Looks simple enough, @_joe_ is helping me to actually follow through." [puppet] - 10https://gerrit.wikimedia.org/r/271548 (https://phabricator.wikimedia.org/T127322) (owner: 10Gehel) [16:36:01] RECOVERY - nutcracker process on mw2132 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:36:12] RECOVERY - nutcracker port on mw2132 is OK: TCP OK - 0.000 second response time on port 11212 [16:36:27] (03CR) 10Andrew Bogott: [C: 032] vagrant: Add upstart script to start container on boot [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [16:36:49] <_joe_> gehel: puppet is ff-only, so you need to rebase your change [16:37:05] <_joe_> as apparently you gave it +2 and not "publish and submit [16:37:17] <_joe_> so andrewbogott merge-sniped you [16:37:53] damn, already need to rebase? [16:38:30] <_joe_> yeah [16:38:48] (03PS2) 10Gehel: Disable cirrus search crons on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/271548 (https://phabricator.wikimedia.org/T127322) [16:38:52] <_joe_> what andrewbogott did to you is considered like stealing a parking spot in rome :P [16:39:18] <_joe_> (let's see when andrew notices my obnoxious pings :D) [16:39:47] _joe_: in Rome this would actually be dagerous for your physical integrity, no? Here danger is much less... [16:40:05] <_joe_> well, vengeance is best served cold, right? [16:40:09] gehel: nah we have an ops offsite [16:40:13] joe can just wait. [16:40:31] gehel: puppet is one of the weird repositories in which jenkins doesn't merge for you [16:40:34] ok, merged [16:41:49] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2040088 (10jcrespo) 5Open>3Resolved [16:44:23] (03CR) 10BBlack: [C: 031] RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [16:45:24] s/the wierd/the awesome/ [16:45:38] hrmm, why the hell cant i edit the #hardware-requests project when it specifically says i can in the edit rules. it also says This enables editing of project policies and is restricted to members of #acl*phabricator along the top though [16:45:51] it seems odd that one project overwrites the edit rights and cannot seem to be set? [16:46:07] * robh is logged in as his user, and also as ops admin user in another browser to compare [16:46:19] robh: come on over to -devtools :) [16:49:22] (03CR) 10BryanDavis: "Verified on mwv-image-builder.mediawiki-vagrant.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) (owner: 10BryanDavis) [16:49:37] !log removing cirrus maintenance crons from mw1152 (T127322) [16:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:39] <_joe_> bbiab [16:53:54] hey moritzm, did you intend to have puppet swat today and in a week? 
[16:54:01] or is that a typo? [16:54:07] !log restarting hadoop services on analytics105* nodes for security updates [16:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0] [16:58:58] (03CR) 10Glaisher: "Probably okay now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [17:00:04] moritzm mutante: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T1700). [17:00:04] mobrovac thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:41] o/ [17:01:06] hm so I don't know who is here but I'll handle puppet swat unless one of them turns up [17:01:11] I thought it was my week anyways [17:01:17] mobrovac: around? [17:02:30] yup apergos [17:02:46] here too [17:03:01] I've looked at the patch, I have two questions, what purge host is used for labs instances that aren't in deployment-prep, with your patch? [17:03:09] if one were to be set up [17:03:30] apergos: that'd need to be configured on a per-project bases in hiera [17:03:48] with no configuration what happens? I mean if someone leaves it out by accident [17:03:53] apergos: by def, the main vhtcpd host in prod is used, but that wouldn't resolve to anything outside of prod [17:04:01] !log updating completion suggester indices in eqiad [17:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:11] well it's an IP for prod though [17:04:24] 6Operations, 10ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2040168 (10fgiunchedi) [17:04:24] bblack, thanks for looking [17:04:26] 6Operations, 10ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2040168 (10fgiunchedi) [17:04:27] yup, but you wouldn't be able to reach it [17:04:31] it would still "work", but I don't think those forward from labs subnets to prod [17:04:39] nope [17:04:41] (it's multicast) [17:05:09] so do we want to make an explicit undef so that the hapless user will get told by puppet? or not? [17:05:18] I think the labs caches do listen on multicast, though, don't they? it just may not work at all there [17:05:46] bblack: i think apergos is concerned for other labs projetcs, not BC [17:06:12] yeah I'm just wondering if multicast works at all in labs [17:06:17] I'm primarily wanting to make sure 1) nothing bad happens to the prod hosts receiving the purges and 2) labs users aren't blindsided [17:06:30] ok well not working at all didn't even cross my mind. urg [17:06:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:06:33] i can make localhost the def and then change the ip in hiera for prod, but afaik the usual way to do it in ops/puppet is to put the default to point to prod (all other vars do that) [17:06:35] it's not necessarily a great idea to be sending to a well-known multicast addr by default [17:06:53] I don't think it breaks anything, though [17:07:21] if ourging fails in labs we're no worse off than we are now, I guess? 
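Whether purging actually works in beta is easy to sanity-check from the outside by watching object headers on one of the affected URLs; the hostname and page title here are only examples:

    URL='http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/mobile-sections-lead/Spain'
    curl -sI "$URL" | grep -iE '^(age|x-cache)'   # warm the cache and note the Age
    # trigger an edit / re-render so RESTBase emits its purge, then repeat:
    curl -sI "$URL" | grep -iE '^(age|x-cache)'   # Age should reset, x-cache should show a miss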
[17:07:25] (but it should be tested) [17:07:29] basically [17:07:31] *purging [17:08:01] my other question is, from reading the manifest I don't know how you determine for which events a purge gets sent [17:08:05] unless by some miracle (a) multicast does work all across labs instances and (b) the beta-cluster caches do effectively listen on the prod multicast now and thus (c) random labs instances purge beta-cluster caches [17:08:06] *manifests [17:08:31] that would be a hell of a miracle [17:09:31] so, how can I verify for which events purging will happen? [17:09:33] I think the beta caches do listen for the multicast, I just don't think it actually works (multicast in general) in the labs instances networks between them [17:09:42] right [17:09:42] apergos: whenever something cacheable is updated in storage, resource change events containing affcted external urls are emitted to an internal event handler; this will eventually produce to a kafka topic, but for now it directly purges those via htcp [17:09:42] apergos: that's defined in a config in the source repo [17:10:28] got a pointer? [17:10:42] this is just me doublechecking before we flip the switch [17:11:06] apergos: https://github.com/wikimedia/restbase/blob/master/sys/mobileapps.yaml#L54-L64 is the only place for now [17:11:28] ok [17:11:31] thubms up for that ten [17:11:47] so I'm ok with merging this (unless you say otherwise bblack) but with this caveat [17:12:06] I'd like after this for you to arrange for testing in labs in beta [17:12:17] (03PS2) 10Krinkle: mediawiki: Resolve docroot symlink used by bits /w/extension [puppet] - 10https://gerrit.wikimedia.org/r/270786 [17:12:18] and if it's not working, open a task and link this patchset [17:12:24] purge rates for this are on the order of 20/s [17:12:26] so we know that there's more going on... ok? [17:12:34] apergos: i tested the patch in beta already [17:12:41] and purging worked? [17:12:54] but yeah, bblack should definitely double check [17:12:55] (03PS1) 10BBlack: cache_misc: disable do_gzip [puppet] - 10https://gerrit.wikimedia.org/r/271559 (https://phabricator.wikimedia.org/T127294) [17:12:58] apergos: yes [17:13:03] huh, ok [17:13:19] (03PS1) 10Volans: mariadb: Add new es2011-2019 servers [puppet] - 10https://gerrit.wikimedia.org/r/271560 (https://phabricator.wikimedia.org/T127330) [17:13:21] ok. 20/sec, bblack what do you think? [17:13:29] it's reasonable all things considered [17:13:36] ok, gonna merge then [17:13:38] (for prod) [17:13:47] yep [17:14:04] (03PS4) 10ArielGlenn: RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [17:14:26] (03CR) 10ArielGlenn: "https://github.com/wikimedia/restbase/blob/master/sys/mobileapps.yaml#L54-L64 controls which events get purges atm" [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [17:14:28] apergos, bblack: mobileapps request rates are the first graph in https://grafana-admin.wikimedia.org/dashboard/db/mobileapps [17:14:38] each of those purges three URLs [17:15:20] three [17:15:25] I see [17:15:56] so really 60 a sec we are talking about? 
[17:16:05] 60 URLs / s, yes [17:16:11] (03PS2) 10Krinkle: mediawiki: Remove outdated bits config for /static/current fonts [puppet] - 10https://gerrit.wikimedia.org/r/270787 [17:16:17] the htcp client might batch, I forget [17:16:40] https://github.com/wikimedia/htcp-purge [17:16:51] (03CR) 10Filippo Giunchedi: "compiler changes for labnet hosts, https://puppet-compiler.wmflabs.org/1808/ will require bouncing dnsmasq I'm assuming" [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [17:17:07] commit message is slightly off then [17:17:08] ko [17:17:10] *ok [17:17:16] gwicke: apergos: bblack: the client does batch the reqs [17:17:26] batch? [17:17:44] how does that work? [17:17:48] it looks like one UDP packet per URL to me: https://github.com/wikimedia/htcp-purge/blob/master/index.js#L129 [17:17:55] yeah I think it has to be [17:18:10] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040281 (10RobH) @Ottomata: Can you advise if waiting for these is acceptable? The new restbase machines arrive next week, and we'll likely need to g... [17:18:29] I don't really understand the rates + multipliers of course [17:18:32] yeah sorry, it accepts multiple uris and sends off reqs in parallel, but it's one uri per packet [17:18:46] (03CR) 10ArielGlenn: "60 urls/sec acrually. 3 urls per hit and we're getting about 20/sec https://grafana-admin.wikimedia.org/dashboard/db/mobileapps" [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [17:19:01] come on jenkins [17:19:07] we're getting about 20/sec what? updates to actual wiki content pages? [17:19:18] (03CR) 10ArielGlenn: [C: 032] RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 (owner: 10Mobrovac) [17:19:47] hm actually that first graph is get not post [17:19:49] bblack: mobile content re-renders, which trigger those purges [17:19:56] oh re-renders [17:19:57] I see [17:20:08] !log disabled puppet on analytics1027 to avoid any Camus job to run [17:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:19] what's a re-render if it's not from actual text content changes? [17:20:19] so I'ma gonna jut puppet-merge this... 60/sec purges still ok right? [17:20:23] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040286 (10Ottomata) @RobH, we'll wait. I might move these services in the mean time to a spare old Analytics Dell. We won't be in such a time crunc... [17:20:48] apergos: it's small potatoes in the grand scheme of current purge rates, but [17:21:01] * apergos waits for the other shoe [17:21:38] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040290 (10RobH) Excellent, so with this info we can let @mark review and approve (since his only listed question was the waiting period until these a... [17:21:45] as with the craziness going on with MW/JobRunner purging right now, I have to ask: fundamentally, what's triggering a purge here? template/content updates? or something else? 
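The merge-and-run flow being used for this patch (and the one gehel walked through earlier for mw1152) has the same shape every time; the hostnames below reflect the setup at the time and should be treated as placeholders rather than gospel:

    # after +2/submit in gerrit, pull the change onto the puppetmasters and eyeball the diff
    ssh palladium.eqiad.wmnet sudo puppet-merge
    # dry-run on the target host first, then the real run, then !log it in this channel
    ssh restbase1002.eqiad.wmnet 'sudo puppet agent --test --noop'
    ssh restbase1002.eqiad.wmnet 'sudo puppet agent --test'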
[17:21:59] !log upgrading Cassandra to 2.1.13 on xenon.eqiad.wmnet (restbase staging) T126629 [17:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:10] +1 [17:22:16] template / content updates, *if* the original HTML actually changed in the update [17:22:49] this is one of the optimizations we implemented after realizing that many updates are actually not changing the content [17:22:50] so basically we're saying that across all wikis, we're seeing avg 20 articles/sec updated (including those updated due to template change) [17:23:14] yes [17:23:17] _joe_: would you have 20 minutes tomorrow afternoon (or next week) for a hangout? I'd like to get some face time and see what I can do for you ... [17:23:22] ok [17:23:30] what's the 3x URL variants? [17:23:50] different output formats? [17:23:50] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:54] running on restbase1002 now [17:24:05] out of about 80 update requests / s, the Parsoid HTML currently does not actually change for ~half of them [17:24:23] the mobile content is split in two parts - the lead section and the remainder (that's the way they load stuff), plus the third one which is an aggregate [17:24:24] "Parsoid re-parses without content change" in https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen [17:24:30] ok [17:24:34] is there a dashboard where I can watch the text purges come in? [17:24:40] apergos: nrunning what on rb1002? [17:24:49] puppet [17:24:52] change now live there [17:24:57] no, it's not [17:25:01] apergos: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [17:25:02] bblack: this is an optimization we could potentially also apply to MediaWiki's purges [17:25:10] the top graph there shows purge rate as one of several method-rates [17:25:14] if it were live, rb would have crashed right now :) [17:25:16] (limit to text cluster in dropdowns) [17:25:22] i'll apply this to staging now [17:25:28] I did not restart any services so I don't know if it rereads its config or anything [17:25:35] gwicke: s/could/should/, probably among many others [17:25:43] bblack: thanks [17:25:47] (03CR) 10Tjones: A/B/C test of control vs textcat vs accept-lang + textcat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [17:25:52] keep in mind the purge rate on that aggregate graph is counting purges on all caches... [17:26:39] right [17:26:40] !log Cassandra on xenon.eqiad.wmnet killed by kernel after Cassandra package upgrade (coincidence): [1482254.046078] Out of memory: Kill process 21854 (java) score 595 or sacrifice child [17:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:51] is it summing executed purges across all varnishes? 
[17:26:53] so there's 37x cache text, so if you dropdown-limit to only cache_text, a bump of 20/s equals a bum pof 740/s there [17:27:11] I see [17:27:35] !log Cassandra on xenon.eqiad.wmnet killed by kernel after Cassandra package upgrade (coincidence?): [1482254.046078] Out of memory: Kill process 21854 (java) score 595 or sacrifice child : T126629 [17:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:42] yepo I don't expect to be able to see it [17:27:45] that's what I want :-) [17:27:50] the multiplier is large if you're not narrowing to cache text, since multicast is sadly still shared [17:28:06] !log restbase deploying a42976cc82 to restbase1002 [17:28:07] I've checked it on the dropdown [17:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:42] ok, the change is live on rb1002 [17:28:58] great [17:29:02] well we're expecting 60/s right? so 2.2K/sec on the text aggregate graph [17:29:08] but still, it's within the usual noise [17:29:18] yes. from the one restbase [17:29:29] bblack: can you tail the log and see if there are purges from rb? [17:29:38] i can send a couple of reqs directly to rb1002 [17:30:11] bblack: we have a few more end points that we plan to enable purging for; we'll compile the expected rates for that & run it by you [17:30:18] <_joe_> gehel: for sure, tomorrow is fine [17:30:32] I'll send you a meeting ... [17:30:49] I can see them, yes [17:30:51] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: puppet fail [17:31:04] yeah yeah [17:31:08] (03PS1) 10Filippo Giunchedi: restbase: adjust alarms with new metric names [puppet] - 10https://gerrit.wikimedia.org/r/271561 [17:31:41] 25 RxURL c /api/rest_v1/page/mobile-sections/Template_talk%3AInfobox_religion [17:31:44] 25 RxHeader c Host: en.wikipedia.org [17:31:47] 25 TxStatus c 204 [17:31:49] 25 TxResponse c Cache miss [17:31:52] 25 RxURL c /api/rest_v1/page/mobile-sections-lead/Template_talk%3AInfobox_religion [17:31:55] 25 RxHeader c Host: en.wikipedia.org [17:31:57] 25 TxStatus c 204 [17:32:00] 25 TxResponse c Cache miss [17:32:02] 25 RxURL c /api/rest_v1/page/mobile-sections-remaining/Template_talk%3AInfobox_religion [17:32:05] 25 RxHeader c Host: en.wikipedia.org [17:32:08] 25 TxStatus c 204 [17:32:09] godog: urandom: fyi puppet is still disabled in staging (was for brotli testing), i think it's time to re-enable it [17:32:10] 25 TxResponse c Cache miss [17:32:13] ^ from a prod cache_text box [17:32:15] (those are PURGE reqs, not gets) [17:32:16] bblack: \o/ [17:32:22] great [17:32:29] (03CR) 10jenkins-bot: [V: 04-1] restbase: adjust alarms with new metric names [puppet] - 10https://gerrit.wikimedia.org/r/271561 (owner: 10Filippo Giunchedi) [17:32:55] mobrovac: yeah I think so too [17:32:59] (03CR) 10Jcrespo: [C: 031] "Looks good, there is actually a mistake on the (existing) es1, but that is unrelated to this patch (I have to check the current state of e" [puppet] - 10https://gerrit.wikimedia.org/r/271560 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:33:10] apergos: I'll take the next patch btw, ping me when done [17:33:41] (03PS2) 10Volans: mariadb: Add new es2011-2019 servers [puppet] - 10https://gerrit.wikimedia.org/r/271560 (https://phabricator.wikimedia.org/T127330) [17:34:31] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:34:42] we good here? 
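Putting the numbers from the discussion above together: each mobileapps re-render purges the three URL variants shown in the varnishlog excerpt, so roughly 20 re-renders/s becomes roughly 60 purge URLs/s from RESTBase, and the aggregate dashboard counts every purge once per cache_text host (37 of them), hence the expected ~2.2K/s bump. A quick sketch of both the URL expansion and the arithmetic; the title encoding matches the percent-encoded form seen above.

    from urllib.parse import quote

    def mobileapps_purge_paths(title):
        # The three URL variants purged per re-render, as seen in the
        # varnishlog excerpt above.
        t = quote(title.replace(' ', '_'), safe='')
        return [
            '/api/rest_v1/page/mobile-sections/%s' % t,
            '/api/rest_v1/page/mobile-sections-lead/%s' % t,
            '/api/rest_v1/page/mobile-sections-remaining/%s' % t,
        ]

    RERENDERS_PER_SEC = 20   # approximate mobileapps re-render rate
    TEXT_CACHES = 37         # cache_text hosts counted by the aggregate graph

    urls_per_sec = RERENDERS_PER_SEC * len(
        mobileapps_purge_paths('Template talk:Infobox religion'))
    print(urls_per_sec)                  # ~60 purge URLs/s from RESTBase
    print(urls_per_sec * TEXT_CACHES)    # ~2220/s on the all-text-caches graph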
actually I have one more question, why is there the drop-off on the top graph in varnish-aggregate-client-status-codes anyways? [17:35:17] apergos: it's "live" raw data, and there's delays in receiving updates through the stats pipeline [17:35:20] apergos: graphs aren't always updated at the same time, so sums tend to show lower values in the very recent past [17:35:22] so the right edge of stats always trails off [17:35:35] so we won't really know for another 10-15 mins. ok [17:35:39] yeah [17:35:51] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2040369 (10RobH) a:5mark>3RobH [17:35:56] well I'm inclined to watch and call it good unless I see something horrid [17:36:03] and let go dog get the next patch [17:36:22] speak now or forever hold your peace [17:36:31] it's probably fine [17:36:43] godog: you're on [17:36:53] I can kinda-see what looks like maybe the purge bump from this, but in the huge rate + noise fluctation from MW/jobrunner presently, we'll never know for sure what that is [17:37:10] thcipriani: ready? [17:37:21] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet last ran 7 days ago [17:37:21] godog: yup [17:37:33] reminder: In ~30 minutes all labs instances are going to start rebooting in a totally unpredictable sequence. [17:37:38] that's me enabling puppet in staging ^^ [17:37:39] yeah and if it stays at that level then "who cares" too [17:37:51] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet last ran 7 days ago [17:38:21] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:38:31] (03PS2) 10BBlack: lvs: remove rt_cache_rebuild_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/271453 [17:38:39] (03CR) 10BBlack: [C: 032 V: 032] lvs: remove rt_cache_rebuild_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/271453 (owner: 10BBlack) [17:38:42] andrewbogott: do you envison any problems with me merging https://gerrit.wikimedia.org/r/#/c/270343/ now? the only relevant change seems to be https://puppet-compiler.wmflabs.org/1808/labnet1002.eqiad.wmnet/ for which I'll bounce dnsmasq [17:38:44] sneaking in two patch [17:38:51] heh [17:38:52] (03PS2) 10BBlack: cache_misc: disable do_gzip [puppet] - 10https://gerrit.wikimedia.org/r/271559 (https://phabricator.wikimedia.org/T127294) [17:39:03] apergos, bblack: thanks for your help in getting the first purges from RB set up! [17:39:03] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: disable do_gzip [puppet] - 10https://gerrit.wikimedia.org/r/271559 (https://phabricator.wikimedia.org/T127294) (owner: 10BBlack) [17:39:06] real sneaky announcing them like that in the channel [17:39:10] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:39:12] gwicke: happy to help [17:39:18] godog: looking [17:39:33] bblack: thanks for being the goto exprt on that [17:39:41] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:41:31] godog: I think dnsmasq will restart on its own… doesn’t nova-network depend on it? 
(That’s the service that manages dnsmasq in this case) [17:41:46] !log upgrading Cassandra to 2.1.13 on cerium.eqiad.wmnet (restbase staging) T126629 [17:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:23] andrewbogott: could depend on it yeah, the compiler afaics says it won't restart any services [17:44:07] godog: is deployment-bastion currently an alias to deployment-tin? [17:45:02] andrewbogott: no [17:45:29] andrewbogott: no, deployment-bastion is just gone at this point, everything moved to deployment-tin. [17:45:36] ok [17:46:19] ok, going to merge! [17:46:43] (03CR) 10Andrew Bogott: [C: 031] "No objections for me, although if you merge this right before/during the upcoming reboot window it might be hard to tell whether or not it" [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [17:46:59] godog: I really think you should wait until after the reboots [17:47:05] which might mean ‘tomorrow’ if you can stand it [17:47:13] !log manual failover of hadoop master node (analytics1001) to secondary (analytics1002) for maintenance (plus service restarts) [17:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:35] andrewbogott: wfm, thcipriani ? [17:47:43] andrewbogott: godog I can wait 'til tomorrow. No real rush. [17:48:35] kk, thanks andrewbogott thcipriani [17:49:00] graphs look good for purges so that's a wrap [17:49:10] puppet SWAT is over unless somebody has more patches [17:49:14] no other patches in the queue, we can hang out here for ten more minutes though [17:49:29] !log restbase deploy start of a42976cc82 [17:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:28] bblack: https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors-datacenters - made a few tweaks (repeat-row, %) and fixed the input metric to be .rate+scale60 instead of .sum (which is aggregated in a way you don't want) - also fixed N/A showing up half the time. Though that will still happen when the metric isn't reported in the last minute [17:50:28] (may wanna time-shift the whole dashboard by 1-2minutes) [17:50:51] let me know if you like it. 
Just experimenting [17:50:53] it's a nice board :) [17:52:37] mobrovac: restbase1008 puppet disabled cause raid explansion, godog fyi [17:52:48] the other restbase1* hosts have the config change [17:53:22] yup [17:53:44] (03PS3) 10Volans: mariadb: Add new es2011-2019 servers [puppet] - 10https://gerrit.wikimedia.org/r/271560 (https://phabricator.wikimedia.org/T127330) [17:55:21] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [17:55:40] (03CR) 10Volans: [C: 032] mariadb: Add new es2011-2019 servers [puppet] - 10https://gerrit.wikimedia.org/r/271560 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:57:01] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.010 second response time [17:57:08] (03CR) 10Nikerabbit: [C: 031] Enabling translation notifications at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271496 (https://phabricator.wikimedia.org/T126901) (owner: 10MarcoAurelio) [17:58:50] (03CR) 10Jcrespo: [C: 031] "Actually, replication will probably break on application, specially if row based replication is used (it happend to me on labsdb), because" [puppet] - 10https://gerrit.wikimedia.org/r/271495 (owner: 10Jcrespo) [18:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T1800). [18:00:41] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:51] !log reenable puppet on restbase1008 [18:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:14] mobrovac: I'll bounce restbase there since the config changed (it is depooled anyway) [18:01:41] godog: i skipped it in the deploy, i can re-deploy there [18:01:48] mobrovac: yes please! [18:01:48] gwicke cscott arlolra subbu greg-g : I plan to deploy graphoid shortly, and update mediawiki config a little to match the changes [18:02:00] godog: kk, will also force a puppet run there [18:02:01] ok. [18:02:02] anyone doing any deployments? [18:02:11] mobrovac: ack, puppet just ran btw [18:02:16] ah ok [18:02:17] :) [18:02:18] Krinkle: nice :) [18:02:41] Krinkle: I don't think I created that one. I'm not sure if I've ever even seen it before [18:02:41] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:02:58] bblack: but you just shared it with a*pergos [18:03:22] hm.. [18:03:37] maybe not. [18:03:40] not sure how I got there [18:03:46] anyways [18:03:53] !log applied a hotfix from https://secure.phabricator.com/D15306 on iridium to test a fix for https://phabricator.wikimedia.org/T127290 [18:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:02] godog: {{done}} [18:04:10] !log restbase deploy end of a42976cc82 [18:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:34] (03CR) 10Tim Landscheidt: "It runs in the Labs project Quarry (https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry; http://quarry.wmflabs.org/)." [puppet] - 10https://gerrit.wikimedia.org/r/260187 (owner: 10Dzahn) [18:04:42] mobrovac, are you deploying? 
[18:05:09] yurik: just look at the log a few lines up ^^^ [18:05:33] mobrovac, i see you deployed the rest of a4..., but it doesn't tell me if you have more :) [18:05:37] 6Operations, 10Traffic: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#2040513 (10BBlack) Note T127294 was fixed by disabling `do_gzip` on the cache_misc cluster. I suspect that there was a bad interaction there with streamed pass-mode response from t... [18:06:21] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 5Patch-For-Review: https://integration.wikimedia.org/ci/api/json is corrupted when required more than one time in a raw - https://phabricator.wikimedia.org/T127294#2038534 (10BBlack) 5Open>3Resolved Fixed by disabling do_gzip on the misc... [18:06:40] !log Updated Wikidata's property suggester with data from Monday's json dump [18:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:49] (03CR) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [18:07:39] yurik: go ahead, i'm done [18:09:59] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040542 (10RobH) [18:10:35] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) [18:10:38] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040557 (10RobH) [18:10:44] yurik, let me know once you are done. [18:11:10] subbu, if you want, go ahead now, i'm still testing it on the server, haven't started the actual git sync [18:11:30] i am not ready yet. :) [18:11:42] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040542 (10RobH) [18:11:42] i have to get my patches merged. [18:11:44] 6Operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2040574 (10RobH) [18:11:46] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) 5Open>3Resolved Task T127344 is for the setup/deployment of conf200[1-3]. This #hardware-requests is fulfilled. [18:12:51] FWIW, the Deployments page was incorrect/mis-copied, I was doing puppet swat last week, but not this week [18:14:19] moritzm: it's up to ops to update who is on point for those [18:14:21] subbu, ok, i'm ready [18:15:09] moritzm: good cause we took your one today and did it [18:15:49] so someone probably copy pasted whatever names were in there for creating next week's list then and we can ignore it [18:15:54] godog: ^^ [18:16:36] apergos: ah, yeah that makes sense, likely it was me! 
[18:16:44] :-) [18:16:46] (03PS2) 10Jcrespo: Allow heartbeat table to replicate to dbstore* hosts [puppet] - 10https://gerrit.wikimedia.org/r/271495 [18:16:59] apergos, godog: yeah, just checked you two were listed in the ops meeting of last week [18:17:07] yep [18:17:13] all good [18:17:39] !log re-enabled puppet on analytics1027 after maintenance [18:17:40] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 5Patch-For-Review: https://integration.wikimedia.org/ci/api/json is corrupted when required more than one time in a raw - https://phabricator.wikimedia.org/T127294#2040636 (10hashar) Nodepool is all happy reaching out to Jenkins API since do_... [18:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:07] mobrovac, 4/4 minions finished deploy - does it mean we now have 4 scb-s? or that it is still sending stuff to sca but gets ignored? [18:18:50] (03CR) 10Jcrespo: [C: 032] Allow heartbeat table to replicate to dbstore* hosts [puppet] - 10https://gerrit.wikimedia.org/r/271495 (owner: 10Jcrespo) [18:20:20] yurik: the latter :) [18:21:37] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 5Patch-For-Review: https://integration.wikimedia.org/ci/api/json is corrupted when required more than one time in a raw - https://phabricator.wikimedia.org/T127294#2040685 (10BBlack) For further followup, see also: https://phabricator.wikimed... [18:22:29] !log deployed graphoid https://gerrit.wikimedia.org/r/#/c/271563/ [18:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:00] subbu, go ahead with your stuff, i need a few more min to get configs in place [18:23:16] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040706 (10RobH) [18:23:19] hello people. We want to collect some basic API statistics for the article recommendation API served via Labs. My question is: are there standard/recommended ways for doing this? We are capturing our need in https://phabricator.wikimedia.org/T124502 and a sample API call is: http://recommend.wmflabs.org/api?s=en&t=fa&n=10&article=Tiger [18:23:55] yurik, i need another ~10 mins before i can start. [18:24:00] ok [18:24:19] !log rebooting labvirt1001 [18:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:31] leila: where in labs is that running? [18:24:33] tool labs? [18:24:35] yurik, but if you are ready before then, you can go ahead. [18:24:57] bblack: it seems it takes varnish 10min to process a purge req, is that possible? [18:25:06] actually, don't really matter, you could use eventlogging like you normally do [18:25:12] and it can log to the prod EL endpoint [18:25:49] that seems like a really long time [18:25:50] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2040742 (10mobrovac) [18:26:12] PROBLEM - Host ores.wmflabs.org is DOWN: CRITICAL - Host Unreachable (ores.wmflabs.org) [18:26:21] unlesss there i some other standard 'api usage' monitoring that i don't know about [18:26:39] I see, ottomata. can you help Nathaniel with finding where he should look at, some examples, etc.? [18:27:42] I heard from Dario that Ops may have some standard recommendations, something for example they recommend for ORES as well? If not, what you recommended sounds good to me, ottomata. 
[18:30:11] so ottomata, schana is here. can you tell us where we should look into to learn more? [18:31:11] ha, ummm, i just read the ticket you linked to, saw your mention of EL, and it sounded fine to me [18:31:25] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040827 (10RobH) [18:31:27] if there is some ORES recommendation, I don't know about it [18:31:27] 6Operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2040828 (10RobH) [18:31:29] maybe yuvipanda would know? [18:31:30] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2040825 (10RobH) 5Resolved>3Open So it turns out WMF3560 & WMF3565 were listed on our spares in codfw, but are actually in eqiad. I'm not sure how that happened... [18:31:35] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2040829 (10RobH) a:5RobH>3Ottomata [18:32:09] 6Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2040542 (10RobH) 5Open>3declined turns out two of these spares werent in this datacenter, mistake on the spares page. The request has to go back to the #hardware-requests stage. [18:32:10] waaaaah [18:32:21] (03PS1) 10Yurik: Revert "Set default graph vega version back to 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271571 [18:32:23] do you have an example of EL used outside of mediawiki context we can look into ottomata? [18:32:32] ottomata: so i fubard up the conf200[1-3] allocations and had to kick the task back to you with details [18:32:41] (03CR) 10jenkins-bot: [V: 04-1] Revert "Set default graph vega version back to 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271571 (owner: 10Yurik) [18:32:42] (turns out the spares page was wrong) [18:32:46] sorry about that dude =[ [18:32:51] (03CR) 10Giuseppe Lavagetto: [C: 032] logrotate: explicit ownership and permissions [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/271505 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [18:33:12] mobrovac: I'm not sure I understand your statement - what exactly is taking 10 minutes? [18:33:17] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2040847 (10Ottomata) I think those would work fine. Are you sure conf100x have 6 cores? I see 4. We def don't need 4*4TB, maybe you can swap out smaller HDDs and... [18:34:07] (03PS1) 10Aklapper: phabricator: redirect old sprint board URL to default board [puppet] - 10https://gerrit.wikimedia.org/r/271572 (https://phabricator.wikimedia.org/T127348) [18:34:16] bblack: a page edit happened, and a new version landed in rb storage, but it took varnish around 10 mins to discard the old render and serve the new one [18:34:33] mobrovac: but you're sure HTCP went out immediately? [18:34:46] PROBLEM - mysqld processes on es2012 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:35:18] 6Operations, 10Analytics, 10Analytics-Cluster: Audit Hadoop worker memory usage. 
- https://phabricator.wikimedia.org/T118501#1802076 (10Milimetric) p:5Triage>3Normal [18:35:19] PROBLEM - mysqld processes on es2013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:35:27] bblack: it went out +/- x ms after the new version was available in storage [18:35:40] bblack: we'll do a bit more tests to confirm [18:35:47] 6Operations, 10Adminbot: Add log bit in #countervandalism for CVN labs project - https://phabricator.wikimedia.org/T127352#2040865 (10Krinkle) [18:35:48] <_joe_> jynus, volans is that any of you? [18:35:53] those are new installs aren't they? [18:35:54] 6Operations, 10Adminbot: Add log bot in #countervandalism for CVN labs project - https://phabricator.wikimedia.org/T127352#2040880 (10Krinkle) [18:35:58] PROBLEM - mysqld processes on es2011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:36:05] yeah, there we go [18:36:05] mobrovac: in general, there's probably a lot of possible causes. our HTCP stuff is not in a good state (and hasn't ever been, historically) [18:36:06] sorry, puppet run and automatically added them... adding downtime [18:36:18] <_joe_> volans: common problem :/ [18:36:20] mobrovac: I would've more-likely suspected the purge didn't work at all, though [18:36:22] yes, he didn't know, my fault for not telling him [18:36:29] <_joe_> only the other servers don't page [18:36:29] PROBLEM - mysqld processes on es2018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:36:36] <_joe_> (non-dbs I mean) [18:36:37] ok es things are all known then got it [18:36:50] PROBLEM - mysqld processes on es2019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:37:05] bblack: hm but how come varnish discarded the old version then? we set the TTL to 3600 secs [18:37:13] PROBLEM - mysqld processes on es2017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:37:19] helloo yuvipanda. You actually may know the answer to my question for ottomata, too. Do you have an example where we have used EL outside of the mediawiki context? schana wants to tackle logging api usage and UI logging for article recommendation service. [18:37:31] mobrovac: are you sure the object you purged wasn't already 10 minutes away from its 3600 expiry from before? [18:37:32] leila: sure [18:37:46] bblack: yup, it was a new page we created [18:38:00] leila: https://github.com/wikimedia/apps-android-wikipedia/blob/a7685074590062047a735ba7da54062626f22557/app/src/main/java/org/wikipedia/analytics/EventLoggingEvent.java is in Java that I wrote for the Android app [18:38:03] (03PS1) 10Ori.livneh: Gradually invalidate mobile web parser cache entries with [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271573 [18:38:09] but it's fairly trivial and can be converted to whatever language [18:38:12] leila: hmmm, you might want to ask in #analytics [18:38:22] (03PS1) 10Yurik: Re-set default vega version to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271574 [18:38:22] do you have X-Cache headers from the test requests? (preferably one before when it wasn't purged yet, and the first one after when it was the new content?) [18:38:27] ah cool! [18:38:30] ottomata: I think we are good with yuvipanda's link. thank you! 
:-) [18:38:30] yuvipanda: I didn't know of that [18:38:30] nice [18:38:47] leila: ottomata schana https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/log-command-invocation is a python example [18:38:51] that we use in tool labs [18:39:26] thanks yuvipanda [18:39:30] could someone +2 tiny config change, and i will deploy it -- https://gerrit.wikimedia.org/r/#/c/271574/ [18:39:41] changing graph defaults to vega 2 [18:39:43] X-Cache: cp1068 pass+chfp(0), cp4010 hit(29), cp4010 frontend pass(0) [18:39:44] ottomata: :D [18:39:48] bblack: ^^ [18:40:02] yurik: I gotcha [18:40:05] mobrovac: the primary problems are going to be: (1) purge is racy: it's entirely possible for a purge to not really affect some caches due to race conditions and (2) with MW's exccessive purge rate, queues could be falling behind, although I'd think 10 minutes would be rare) [18:40:11] grrr [18:40:13] (03CR) 10Ottomata: [C: 032] Re-set default vega version to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271574 (owner: 10Yurik) [18:40:23] yurik: merged, you deploy ja? [18:40:34] ottomata, i take that grr back :) i thought you were answering to yuvipanda who just pinged you :))) [18:40:36] yep [18:41:03] mobrovac: why are two different layers giving hit + pass? [18:41:04] bblack: i see [18:41:13] schana: leila np :D do create docs out of it for future generations :D [18:41:29] subbu, i am about to depl tiny config change [18:41:48] (03Abandoned) 10Yurik: Revert "Set default graph vega version back to 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271571 (owner: 10Yurik) [18:41:48] sounds good. [18:41:48] bblack: good question, might be a vcl reconfig might be needed [18:41:51] <_joe_> ottomata: so, I have a problem with the varnishkafka submodule [18:42:07] oh! [18:42:07] ok [18:42:11] bblack: afaik, there's a pass in the front-end layer for anything ~ /api/rest_v1/ [18:42:11] mobrovac: I'm not aware of anything in the VCL that would pass in one layer and hit in another.... [18:42:18] _joe_: i'm cool with converting that one into non submodule [18:42:22] if that is your problem :) [18:42:28] (03PS1) 10Ori.livneh: Modify $wgRenderHashAppend when disabling responsive images on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271575 [18:42:37] <_joe_> ottomata: I merged a simple yet important change, https://gerrit.wikimedia.org/r/#/c/271505/ [18:42:37] mobrovac: I don't think so [18:43:07] <_joe_> ottomata: and well, I was about to merge the submodule in ops/puppet, but there are like 4 more changes I don't want to merge and I didn't look at [18:43:14] oh? [18:43:15] looking [18:43:19] (03CR) 10Ori.livneh: [C: 032] Modify $wgRenderHashAppend when disabling responsive images on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271575 (owner: 10Ori.livneh) [18:43:29] <_joe_> we're actually at 9b53e51fc45147330383559589fb59da871a785c [18:43:46] <_joe_> git log 9b53e51fc45147330383559589fb59da871a785c..master [18:44:01] (03Merged) 10jenkins-bot: Modify $wgRenderHashAppend when disabling responsive images on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271575 (owner: 10Ori.livneh) [18:44:11] <_joe_> merged by you :P [18:44:21] _joe_: it looks like those logs are all CI stuff from hashar [18:44:33] so that jenkins can run tests on the repo [18:44:38] <_joe_> ok if it's safe to merge... [18:44:40] those comits [18:44:45] <_joe_> I'll bbiab, just let me know [18:45:11] ori, did you juts merge a patch? 
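For the article-recommendation API statistics question above, the pattern used by the linked Java and Python examples is to serialize an EventLogging capsule as URL-encoded JSON and send it to the beacon endpoint, which hands the event to the analytics pipeline. A minimal sketch along those lines — the schema name, revision and event fields here are hypothetical placeholders; a real schema would first have to be defined on Meta.

    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    BEACON = 'https://meta.wikimedia.org/beacon/event?%s'

    def log_api_request(source, target, seed, n_results):
        capsule = {
            'schema': 'ArticleRecommendationAPIRequest',  # hypothetical schema
            'revision': 1,                                 # hypothetical revision
            'wiki': 'metawiki',
            'event': {
                'sourceLanguage': source,     # e.g. s=en in the sample call above
                'targetLanguage': target,     # e.g. t=fa
                'seedArticle': seed,          # e.g. article=Tiger
                'requestedResults': n_results,
            },
        }
        # The beacon takes the capsule as URL-encoded JSON and replies with an
        # empty response; the event then flows into the analytics pipeline.
        urlopen(BEACON % quote(json.dumps(capsule)))

    if __name__ == '__main__':
        log_api_request('en', 'fa', 'Tiger', 10)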
[18:45:20] bblack: you are right, no pass in vcl for api/rest_v1 [18:45:28] yurik: yes, to a different repo. [18:45:30] ori, i was about to deploy a file in config [18:45:33] ja should be _joe_ [18:45:41] yurik: go for it [18:45:42] ori, its in config's master [18:45:50] i git pulled and it got there too [18:45:51] aside from yours, they are all CI changes, mostly linting of the python ganglia module stuff [18:45:56] yurik: want me to deploy yours? [18:46:16] mobrovac: was there a recent change to the cache headers emitted by RB? [18:46:23] yurik: I can deploy both of our changes, or you can deploy yours and I'll do mine after. [18:46:24] ori, sure, sync-file both [18:46:28] got it [18:46:43] i already git pulled them [18:47:48] bblack: yes in the sense that we added them recently (there were no cache headers being sent before that) [18:48:01] recently like when and what was the change? [18:48:11] * apergos lurks for this discussion [18:48:37] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 5.77 ms [18:49:26] !log stopping, upgrading and reconfiguring dbstore1001 [18:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:02] bblack: https://github.com/wikimedia/restbase/blob/master/v1/mobileapps.yaml#L173 is adding the cache-control header, deployed last week [18:50:24] !log ori@mira Synchronized wmf-config/InitialiseSettings.php: I02dbbdb79ea9a: Re-set default vega version to 2 (duration: 03m 24s) [18:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:36] yurik: ^ [18:50:45] ori, thx! [18:53:07] !log tagged phabricator hotfixes as release/2016-02-18/2 in the phabricator/phabricator repository. This includes fixes T127290 and T127349 [18:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:06] yurik, let me know once you are done. [18:54:25] subbu, done [18:55:09] mobrovac: I donno, I'll have to look at this later, I need to get back to the gzip thing [18:55:13] thanks. starting parsoid deploy in a minute. [18:55:21] kk bblack, thnx! [18:55:57] twentyafterfour: I saw you got a lot of work done last night :-) where are we on https://wikitech.wikimedia.org/wiki/Incident_documentation/20160204-Phabricator actionables now (when you have time)? [18:56:00] !log starting parsoid deploy [18:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:07] mobrovac: maybe file a bug, and try to reproduce whatever purge behavior you're seeing on some rare/unpopular URL nobody else is hitting with logs of req/resp headers throughout (what it looked like when it first landed in the cache, then after supposedly purging, then later when it finally updates)? [18:56:14] !log ori@mira Synchronized wmf-config/mobile.php: Ib6fff26be162: Modify $wgRenderHashAppend when disabling responsive images on mobile (duration: 02m 24s) [18:56:15] mobrovac: can you ping me when you find out? curious [18:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:50] sounds good bblack, will investigate some more [18:56:52] morebots: but it's entirely possible the reason the layers are unusually inconsistent on pass-vs-hit is that we have mixed cache objects from before and after the change to the cache headers... [18:56:52] I am a logbot running on tools-exec-1210. [18:56:52] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:56:52] To log a message, type !log . 
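One rough way to follow bblack's suggestion above — reproduce the purge behaviour on a rarely-hit URL while keeping the request/response headers at each stage — is a small probe that snapshots the headers shown earlier (X-Cache, Age, Cache-Control, ETag) before the edit, right after the purge should have fired, and again later. Sketch only; the URL is a placeholder.

    import time
    import requests

    # Placeholder: pick a rarely-hit page so nobody else disturbs the cache object.
    URL = 'https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Some_rarely_hit_page'

    def snapshot(label):
        r = requests.get(URL)
        print(label, r.status_code,
              'x-cache=%r' % r.headers.get('x-cache'),
              'age=%r' % r.headers.get('age'),
              'cache-control=%r' % r.headers.get('cache-control'),
              'etag=%r' % r.headers.get('etag'))

    if __name__ == '__main__':
        snapshot('before edit')
        input('edit the page, wait for the purge to go out, then hit enter... ')
        snapshot('just after purge')
        time.sleep(600)  # the ~10 minute window reported above
        snapshot('ten minutes later')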
[18:56:58] lol [18:57:04] apergos: the first two items are still not done. I need to figure out the backups of /srv/phab/repos before I can move repos to /srv/repos [18:57:06] ugh [18:57:24] and I need to get T125851 updated and then reviewd [18:57:25] T125851: Refactor phabricator module in puppet to remove git tag pinning behavior - https://phabricator.wikimedia.org/T125851 [18:57:30] I guess mo does not complete to mobrovac every time :) [18:57:37] :) [18:57:53] https://phabricator.wikimedia.org/T114363 twentyafterfour this is done then? [18:58:25] apergos: mostly but it depends on the others before it can be truly considered {done} [18:58:33] ah so that's why it's still left open [18:58:34] ok [18:58:46] it's mostly unblocked currently [18:58:49] !log synced code; restarted parsoid on wtp1001 as a canary [18:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:00] the important thing is that scap3 is now unblocked for everyone [18:59:09] I'm gonna update the actionables on the incident report to show that though, it's still a lot of work that got done [18:59:14] so the services team can start using it as soon as they are ready [18:59:16] yep [18:59:28] someone wanted it righ taway, who was that [18:59:31] yeah a whole lot got done, I'm very pleased [18:59:52] ottomata is using scap3 right-away I believe [18:59:56] for eventbus [19:00:10] * twentyafterfour is exhausted from this past two weeks, it's been hell [19:00:26] indeed! [19:00:39] I going to get lunch, if anything goes boom with phabricator, call me [19:00:53] (I don't think anything will go boom, it seems totally stable now) [19:01:10] enjoy yer lunch! [19:01:15] all good; restarting parsoid on all nodes [19:01:45] btw ottomata, the scap3 provider and scap::target / scap debian package have all landed, hopefully my changes to scap::target didn't break your eventbus stuff [19:01:49] hoo: were you guys wanting the scap3 puppet provider? I have a feeling [19:02:38] the puppet provider works, you just need to specify the package with provider='scap3' and use a scap::target resource to configure everything. [19:03:16] apergos: Would be nice for dcat [19:03:20] but not urgent or so [19:03:28] so... it seems you can give it a try now :-) [19:03:30] it's only on a single box right now [19:03:52] package { something: provider => 'scap3', install_args => [ { 'owner' => 'username' }] } [19:04:01] !log finished deploying parsoid dfbafb60 [19:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:44] puppet will auto-deploy the package in /srv/deployment/something and ensure the directory has the correct ownership, then the rest is up to scap over ssh [19:05:04] * twentyafterfour gets lunch [19:05:06] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 11.54% of data above the critical threshold [100000000.0] [19:05:46] (03PS1) 10Volans: Depool es2001 to copy the data to es2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271577 (https://phabricator.wikimedia.org/T127330) [19:06:24] !log restarting apache on bohrium [19:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:38] twentyafterfour: i will find out the next time i deploy :) [19:06:44] and will poke you if it doesn't work! [19:06:46] thank you! 
[19:07:02] (03CR) 10Jcrespo: [C: 031] Depool es2001 to copy the data to es2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271577 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [19:07:20] 6Operations, 10ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2041078 (10Cmjohnson) Return information usps 9202 3946 5301 2430 9344 fedex 9611918 2393026 52426109 [19:08:37] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [19:11:07] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: Puppet has 1 failures [19:21:51] !log reboot labvirt1003 [19:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:57] PROBLEM - Host www.toolserver.org is DOWN: PING CRITICAL - Packet loss = 100% [19:26:02] hey opsen, we need some redirects in place for Phab: the upgrade last night made some commonly used URLs 404 [19:26:07] https://gerrit.wikimedia.org/r/#/c/271485/ and https://gerrit.wikimedia.org/r/#/c/271572/ [19:26:11] please help [19:26:25] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: No route to host [19:26:33] sigh [19:26:46] PROBLEM - showmount succeeds on a labs instance on tools.wmflabs.org is CRITICAL: No route to host [19:26:50] chasemp: forgot to fail this one over, heh [19:26:59] sorry guys that alert is almost certainly our maint [19:27:04] I didn't realize this was separate [19:27:09] * yuvipanda does [19:27:12] greg-g: i can review those cuz i get them ;] [19:27:15] no outage, then? [19:27:16] greg-g: tell me [19:27:22] jynus: nope [19:27:24] all planed? [19:27:29] then, no issue [19:27:35] chasemp: I changed it to tools-checker-01 [19:27:39] thanks [19:27:53] greg-g: I'll do those [19:28:29] jynus: right, no outage, all planned, just unintended broken urls [19:28:32] (03CR) 10ArielGlenn: [C: 032] phabricator: redirect old sprint board URL to default board [puppet] - 10https://gerrit.wikimedia.org/r/271572 (https://phabricator.wikimedia.org/T127348) (owner: 10Aklapper) [19:28:42] oh, you were asking about the other thing, my bad [19:28:51] thanks both robh and apergos [19:28:54] oh ariel has them it seems [19:29:03] thats cool, they were both localized to phab only and low risk [19:29:26] I'll +1 each (since apergos is +2ing) [19:29:28] https://gerrit.wikimedia.org/r/#/c/271485/ needs rebase [19:30:13] (03CR) 10RobH: [C: 031] phabricator: redirect old task creation URL to new one [puppet] - 10https://gerrit.wikimedia.org/r/271485 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [19:30:20] Krenair: you around to rebase ^^ ? [19:30:35] yep [19:30:36] sorry, greg-g I was talking about labs [19:30:37] apergos: I didnt do any merges or anything else either, just the +1. just fyi [19:30:47] no issue with phab [19:30:49] uh huh [19:30:49] jynus: yeah, my bad :) [19:31:01] (it paged, so my question) [19:31:04] * greg-g nods [19:31:12] still need help? [19:32:17] can't load any page from any wiki, anything with connection? [19:32:19] (03PS2) 10ArielGlenn: phabricator: redirect old task creation URL to new one [puppet] - 10https://gerrit.wikimedia.org/r/271485 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [19:32:21] nm I rebased [19:32:26] oh, ok [19:32:33] I was just doing it :) [19:32:44] Danny_B, wfm [19:33:02] Danny_B, do you get an error? [19:33:06] Danny_B: wfm, no server errors, nothing deployed recently, what's up? 
[19:33:28] sorry Krenair :-) [19:33:40] any of you in europe? [19:33:43] (03CR) 10ArielGlenn: [C: 032] phabricator: redirect old task creation URL to new one [puppet] - 10https://gerrit.wikimedia.org/r/271485 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [19:33:52] i can't even load phabricator [19:34:25] i'm in europe Danny_B [19:34:33] Danny_B, so do you get an error? [19:34:46] server timeout [19:34:56] from the browser? or some sort of server error? [19:34:57] actually connection timeout [19:35:00] ok [19:35:01] browser [19:35:02] can you run traceroute? [19:35:08] mmt [19:35:41] running [19:36:16] <_joe_> Danny_B: are you on linux/mac? [19:36:19] greg-g: got anyone to test those changes? now live [19:36:33] _joe_: win [19:36:37] <_joe_> if so, can you post the result of 'host phabricator.wikimedia.org' [19:36:38] sorry ;-) [19:36:47] <_joe_> err, nslookup then IIRC [19:36:57] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:37:15] <_joe_> I need to know which caching pop you're trying to connect to [19:37:23] ok, i end up on 130.244.64.186 and can't get further [19:37:27] I'm afraid we're back to the same problem we had one week ago when it comes to Tele2 / UPC having connectivity issues. [19:37:34] Same here. [19:37:35] yes [19:37:38] that's tele2 [19:37:42] <_joe_> andre__: where are you located? [19:37:48] _joe_, same city as Danny_B [19:37:49] <_joe_> oh geez [19:37:55] <_joe_> wikipedias work? [19:38:04] nope [19:38:08] <_joe_> andre__: can you check if it's phab or everything? [19:38:10] as i wrote above [19:38:13] <_joe_> ok [19:38:17] _joe_, same as last week, also en.wp and such [19:38:17] <_joe_> paging paravoid [19:38:24] i can't connect anywhere [19:38:31] _joe_, exactly same story as last week I'm afraid [19:38:34] but labs actually [19:38:53] hey [19:38:53] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Text%20caches%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1455824301&g=network_report&z=large [19:38:54] at least ssh to labs works [19:38:54] I'm here [19:38:56] what's up? [19:39:08] <_joe_> paravoid: seems like tele2 agani [19:39:13] paravoid: same story as seven days ago for people on Tele2 [19:39:36] There is a report on #wikipedia-fr, some contributor experiences difficulties also. [19:39:46] yes [19:40:05] apergos: krenair or andre can [19:40:05] paravoid, _joe_: traceroute: http://fpaste.org/325063/ [19:40:10] I'm obviously way too distracted right now [19:40:17] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [19:40:18] !log cr2-knams: deactivating BGP with 1257 [19:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:25] better? [19:40:45] <_joe_> Danny_B, andre__ is it back? [19:40:50] yes [19:40:53] apergos, lgtm [19:40:54] _joe_, yes [19:40:57] cool [19:40:59] worrying [19:41:01] thanks Krenair [19:41:03] (sorry, I missed your message earlier) [19:41:26] Orange for the French contributor as ISP. 
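The nslookup request above is about working out which GeoDNS caching POP the affected users are being routed to; something like the following does the equivalent lookup from the affected network (a sketch — any of the affected hostnames would do).

    import socket

    def resolve(hostname):
        # The addresses a name resolves to locally; with GeoDNS this hints at
        # which caching POP this network is being sent to.
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    if __name__ == '__main__':
        for name in ('phabricator.wikimedia.org', 'en.wikipedia.org'):
            print(name, resolve(name))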
[19:41:32] andre__: https://phabricator.wikimedia.org/T119057 would help you but stuck in legal review :) [19:41:32] <_joe_> Danny_B: thanks for reporting and sorry if I was slow to catch up, was just back from dinner [19:41:41] no prob [19:42:04] not your fault [19:42:08] andre__: (non-public, but you should have access) [19:42:32] I was here the whole time, I just missed it until _joe_ mentioned my nick :) [19:42:33] (03PS6) 10Dzahn: exim: rewriting rule for maint-announce@ mail to phab [puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) [19:42:41] ah, thx [19:43:15] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.121 second response time [19:43:28] RECOVERY - showmount succeeds on a labs instance on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.039 second response time [19:46:53] (03PS5) 10Phedenskog: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [19:47:36] someone started some maintenance or rename at 19UTC? [19:47:55] jouncebot, previous [19:48:04] (03CR) 10Phedenskog: webperf: Create new navtiming metric with higher value limit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [19:48:35] there is like a double amount of writes than previouly on enwiki [19:49:04] 19:00 like 49 minutes ago? [19:49:14] today? [19:49:16] yes [19:49:19] hrmmmmm [19:49:29] let me se other wikis [19:49:35] nothing obvious in sal [19:49:56] yeah, s2 too [19:50:21] I have to go to a long meeting, I'll be on IRC though, afk for a bit [19:50:32] (03PS3) 10Andrew Bogott: Move the labs pdns db server into hiera. [puppet] - 10https://gerrit.wikimedia.org/r/271465 [19:50:34] !log reboot labvirt1004 [19:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:50] most wikis, at least most "text" wikis [19:53:06] jynus: parsoid perhaps? there was a deploy [19:53:37] greg-g: ^ [19:53:46] jynus: what sort of writes? [19:53:50] !log upgrade Cassandra to 2.1.13 on restbase-test200[1-3].codfw.wmnet (restbase staging) : T126629 [19:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:57] UPDATES [19:54:29] I am trying to see them [19:55:01] https://gerrit.wikimedia.org/r/#/c/271575/ (which I pushed out) invalidates the mobile web parser cache, so that could be related [19:55:37] (03CR) 10Aklapper: "Nope, that didn't work as it just redirects to https://phabricator.wikimedia.org/project/board/ now and loses the board number. Meh. 
Someo" [puppet] - 10https://gerrit.wikimedia.org/r/271572 (https://phabricator.wikimedia.org/T127348) (owner: 10Aklapper) [19:55:40] UPDATE /* EmailNotification::updateWatchlistTimestamp -- pure speculation for now [19:55:40] if that is it, it should subside fairly quickly; if the write load is too high, we can revert it [19:55:47] oh, that's not related [19:55:52] no idea what that is [19:56:16] let me see the binlog [19:56:23] for better evaluation [19:58:16] on s1 the incrase is sharp between 18:52 and 19:02, flatting at the new level since then [19:58:23] s/incrase/increase/ [19:58:47] PROBLEM - DPKG on restbase-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:58:56] no [19:59:05] it is either UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch [19:59:27] PROBLEM - DPKG on restbase-test2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:59:29] no or [19:59:31] it is that [20:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160218T2000). [20:02:06] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [20:03:24] was there a wikidata deploy? [20:03:34] I didn't see it in the backread [20:03:42] hoo.... ? [20:04:36] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [20:04:37] maybe a user purging could cause that? [20:09:18] (03CR) 10Andrew Bogott: [C: 032] Move the labs pdns db server into hiera. 
[puppet] - 10https://gerrit.wikimedia.org/r/271465 (owner: 10Andrew Bogott) [20:10:19] if we get more writes, but they are all small like that, I can sign that, what I am worried about is the patern shift without explanation [20:12:01] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [20:12:01] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [20:12:01] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:12:06] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.151, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:13:26] PROBLEM - Restbase root url on restbase-test2003 is CRITICAL: Connection refused [20:13:48] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [20:13:49] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused [20:13:49] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:13:49] PROBLEM - restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.147, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:14:07] PROBLEM - Restbase root url on restbase-test2002 is CRITICAL: Connection refused [20:14:26] PROBLEM - Restbase root url on restbase-test2001 is CRITICAL: Connection refused [20:14:27] <_joe_> mobrovac urandom gwicke ^^ [20:14:33] if someone want to have a look later, 3rd graph on the right at: https://tendril.wikimedia.org/host/view/db1052.eqiad.wmnet/3306 [20:14:37] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [20:14:47] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.200, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:15:58] _joe_: looking into it [20:16:01] it's staging though [20:16:17] !log reboot labvirt1005 [20:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:56] <_joe_> mobrovac: that's why I just notified you instead of looking into it :P [20:17:28] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.026 second response time [20:17:36] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [20:18:04] _joe_: :P [20:18:10] _joe_: all good now ^^^ [20:18:18] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.033 second response time [20:18:28] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:18:36] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [20:19:18] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15253 
bytes in 0.026 second response time [20:19:26] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [20:19:26] RECOVERY - Auth DNS for labs pdns on labtest-ns0.wikimedia.org is OK: DNS OK: 0.090 seconds response time. nagiostest.eqiad.wmflabs returns [20:19:47] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [20:19:52] ^andrew is this a checker node going down? the pdns stuff [20:19:57] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.122 second response time [20:20:38] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [20:20:59] !log restarted restbase in staging (5 min. delay) [20:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:16] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [20:21:27] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.140 second response time [20:22:37] RECOVERY - Restbase root url on restbase-test2003 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.136 second response time [20:34:11] (03PS1) 10Krinkle: wmfstatic: Set MW_NO_SESSION to disable automatic session creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271602 [20:36:09] (03CR) 10Krinkle: [C: 032] "Per T99096 and CommonSettings, this is still only enabled on testwiki and mediawiki.org, so impact is low." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271602 (owner: 10Krinkle) [20:36:16] ori: ^ [20:36:19] anomie: [20:36:45] 👍 [20:37:37] (03Merged) 10jenkins-bot: wmfstatic: Set MW_NO_SESSION to disable automatic session creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271602 (owner: 10Krinkle) [20:39:35] Krinkle: Looks sensible. [20:39:42] thx [20:39:55] !log krinkle@tin Synchronized w/static.php: Set MW_NO_SESSION (duration: 01m 55s) [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:20] !log reboot labvirt1006 [20:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master