[00:00:05] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151222T0000). [00:00:05] Krenair MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:07:57] hi there. [00:10:03] so… RoanKattouw in on vacation and ostriches is not around today. Krenair, you there? [00:10:22] ok [00:12:59] patches are going through jenkins [00:13:37] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1896643 (10ori) Diffing heaps on mw1015 points to `xmlSearchNsByHref` being the culprit. [00:16:11] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896646 (10coren) Logo exists there: https://upload.wikimedia.org/wikipedia/commons/4/4a/Wikimania_Montreal_logo.png Announcement allowing the switch being thrown pending only... [00:18:06] (03CR) 10Alex Monk: [C: 04-2] "should be merged with Ib8513531" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260515 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [00:19:04] (03PS1) 10Dzahn: add wikimania2017 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [00:19:11] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896648 (10coren) >>! In T122062#1896571, @Dzahn wrote: > do you want this -> T96564 for 2017 wiki as well or no? It seems to make sense, given the (comparatively) low volume... [00:19:45] (03CR) 10Alex Monk: [C: 04-2] "This is supposed to all be in one commit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [00:22:47] !log krenair@tin Synchronized php-1.27.0-wmf.9/extensions/SyntaxHighlight_GeSHi/modules/ve-syntaxhighlight/ve.ui.MWSyntaxHighlightDialogTool.js: https://gerrit.wikimedia.org/r/#/c/260429/ (duration: 00m 30s) [00:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:52] Krenair: why cant a logo be added that is used later in config [00:23:16] i can merge that into one though [00:23:32] mutante, 1 wiki creation = 1 commit please [00:27:43] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896664 (10Andrew) confirmed -- these systems are no longer needed and can be reclaimed. [00:29:25] !log krenair@tin Synchronized php-1.27.0-wmf.9/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/260492/ (duration: 00m 32s) [00:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:37] MatmaRex, ^ [00:29:40] please test [00:30:01] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896666 (10RobH) p:5Triage>3Normal [00:30:44] (03PS2) 10Dzahn: add wikimania2017 wiki to InitSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [00:30:45] mhm [00:31:06] Krenair: ^ git squash :p [00:31:13] thank you [00:31:42] MatmaRex, so is your patch working as expected? 
[00:31:44] Krenair: well, i don't see either of the fixes deployed on en.wp, presumably something's cached, but i can't tell what [00:31:56] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1894746 (10RobH) [00:32:01] I just checked test.wikipedia.org [00:32:46] Krenair: both seem okay on https://en.wikipedia.org/wiki/Pheel_Khana_School now that i disabled cache in dev tools. [00:33:29] (03CR) 10Dzahn: "git squashed the other 2 changes in here" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [00:33:41] i guess it's just the 5-minute caching of startup module. [00:33:46] so, yay! [00:33:52] yeah [00:34:45] (03PS3) 10Dzahn: add wikimania2017 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [00:38:00] (03PS1) 10RobH: reclaim nembus/neptunium to spares [puppet] - 10https://gerrit.wikimedia.org/r/260522 [00:40:00] (03PS1) 10RobH: Decomission/reclaim nembus/neptunium [dns] - 10https://gerrit.wikimedia.org/r/260523 [00:42:37] (03PS3) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [00:45:59] (03PS4) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [00:46:08] PROBLEM - Host neptunium is DOWN: CRITICAL - Host Unreachable (208.80.154.6) [00:46:18] PROBLEM - Host nembus is DOWN: PING CRITICAL - Packet loss = 100% [00:47:01] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896699 (10RobH) [00:49:51] (03CR) 10RobH: [C: 032] Decomission/reclaim nembus/neptunium [dns] - 10https://gerrit.wikimedia.org/r/260523 (owner: 10RobH) [00:50:18] (03PS4) 10Dzahn: add wikimania2017 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [00:52:04] (03PS5) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [00:52:10] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896709 (10Dzahn) >>! In T122062#1896648, @coren wrote: > It seems to make sense, given the (comparatively) low volume of that Wiki. Yes, please. Ok, amended and squashed thi... [00:52:47] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896710 (10RobH) [00:53:31] (03CR) 10RobH: [C: 032] reclaim nembus/neptunium to spares [puppet] - 10https://gerrit.wikimedia.org/r/260522 (owner: 10RobH) [00:54:04] (03PS6) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [00:54:13] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 (owner: 10Yuvipanda) [00:54:18] (03PS5) 10Dzahn: add wikimania2017 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [00:54:45] robh: I merged your change too [00:54:52] thx! [00:54:55] beat me to it =] [00:55:46] mutante: who loves Hiera? 
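A minimal sketch of the squash being asked for above (one wiki creation = one Gerrit change). The change numbers are the ones in this log, but the patch-set refs and the exact local steps are assumptions, not a record of what was actually run:

    # fetch the change that will survive, stack the two follow-ups on top, then fold them in
    git fetch origin refs/changes/21/260521/1 && git checkout FETCH_HEAD
    git fetch origin refs/changes/14/260514/1 && git cherry-pick FETCH_HEAD
    git fetch origin refs/changes/15/260515/1 && git cherry-pick FETCH_HEAD
    git rebase -i HEAD~3                   # mark the two cherry-picks as "fixup"
    git push origin HEAD:refs/for/master   # re-upload for review

Because the surviving commit keeps 260521's Change-Id footer, Gerrit records the squashed result as a new patch set of that change rather than opening a new one.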
[00:56:10] 6operations, 10ops-codfw, 10hardware-requests: wipe disks and add nembus back to server spares - https://phabricator.wikimedia.org/T122100#1896711 (10RobH) 3NEW a:3Papaul [00:56:26] cajoel: i think it's a love-hate relationship [00:56:51] I can't decide if it's solving any problems for me.. [00:56:51] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1894746 (10RobH) [00:57:08] maybe slightly more readable? [00:57:25] but in general, yea, it removes config from the roles [00:57:41] changing some value doesnt require touching actual puppet code [00:57:55] 6operations, 10ops-eqiad, 10hardware-requests: wipe neptunium and add back to spares - https://phabricator.wikimedia.org/T122101#1896724 (10RobH) 3NEW a:3Cmjohnson [00:58:41] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896744 (10RobH) [00:59:04] before we had it the same things would be in role classes or site.pp directly [00:59:15] solves the 2-spaces 4-spaces war ?? :) [00:59:19] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1896748 (10RobH) 5Open>3Resolved All remote steps complete and on-site tasks dispatched for the disk wipe and addition back to the server spares tracking. This task for the overall steps and... [00:59:30] nice is that you can apply things to an entire role or data center [00:59:38] but you might not need that here [01:00:10] like if you see the hieradata/ structure in the ops/puppet repo [01:00:23] It's been about 18 months since I've looked at puppet, and it's gotten crazy.. [01:00:38] there is role/ and ./eqiad and ./common etc [01:00:41] lots of different ways to accomplish the same sorts of things.. [01:00:47] yes [01:01:26] I started big, toying with lots of the newer stuff, but I think I'm going to pare it way back down to keep it maintainable .. 
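To make the Hiera walkthrough above concrete, this is roughly how the hieradata/ tree in operations/puppet is laid out and how a value gets flipped without touching Puppet code. The directory names come from the conversation; treat the listing as approximate rather than an exact snapshot of the repo:

    # in a checkout of operations/puppet (layout approximate)
    ls hieradata/
    #   common.yaml  common/  codfw/  eqiad/  role/  hosts/  ...
    # a key under role/ applies to every node carrying that role,
    # a key under eqiad/ or codfw/ applies to a whole data centre,
    # and a per-host file overrides both, so changing a setting means
    # editing data, not the module or role class that consumes it:
    grep -rn do_paging hieradata/ modules/monitoring/ | head

(do_paging is the per-host paging switch that comes up later in this log.)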
[01:02:35] yes, you might not need it, and it's possible to add it on later [01:02:57] just separate config from modules by using role classes [01:04:28] (03Abandoned) 10Dzahn: add wikimania2017 wiki to InitSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260514 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [01:04:48] (03Abandoned) 10Dzahn: add wikimania2017 wiki to db lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260515 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [01:05:40] (03PS6) 10Dzahn: add wikimania2017 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) [01:07:24] actually, right now i'd have to debug something that attemps a hiera lookup [01:07:41] for the icinga notification groups [01:13:26] (03PS2) 10Dzahn: beta: update submodules for mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/260338 (https://phabricator.wikimedia.org/T122018) (owner: 10Hashar) [01:14:36] (03CR) 10Dzahn: [C: 032] "beta-only and already cherry-picked on beta master by hashar" [puppet] - 10https://gerrit.wikimedia.org/r/260338 (https://phabricator.wikimedia.org/T122018) (owner: 10Hashar) [01:17:16] (03PS2) 10Dzahn: Add a .bash_profile for myself [puppet] - 10https://gerrit.wikimedia.org/r/260206 (owner: 10Hoo man) [01:18:07] (03CR) 10Dzahn: [C: 032] Add a .bash_profile for myself [puppet] - 10https://gerrit.wikimedia.org/r/260206 (owner: 10Hoo man) [01:24:52] (03PS1) 10Dzahn: add piwik.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/260525 (https://phabricator.wikimedia.org/T103577) [01:25:59] (03CR) 10Dzahn: "here's a DNS change to add the name this depends on: https://gerrit.wikimedia.org/r/#/c/260525/" [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) (owner: 10Ori.livneh) [01:30:08] (03CR) 10Dzahn: conftool: add support for ACLs, helper scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258975 (owner: 10Giuseppe Lavagetto) [01:37:41] (03CR) 10Dzahn: [C: 04-1] "" Syntax error at '<<'; expected '}'" on line 67 ... in compiler.. but why ?" [puppet] - 10https://gerrit.wikimedia.org/r/220085 (owner: 10Alexandros Kosiaris) [01:39:48] (03CR) 10Dzahn: [C: 031] keep fewer dataset web server logs, add date to filename [puppet] - 10https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [01:44:45] (03PS2) 10Dzahn: Disable AQS cassandra CQL interface check until AQS is production ready [puppet] - 10https://gerrit.wikimedia.org/r/247910 (https://phabricator.wikimedia.org/T78514) (owner: 10Ottomata) [01:45:27] (03CR) 10Dzahn: "maybe reopen T78514 to discuss why it notifies to often?" [puppet] - 10https://gerrit.wikimedia.org/r/247910 (https://phabricator.wikimedia.org/T78514) (owner: 10Ottomata) [01:46:29] (03CR) 10jenkins-bot: [V: 04-1] Disable AQS cassandra CQL interface check until AQS is production ready [puppet] - 10https://gerrit.wikimedia.org/r/247910 (https://phabricator.wikimedia.org/T78514) (owner: 10Ottomata) [01:47:01] (03CR) 10Dzahn: "imho don't abandon. what kept us from merging it? 
looks like just did not get reviews but is still valid" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [01:48:41] (03CR) 10Dzahn: [C: 031] "can't hurt to make it configurable and since the hiera lookup default is 'true' if it fails" [puppet] - 10https://gerrit.wikimedia.org/r/249222 (https://phabricator.wikimedia.org/T109173) (owner: 10Gergő Tisza) [01:51:27] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1896855 (10Dzahn) @Robh we should also copy the /home directories over and merge files into the home dirs on bast1001 and/or give people a heads-up so they can copy it themselves. [01:52:20] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1896862 (10RobH) I think we can just tell folks to make their own copies, as its opsen/roots only on iron in the first place. [02:00:29] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1896872 (10ori) I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause. * on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpd... [02:13:12] (03CR) 10BryanDavis: "The NFS problem I saw when testing this last may have been due to /etc/exports being empty before Vagrant runs. I found while trying to se" [puppet] - 10https://gerrit.wikimedia.org/r/245920 (owner: 10BryanDavis) [02:13:47] (03PS1) 10Dzahn: disable paging for labtestcontrol2001 in hiera [puppet] - 10https://gerrit.wikimedia.org/r/260527 (https://phabricator.wikimedia.org/T120047) [02:15:11] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/260527/1" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [02:15:46] (03CR) 10Dzahn: [C: 032] disable paging for labtestcontrol2001 in hiera [puppet] - 10https://gerrit.wikimedia.org/r/260527 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn) [02:21:45] (03PS1) 10Dzahn: disable paging for labtest[neutron|services]2001 [puppet] - 10https://gerrit.wikimedia.org/r/260529 (https://phabricator.wikimedia.org/T120047) [02:22:36] (03CR) 10Dzahn: [C: 032] disable paging for labtest[neutron|services]2001 [puppet] - 10https://gerrit.wikimedia.org/r/260529 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn) [02:23:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 47s) [02:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:18] !hiera [02:28:18] ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [02:28:36] !hiera del [02:28:36] Successfully removed hiera [02:30:15] don't get my hopes up [02:30:20] !hiera example: ./utils/hiera_lookup --site=codfw --fqdn=labtestcontrol2001.wikimedia.org do_paging -v (--site is needed when fqdn doesn't reveal it. you can run this f.e. in your home on palladium after cloning the puppet repo) [02:30:27] !hiera is example: ./utils/hiera_lookup --site=codfw --fqdn=labtestcontrol2001.wikimedia.org do_paging -v (--site is needed when fqdn doesn't reveal it. you can run this f.e. 
in your home on palladium after cloning the puppet repo) [02:30:27] Key was added [02:30:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 22 02:30:28 UTC 2015 (duration 6m 54s) [02:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:48] ori: :p i needed the --site= thing, usually it gets it from the FQDN but it cant tell where we use wikimedia.org [02:33:42] and i want to disable paging for labtestcontrol2001.wm.org [02:34:02] i assume you would have also told cajoel hiera is maybe not needed for OIT setup ?:p [02:42:12] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [02:44:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [02:44:28] that was me.. bbl now [02:46:49] https://twitter.com/SpaceX/status/679113860960288769 [02:47:28] robh: ^ congrats to the spacex guy.. laters [02:56:29] yea thats pretty slick. im supposed to do a trip with that dude and some other friends this spring hopefully, heh [02:57:17] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1896920 (10Dzahn) merged the hiera change above. ran puppet on neon, it would still add it, used puppetstoredconfigclean.rb to remove store... [03:01:20] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1896923 (10Dzahn) @chasemp @andrew can i enable puppet on labtestcontrol2001 (even if just for a while) or does that mess with testing? [03:14:13] RECOVERY - Disk space on restbase1003 is OK: DISK OK [03:40:52] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1896972 (10Nuria) @Dzahn: the machine we are requesting will receive traffic from annual report website via js beacon. Can w... [04:04:03] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 14.81% of data above the critical threshold [100000000.0] [04:10:25] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [04:36:39] (03PS2) 10Aaron Schulz: [WIP] Set initial $wgMaxUserDBWriteDuration value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260507 (https://phabricator.wikimedia.org/T95501) [04:37:34] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [05:02:14] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:27:06] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1897054 (10GWicke) This ticket is resolved for RESTBase. This leaves AQS (@jallemandou) and the maps service (@yurik or @maxsem?). Upgrade instructions: ``` sudo apt-get install cassandra cassandra-tool... [05:31:46] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1897057 (10Yurik) Neither of us have access, so leaving it to @akosiaris. From IRC: Upgrade one server at a time: ``` yes, just make sure to wait long enough after each upgrade for the server to... 
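The Cassandra upgrade instructions quoted above are cut off by the bot; what they describe is the usual one-node-at-a-time rolling upgrade. A hedged sketch of that loop, with the drain/wait steps being standard Cassandra practice rather than anything spelled out here, and the tools package left unnamed because the task text is truncated:

    # on each Cassandra node, one at a time
    nodetool drain                   # flush memtables and stop accepting writes
    sudo apt-get install cassandra   # plus the matching tools package named in the task
    sudo service cassandra restart
    nodetool status                  # wait until every node is back to UN (Up/Normal)
                                     # before moving on to the next host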
[05:35:27] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 314 [05:36:06] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 303 [05:36:32] I thought the codfw db slave paging was disabled earlier today.... [05:36:36] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 307 [05:36:42] as its not in production use, simply replication.... [05:37:27] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 359 [05:38:25] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: puppet fail [05:40:45] well, jynus is set to get paged now [05:41:52] except its too early there [05:41:58] i jut logged into aql to check [05:42:07] but im going to text him cuz its 8:42 his time [05:42:08] am [05:42:14] yeah sounds good [05:42:24] (he is paged on the base EU timezone which is after his local time) [05:42:31] text sent =] [05:42:46] I dont think we need to worry, but I rather over-react to a page than under. [05:43:15] I thought the paging was getting disabled for these while he was actively troubleshooting the issue (and perhaps still plan to, jaime has a lot going on!) [05:43:55] we need to split up the EU paging timezones a bit more. most of them are lumped together iirc and it results in most of them rotating on and off at the exact same time [05:44:15] (when they are really spread across 3+ timezones) [05:45:53] meh, no answer from jaime but blehhhhhh [05:46:07] I dont think its worth calling him and possibly waking when they are replication slaves.... [05:46:17] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Seconds_Behind_Master: 9 [05:46:29] and there we go =P [05:47:17] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Seconds_Behind_Master: 52 [05:48:19] bleh, and my dash clock was wrong, its far earlier for jaime than i thought [05:48:25] * robh sends apology email rather than text [05:48:31] seems rude to say sorry for texting via text... [05:48:48] (clock was 2 hours off for him) [05:53:18] well, it seems to just be a small section of the s3 replication. 
i dropped a note to jaime with the details (and an apogy for the early sms) [05:57:11] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 301 [05:57:46] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [05:59:07] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [06:04:24] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 74092 MB (3% inode=99%) [06:06:04] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:08:14] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: puppet fail [06:21:03] robh: the change to disable those pages should be [06:21:05] https://gerrit.wikimedia.org/r/#/c/260377/4 [06:21:21] he said earlier though he'd rather fix the actual cause [06:21:52] if it keeps going we could merge though [06:22:29] out again [06:27:52] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1897101 (10Dzahn) @Nuria thanks, yes, i added a DNS change to add piwik.wm.org https://gerrit.wikimedia.org/r/#/c/260525/ [06:30:53] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:54] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:14] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:15] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] (03CR) 10Dzahn: [C: 031] "tested in compiler, works, no change on eqiad db, and becomes critical => false in codfw -> http://puppet-compiler.wmflabs.org/1536/" [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo) [06:33:14] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: puppet fail [06:33:14] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:43] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:44] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:37:24] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:54] RECOVERY - Disk space on restbase1008 is OK: DISK OK [06:55:23] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:55:55] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [06:57:04] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:24] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:04] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:24] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:24] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:12:14] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [08:35:44] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [08:42:14] !log restarting and reconfiguring mysql at db2036 [08:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:00:20] (03PS1) 10TTO: Set $wgPageLanguageUseDB = true for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) [09:01:46] (03PS2) 10TTO: Set $wgPageLanguageUseDB = true for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) [09:06:16] good morning [09:06:45] morning [09:10:18] huomenta [09:14:13] בוקר טוב [09:15:05] (03PS5) 10Jcrespo: Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377 [09:16:35] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 1 failures [09:18:17] (03CR) 10Ori.livneh: [C: 031] "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/260525 (https://phabricator.wikimedia.org/T103577) (owner: 10Dzahn) [09:21:32] (03PS1) 10Muehlenhoff: Remove now obsolete OpenDJ server module [puppet] - 10https://gerrit.wikimedia.org/r/260542 [09:26:48] (03PS2) 10Muehlenhoff: Remove now obsolete OpenDJ server module and related templates/files [puppet] - 10https://gerrit.wikimedia.org/r/260542 [09:33:39] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1897213 (10jcrespo) Adding mark and robh as requested. [09:36:44] (03CR) 10Jcrespo: [C: 032] Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo) [09:42:14] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:22] !log precise and trusty salt packages with wmf patches deployed manually on dataset1001 and analytics1001, seem to work fine [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:24] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [09:56:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[09:58:42] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1897272 (10ArielGlenn) Trusty and precise packages now deployed manually on analytics1001 and dataset1001 after labs testing. Look good. Next move... [10:01:53] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: puppet fail [10:06:23] !log Restarting Jenkins [10:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:21] !log online resizing /srv/postgres on labsdb1006 +100GB [10:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:14] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [10:29:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:30:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 650 [10:32:00] 6operations, 7Tracking: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1897296 (10MoritzMuehlenhoff) [10:32:47] 6operations: suspected opendj file descriptor leak on neptunium - https://phabricator.wikimedia.org/T84082#1897301 (10MoritzMuehlenhoff) 5Open>3Resolved a:3MoritzMuehlenhoff opendj neptunium has been replaced by a different server based on openldap. [10:35:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 946 [10:38:59] 6operations, 5Patch-For-Review: Assign salt grains to server groups for debdeploy - https://phabricator.wikimedia.org/T111006#1897318 (10MoritzMuehlenhoff) 5Open>3Resolved All server groups are now complete (along with DC-based assignment) [10:40:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 999 [10:40:16] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1897325 (10MoritzMuehlenhoff) [10:40:19] 6operations, 6Labs, 10Labs-Infrastructure, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1897323 (10MoritzMuehlenhoff) 5Open>3Resolved This is available in the new servers based on OpenLDAP. [10:45:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 65302 Threads: 155 Questions: 4333917 Slow queries: 939 Opens: 2700 Flush tables: 2 Open tables: 346 Queries per second avg: 66.367 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:45:24] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: puppet fail [10:47:39] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1897327 (10mark) I'm... confused. So is this just a problem with XFS not growing to the new LV size, or do we actually need larger hardware? Or both? 
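On mark's question at the end of the db1024 comment above: growing the logical volume is only half the job, XFS also has to be told to expand into the new space, which it can do while mounted. A generic sketch with placeholder volume names, not db1024's real layout:

    lvextend -L +500G /dev/vg0/srv   # grow the LV (VG/LV names hypothetical)
    xfs_growfs /srv                  # grow XFS online, addressed by mount point
    df -h /srv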
[10:55:23] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1897328 (10Jdforrester-PERSONAL) Announcement now posted: https://lists.wikimedia.org/pipermail/wikimania-l/2015-December/007120.html [10:57:15] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [10:58:36] 6operations, 10hardware-requests: One YubiHSM for the SF office - https://phabricator.wikimedia.org/T122120#1897329 (10MoritzMuehlenhoff) 3NEW [11:00:57] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1897338 (10jcrespo) Both. [11:01:12] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1897339 (10MoritzMuehlenhoff) I'll bring the YubiHSM to the allhands and give it to Chris, then. [11:01:25] 6operations, 10ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1897340 (10MoritzMuehlenhoff) I'll bring the YubiHSM to the allhands and give it to Papaul, then. [11:01:38] 6operations, 10hardware-requests: One YubiHSM for the SF office - https://phabricator.wikimedia.org/T122120#1897341 (10mark) a:3RobH Approved. [11:13:34] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:01] robh: u still on duty? [11:20:12] he is but he's probably sleeping atm [11:23:45] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:32:33] paravoid: waithing sice one day. needing somone to watch job queue when doing a user rename with +50000 edits. [11:33:16] you mean you already did this rename, or that you want to do this and asking for permission? [11:33:47] asking for permission [11:34:14] ah, good [11:34:30] yeah, please refrain from doing so right now, we're experiencing some job queue issues that we're investigating [11:34:33] https://phabricator.wikimedia.org/T122069 [11:34:42] ok :) [11:41:18] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1897407 (10jcrespo) Just in case it may be helpful: P2446 [11:41:25] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1897408 (10fgiunchedi) looking again `/dev/sdc` is marked as "spare, rebuilding" and I've convinced myself that data is readable by running ``` root@restbase1007:/srv/cassa... 
[11:42:05] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on port 9042 [11:42:25] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active [11:43:47] !log restart cassandra bootstrap on restbase1004 [11:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:44:06] RECOVERY - cassandra-a service on restbase1004 is OK: OK - cassandra-a is active [11:46:45] PROBLEM - DPKG on restbase2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:34] PROBLEM - DPKG on restbase2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:45] PROBLEM - DPKG on restbase2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:45] PROBLEM - DPKG on restbase2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:55] PROBLEM - DPKG on restbase2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:48:24] PROBLEM - DPKG on restbase2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:48:53] that you godog? [11:49:35] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:51:44] RECOVERY - DPKG on restbase2003 is OK: All packages OK [11:51:54] RECOVERY - DPKG on restbase2004 is OK: All packages OK [11:52:15] RECOVERY - DPKG on restbase2001 is OK: All packages OK [11:53:25] RECOVERY - DPKG on restbase2005 is OK: All packages OK [11:53:35] RECOVERY - DPKG on restbase2002 is OK: All packages OK [11:54:07] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1897425 (10ArielGlenn) precise running on ms-{bf}e* in esams; trusty running on analytics103* in eqiad; jessie running on restbase2* in codfw. jessie... [11:54:30] 6operations, 10Deployment-Systems, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1897428 (10faidon) >>! In T120585#1885495, @bd808 wrote: > Choosing a low number in the system reserved range (100-999) does seem likely to cause a conflict with another packag... [11:54:32] !log salt packages with wmf packages precise running on ms-{bf}e* in esams; trusty running on analytics103* in eqiad; jessie running on restbase2* in codfw [11:54:35] RECOVERY - DPKG on restbase2006 is OK: All packages OK [11:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:48] sorry about the package whines, that was jessie + salt updates being ill behaved [11:56:26] paravoid: no [11:56:56] yeah it was apergos :) [11:57:53] hah! yeah just noticed the backlog after replying :) I've seen the same races with puppet when doing fleetwide package upgrades [11:58:08] I wonder if it is worth "fixing" so the two don't race [11:59:35] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [12:11:41] 6operations: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1897515 (10faidon) 3NEW [12:12:17] 6operations: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1897523 (10mark) ...one in the other data center. [12:13:52] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1897539 (10MoritzMuehlenhoff) >>! In T120831#1897425, @ArielGlenn wrote: > precise running on ms-{bf}e* in esams; trusty running on analytics103* in... 
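On the point above about manual package rollouts racing puppet runs (and tripping the dpkg Icinga check mid-install): the simplest guard is to keep puppet out of the way for the duration of the manual work. A sketch of the idea only, with hypothetical package file names; it is not a claim about how the salt packages in this log were actually deployed:

    sudo puppet agent --disable "manual salt upgrade in progress"
    sudo dpkg -i salt-common_*_all.deb salt-minion_*_all.deb
    sudo puppet agent --enable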
[12:15:24] 6operations: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1897560 (10jcrespo) 5Open>3Resolved Lag problems have been solved, although hardware renewal is still needed. [12:17:05] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:18:24] PROBLEM - DPKG on snapshot1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:18:39] ACKNOWLEDGEMENT - Restbase root url on restbase1004 is CRITICAL: Connection refused Filippo Giunchedi cassandra bootstrapping [12:18:39] ACKNOWLEDGEMENT - restbase endpoints health on restbase1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.160, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi cassandra bootstrapping [12:21:14] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:21:26] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 110950 MB (3% inode=99%) [12:21:36] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1897598 (10faidon) 3NEW [12:21:54] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15184 bytes in 0.006 second response time [12:22:25] PROBLEM - DPKG on snapshot1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:24:34] RECOVERY - DPKG on snapshot1002 is OK: All packages OK [12:24:54] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [12:25:34] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:28:25] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [12:29:49] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1897693 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [12:32:29] those snapshot whines were me too [12:37:17] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1897761 (10ArielGlenn) >>! In T120831#1897539, @MoritzMuehlenhoff wrote: ... > I have seen the same problem with debdeploy: The debdeploy-minion pack... [12:42:35] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1897803 (10ArielGlenn) for the record, updated also on snapshot1001,2 and on restbase1003 while testing the installation process. 
[12:51:05] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:55:35] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:03:44] (03PS1) 10Jcrespo: Setting default character set as utf8mb4 for toolsdb [puppet] - 10https://gerrit.wikimedia.org/r/260558 (https://phabricator.wikimedia.org/T122148) [13:04:15] (03CR) 10Jcrespo: [C: 032] Setting default character set as utf8mb4 for toolsdb [puppet] - 10https://gerrit.wikimedia.org/r/260558 (https://phabricator.wikimedia.org/T122148) (owner: 10Jcrespo) [13:05:35] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:11:23] (03PS1) 10Jcrespo: Correcting typo on tooldb client configuration [puppet] - 10https://gerrit.wikimedia.org/r/260559 (https://phabricator.wikimedia.org/T122148) [13:12:27] (03CR) 10Jcrespo: [C: 032] Correcting typo on tooldb client configuration [puppet] - 10https://gerrit.wikimedia.org/r/260559 (https://phabricator.wikimedia.org/T122148) (owner: 10Jcrespo) [13:14:17] (03PS1) 10ArielGlenn: trusty 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260560 [13:14:19] (03PS1) 10ArielGlenn: trusty 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260561 [13:14:21] (03PS1) 10ArielGlenn: make ping_on_rotate work without minion data cache [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260562 [13:14:23] (03PS1) 10ArielGlenn: 2014.7.5 trusty, backport patches for singleton SAuth class [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260563 [13:14:25] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1ubuntu1+wm1 [debs/salt] (trusty) - 10https://gerrit.wikimedia.org/r/260564 [13:19:26] akosiaris: so, the citoid/zotero issue hasn't happened again [13:19:48] might have been a temp failure on the NIH side [13:20:45] mobrovac: ccccccevibtjhgvnejibuetdkvhcrufbnjdgidhjjfrf [13:20:54] ah... that yubikey thing is killing me [13:21:03] disabling [13:21:09] ccccccevibtjudcklulnhhihrtevjujvjkekigkjiifk [13:21:22] damned [13:21:27] ok done on that side [13:21:34] thank god I have not yet configured it [13:21:36] wth? [13:21:40] yubikey [13:21:44] it's the nano [13:21:54] and it's surprisingly easy to activate it [13:21:59] you just touch it [13:22:07] it comes your way btw in the next quarter [13:22:12] hehe [13:22:13] ah ok [13:22:35] so yeah I noticed too citoid has not complained yet [13:22:59] not sure what it was after all [13:24:01] i'm inclined to think it was a genuine server time-out after all [13:29:52] (03PS1) 10Merlijn van Deen: toollabs: install goaccess on webproxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/260566 (https://phabricator.wikimedia.org/T121233) [13:31:44] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:31:45] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:32:25] jynus: ^^^ is you [13:32:30] (03CR) 10coren: "Note inline, there seems to be a stray change in the patch?" 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [13:32:45] done [13:33:36] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:33:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:35:48] (03PS1) 10ArielGlenn: precise 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260568 [13:35:50] (03PS1) 10ArielGlenn: precise 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260569 [13:35:52] (03PS1) 10ArielGlenn: make ping_on_rotate work without minion data cache [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260570 [13:35:54] (03PS1) 10ArielGlenn: 2014.7.5 precise, backport patches for singleton SAuth class [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260571 [13:35:56] (03PS1) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1precise1+wm1 [debs/salt] (precise) - 10https://gerrit.wikimedia.org/r/260572 [13:43:23] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1898246 (10MoritzMuehlenhoff) Updated implementation details: **Objective** The important security property to gain is protection against compromised notebooks; endpoint security is one of the biggest risks fo... [13:45:30] akosiaris: when you enable the key by error, use immediately another pass to invalidate those disclosed [13:46:11] akosiaris: there are in sequence, so you can't use an old OTP as soon as you submit a new one to the validation server [13:47:07] akosiaris: but they are not time constrained, so if you let in clear an OTP, you can reuse it as long as no other afterwards has been used [13:50:54] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago [13:53:06] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago [13:53:46] (03CR) 10Florianschmidtwelzow: "I'm not sure, what should be tested, but is T121834 & T121666 blocking it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) (owner: 10TTO) [13:56:35] Dereckson: it's not enabled anywhere, and fully uninitialized. And most importantly, not gonna be used in that configuration (OTP) anyway [13:57:24] godog: did you set up the Jessie install images on Carbon? Or was it moritzm? [13:58:29] andrewbogott: ? [13:58:34] what do you mean ? [13:58:36] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago [13:58:49] akosiaris: I’m just trying to make sense out of the weird dir structure there. [13:58:57] lol [13:59:04] And also I want to add a ‘default’ cfg file and looking for confirmation that that won’t break anything [13:59:09] andrewbogott: no, wasn't me [13:59:34] I’ll send an email if no one will admit to being an authority :) [13:59:44] default ? ah, as in we won't have the mac address ? 
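Dereckson's point about burning a leaked OTP can be seen directly against the YubiCloud validation API: the server tracks the key's usage counters, so once one newer OTP from the same key has been accepted, anything disclosed earlier is refused. A sketch with a placeholder client id and placeholder OTPs:

    # validate an OTP generated on purpose right after the accidental paste
    curl -s 'https://api.yubico.com/wsapi/2.0/verify?id=12345&nonce=0123456789abcdef&otp=<fresh-otp>'
    #   ... status=OK
    # any OTP that leaked into the channel earlier now replays as stale
    curl -s 'https://api.yubico.com/wsapi/2.0/verify?id=12345&nonce=fedcba9876543210&otp=<leaked-otp>'
    #   ... status=REPLAYED_OTP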
[13:59:57] I assume you refer to tftp [14:00:36] akosiaris: right, installing from the labs subnet which uses a different dhcp setup [14:00:44] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:01:05] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:01:12] hmmm [14:01:19] akosiaris: jessie-installer/pxelinux.cfg/default fixes my issue [14:01:26] but that might make other installs succeed that we would prefer to have fail [14:01:38] yeah, faidon did those for sure [14:02:02] ok — paravoid, any objection to my creating a fallback default there? [14:03:35] (03CR) 10Alexandros Kosiaris: "puppet compiler cherry-picks (no merge/rebase) and this specific patchset does not cleanly merge. This is an issue indeed reported by a co" [puppet] - 10https://gerrit.wikimedia.org/r/220085 (owner: 10Alexandros Kosiaris) [14:04:42] a fallback what? [14:05:15] (03PS1) 10KartikMistry: CX: Use config.yaml to read registry [puppet] - 10https://gerrit.wikimedia.org/r/260575 [14:05:40] paravoid: I want to add /srv/tftpboot/jessie-installer/debian-installer/amd64/pxelinux.cfg/default [14:05:50] on carbon [14:06:19] why? [14:06:54] ‘Because otherwise I will have to bug Faidon about how to make this work properly’ [14:07:08] haha [14:07:14] paravoid: that’s the last thing that it took to get my metal-in-labs-vm to boot properly [14:07:33] It was otherwise blocked failing to find an acceptable config [14:07:44] sec [14:07:45] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1898283 (10akosiaris) 5Open>3Resolved Great. Thanks @smalyshev. Resolving this [14:07:49] so /srv/tftpboot/jessie-installer/debian-installer/amd64/pxelinux.cfg/default exists already, right? [14:08:04] that's the stock one from debian [14:08:35] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:08:48] paravoid: when we pointed the install at that subdir it failed for lack of ldlinux.c32 [14:09:11] which is present in the jessie-installer/ tree [14:09:13] yeah you're not supposed to use the subdir [14:09:23] but you probably don't want that "default" file anyway [14:09:34] because that one is the stock debian one, not an autoinstall one [14:09:39] it has menus and prompts and stuff like that [14:09:50] while I'm guessing you want to install automatically, right? [14:09:50] akosiaris: https://gerrit.wikimedia.org/r/260575 - I'll apply on beta. but you can review it :) [14:09:58] would be nice :) [14:10:02] heh [14:10:10] The ‘default’ that I created was a copy of ttyS1-115200 [14:10:16] it prompted me for a few things but mostly worked [14:10:33] where did you create it? [14:10:48] /srv/tftpboot/jessie-installer/debian-installer/amd64/pxelinux.cfg/default [14:10:49] /srv/tftpboot/jessie-installer/pxelinux.cfg/default I guess? [14:10:58] wait, sorry, bad paste... [14:10:59] yes, what yous aid [14:11:01] *said [14:11:04] right [14:11:25] well a "default" one is wrong here, we have it split per serial port/baud speed due to different hardware [14:11:48] yeah, it’s definitely a hack based on knowing in advance which one would work for my particular server [14:12:01] when it’s working correctly how does that matching happen [14:12:01] ? 
[14:12:17] we have lists of servers in puppet [14:12:25] for each of these, we define a different config file option in DHCP [14:12:41] # Dell PowerEdge RXXX Line & C2100s [14:12:41] group { [14:12:41] option pxelinux.configfile "pxelinux.cfg/ttyS1-115200"; [14:12:44] include "/etc/dhcp/linux-host-entries.ttyS1-115200"; [14:12:47] } [14:12:59] yep, I’m familiar with how it’s configured but not with how the proper file is passed in... [14:13:09] that "option pxelinux.configfile" is a DHCP option [14:13:15] oh [14:13:24] hm… ok, maybe I can get dnsmasq to pass that [14:13:24] option 209 specifically [14:13:32] that would at least confine the stupid hack to labs [14:14:44] where do you define the pxe options right now? [14:14:50] dnsmasq pxe options I mean [14:15:00] https://gerrit.wikimedia.org/r/#/c/259788/ [14:17:11] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898297 (10chasemp) @dzahn sure man go ahead thanks [14:17:14] ok, I think I see how to do it (at least as an unconfigurable one-off) [14:18:40] dhcp-option-force=209,pxelinux.cfg/ttyS1-115200 [14:18:47] yep :) [14:18:52] cool :) [14:18:57] * andrewbogott tries [14:18:58] thanks [14:20:20] btw, hitting carbon for tftp turned out to just work. We’d discussed issues with forwarding and such but that turns out to not be needed. I guess that means that the carbon tftp server is wide open, or at least open to all internal networks. [14:20:49] yeah [14:20:55] a bit worrying really [14:21:04] yeah — good news/bad news [14:21:28] want me to open a ticket for that? [14:21:47] I guess [14:23:28] (03PS2) 10Rush: Collect suggest stats and specific search groups [puppet] - 10https://gerrit.wikimedia.org/r/259767 (owner: 10DCausse) [14:23:45] paravoid: :) [14:25:27] 6operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1898315 (10Andrew) 3NEW [14:26:27] passing in that extra arg and removing my hacked default file works fine [14:29:07] (03CR) 10Rush: [C: 032] "collection times seems very fine" [puppet] - 10https://gerrit.wikimedia.org/r/259767 (owner: 10DCausse) [14:29:46] great [14:32:54] (03PS14) 10Andrew Bogott: nova-network: have dnsmasq advertise pxe-boot options [puppet] - 10https://gerrit.wikimedia.org/r/259788 [14:34:42] (03Abandoned) 10Rush: diamond: fix up dependencies [puppet] - 10https://gerrit.wikimedia.org/r/254287 (owner: 10Merlijn van Deen) [14:35:44] (03PS15) 10Andrew Bogott: nova-network: have dnsmasq advertise pxe-boot options [puppet] - 10https://gerrit.wikimedia.org/r/259788 [14:37:37] (03CR) 10Andrew Bogott: [C: 032] "It goes!" [puppet] - 10https://gerrit.wikimedia.org/r/259788 (owner: 10Andrew Bogott) [14:37:45] paravoid: carbon is on public network right? [14:37:48] why wouldn't it work? [14:38:34] iptables for starters :) [14:38:44] but apparently we allow ALL_NETWORKS for it [14:39:36] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: puppet fail [14:41:07] (03PS1) 10Andrew Bogott: Hardcode labs tftp server for now. [puppet] - 10https://gerrit.wikimedia.org/r/260577 [14:42:21] (03CR) 10Andrew Bogott: [C: 032] Hardcode labs tftp server for now. 
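The two snippets above are the whole mechanism: DHCP option 209 (pxelinux.configfile) tells the PXE ROM which file under pxelinux.cfg/ to fetch, production dhcpd picks it per hardware type from the host lists in puppet, and dnsmasq on the labs side only has to advertise the same value. A quick way to confirm the file is actually fetchable from carbon; the hostname and TFTP path follow the discussion above but are assumptions, not re-verified:

    # dnsmasq equivalent of dhcpd's  option pxelinux.configfile "pxelinux.cfg/ttyS1-115200";
    #   dhcp-option-force=209,pxelinux.cfg/ttyS1-115200
    # sanity-check over TFTP (curl speaks tftp:// on most builds):
    curl -s tftp://carbon.wikimedia.org/jessie-installer/pxelinux.cfg/ttyS1-115200 | head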
[puppet] - 10https://gerrit.wikimedia.org/r/260577 (owner: 10Andrew Bogott) [14:43:36] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:50:22] akosiaris, the database re-import finished with the new settings [14:50:46] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 74061 MB (3% inode=99%) [14:52:13] (03PS5) 10Xqt: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [14:52:32] (03CR) 10Xqt: [C: 031] sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [14:52:45] akosiaris: you are ok with me adding the codfw IPs and names for eventbus, right? just not the cluster configs, since it doesn't exist yet? [14:53:20] ottomata: in the DNS repo ? yes, please upload a change for it [14:53:33] ja DNS, and hiera [14:53:37] e.g. https://gerrit.wikimedia.org/r/#/c/260047/1/hieradata/role/codfw/eventbus/eventbus.yaml [14:53:50] and https://gerrit.wikimedia.org/r/#/c/260047/1/hieradata/common/lvs/configuration.yaml [14:54:14] ottomata: no, not the LVS config [14:54:28] ok [14:55:00] yurik: great!!! [14:55:20] yurik: so, which date ? yesterday's ? [14:55:26] need to populate state.txt [14:55:42] akosiaris, i downloaded planet-151214.osm.pbf [14:55:54] so that's over a week ago [14:56:33] akosiaris: hm, ok so. cluster name vs service name. [14:56:40] (03PS1) 10Andrew Bogott: Remove duplicate definition of labs-project for metal instance [puppet] - 10https://gerrit.wikimedia.org/r/260579 [14:56:51] yurik: ok. tell me when we want to enable sync though, ok ? [14:57:03] not sure if there is a real 'cluster' for this now. it is currently colocated on the new kafka boxes [14:57:38] (03CR) 10jenkins-bot: [V: 04-1] Remove duplicate definition of labs-project for metal instance [puppet] - 10https://gerrit.wikimedia.org/r/260579 (owner: 10Andrew Bogott) [14:57:44] ottomata: ok, that's fine. as long as we name the "cluster" uniquely that's fine [14:58:46] ok [14:58:56] also, am looking for proxyfetch docs, and am not finding [14:58:59] akosiaris, i think we can already do it, possibly with an extra day overlap just to be sure? MaxSem ? [14:59:19] (03PS2) 10Andrew Bogott: Remove duplicate definition of labs-project for metal instance [puppet] - 10https://gerrit.wikimedia.org/r/260579 [14:59:32] ottomata: look at the rest of the configuration, it's basically a check that does a URL fetch from pybal to make sure the service is working fine [14:59:36] ok yeah [14:59:47] was just wondering if there was more info, but cool [14:59:57] since from what I gather it's a nodejs app, I expect it will have an endpoint that we can query [15:00:03] oh wait [15:00:06] it's eventlogging ? [15:00:20] so, it's an HTTP app anyway [15:00:34] need an endpoint that can provide proof that the app is working fine [15:00:43] akosiaris, yurik sure - but I've got 2 hours of meetings [15:00:46] ottomata: we can do without it but it's preferably [15:01:20] yurik: yes, that's mostly fine on my side. got various stuff so I expect the most possible time to do it is tomorrow european morning [15:01:26] but if you are ready, I don't have much to do [15:01:38] just set an ensure to present and populate a state.txt file [15:01:45] akosiaris: does the endpoint just need to return 2xx somehting? 
[15:01:46] s/possible/probably/ [15:01:54] ottomata: yes [15:01:55] 200 [15:02:08] (03PS2) 10Zfilipin: RuboCop: fixed Style/CaseIndentation offense [puppet] - 10https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) [15:02:16] ottomata: if you don't have any already, well, it will be nice to add [15:02:33] cause anyway we can to be able to have some detailed monitoring of the thing [15:02:38] (03CR) 10Zfilipin: RuboCop: fixed Style/CaseIndentation offense (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:02:56] services team can help greatly with that... there is even a swagger based spec checker that can do a lot of that [15:02:58] ja can add anything [15:03:00] oo [15:03:11] not sure how that applies to python though [15:03:16] we do have a wip swagger spec for the api [15:03:27] it isn't used to generate anything, but it should map to the api [15:03:41] ottomata: that's great! [15:03:56] yurik: MaxSem: just tell me at some point if you want me to enable the sync process [15:04:01] mobrovac: yt? [15:04:06] !log upgraded cassandra on maps-test2004 [15:04:09] MaxSem, not sure if we need you for this - akosiaris can simply turn the sync back on, and specify the the day before 14th [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:23] akosiaris, okay. I'd better be around though so not now :P [15:04:38] !log enabling puppet on labtestcontrol2001 [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:44] ottomata: y [15:04:55] MaxSem: ok [15:06:12] MaxSem, akosiaris - in two hours ok? [15:06:40] yurik: hmm, can't promise I 'll make it for that. I 'll try to [15:06:45] !log labtestcontrol2001 - puppet had not been running for a while, a bunch of changes have been applied incl. keys and passwords [15:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:55] !log restarting and reconfiguring mysql at dbstore2001 [15:06:57] (03PS3) 10Andrew Bogott: Remove duplicate definition of labs-project for metal instance [puppet] - 10https://gerrit.wikimedia.org/r/260579 [15:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:00] (03PS1) 10Ottomata: Add eventbus.svc IPs [dns] - 10https://gerrit.wikimedia.org/r/260580 [15:07:33] mobrovac: the api has changed a bit since your first swagger patch, we never revisited it. I mainly just simplified some things. [15:07:46] akosiaris: says yall have a good swagger / lvs monitoring status thing I should look into [15:07:59] so that pybal has a good place to check if a node is up [15:08:17] ottomata: it's service_checker, it's in the puppet repo [15:08:43] it will more or less follow the API and check basic stuff [15:08:53] which is useful as you get monitoring for multiple endpoints [15:09:02] akosiaris: test_checker? 
[15:09:02] ottomata: modules/service/files/checker.py [15:09:17] k [15:09:23] oh thats the test :p [15:09:31] ottomata: for that, we'd need to add x-amples stanzas to the swagger spec [15:09:35] (03CR) 10Andrew Bogott: [C: 032] Remove duplicate definition of labs-project for metal instance [puppet] - 10https://gerrit.wikimedia.org/r/260579 (owner: 10Andrew Bogott) [15:10:12] ottomata: also, the service needs to expose the spec to the monitoring script [15:10:16] aye [15:10:17] k [15:10:39] mobrovac: i'll update your patch with the latest api changes, and also make it exposed at /?spec (right?) [15:10:46] then will ask you more about x-amples, etc. [15:10:58] ottomata: e.g. en.wikipedia.org/api/rest_v1/?spec (look for x-amples stanzas) [15:11:02] ottomata: yup, exactly [15:11:51] hm ok [15:12:18] (03PS1) 10Andrew Bogott: The field I want is 'project' not 'project-name' [puppet] - 10https://gerrit.wikimedia.org/r/260581 [15:12:19] andrewbogott: fyi. success about the paging and labtestcontrol2001 and stuff [15:12:26] great! [15:12:41] andrewbogott: by that i mean: labtestcontrol and labcontrol are both in icinga, but the 'test' one does not page, and the production one does [15:12:47] and the difference is only in hiera [15:13:00] akosiaris, ah, one more thing - you will need to move the /srv/temp/nodes.bin to some magic location and set appropriate rights on it [15:13:06] so you can switch it globally for a host and not touch service definitions in modules [15:13:29] mutante: good morning Icinga guru :-} [15:14:02] (03CR) 10Andrew Bogott: [C: 032] The field I want is 'project' not 'project-name' [puppet] - 10https://gerrit.wikimedia.org/r/260581 (owner: 10Andrew Bogott) [15:14:07] mutante: I could use a monitoring probe got some time? [15:14:15] yurik: yeah I noticed that. thanks for reminding me [15:14:51] andrewbogott: i had to delete labtestcontrol2001 from stored resources on puppetmaster, which makes it disappear from icinga completely, then enable puppet on labtestcontrol2001, it gets re-added, but this time without the "sms" contact group [15:15:05] yeah, that seems fine [15:15:16] it caught up with a bunch of changes now [15:15:20] and puppet is enabled [15:16:02] hashar: good morning, what are you looking for? [15:16:41] mutante: gotta make sure Jenkins on gallium listen on port 8888 (zeromq server). I crafted https://gerrit.wikimedia.org/r/#/c/257568/2/manifests/role/ci.pp,cm [15:17:01] the command works fine on gallium, but I am not sure role:: is the best place to add such probe :D [15:17:02] akosiaris: see above.. "do_paging = false" in hieradata disables paging for a host without having to add special cases to puppet code [15:18:04] mutante: heh ? [15:19:16] akosiaris: grep -r "do_paging" * and you will see in hieradata and modules/monitoring [15:21:28] we can now just set a test host not to page in hiera, without having to change monitor checks set to critical in roles that are used on prod and on test hosts [15:21:32] hashar: looking now [15:21:36] mutante: yes I know, it's the change we were discussing together [15:21:49] I am the one the proposed the "do_paging" name, no ? [15:21:53] akosiaris: right, but now i am sure it works, and before i wasn't [15:21:57] ah [15:22:01] I always was ;-) [15:22:08] i had to kill the host from stored configs [15:22:13] and puppet wasnt running [15:22:16] mutante: you can get to it after you the paging stuff. 
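A very rough sketch of the spec-driven checking idea outlined above. The real tool is modules/service/files/checker.py in the puppet repo and this is not it: the x-amples layout is assumed, request headers and bodies from the examples are ignored, and templated paths are simply skipped.

import json
import urllib.error
import urllib.request

def check_spec(base_url, timeout=5):
    # Fetch the swagger spec exposed at /?spec, then replay each x-amples
    # example and compare the HTTP status with the expected one.
    with urllib.request.urlopen(base_url + '/?spec', timeout=timeout) as resp:
        spec = json.loads(resp.read().decode('utf-8'))
    failures = []
    for path, methods in spec.get('paths', {}).items():
        if '{' in path or not isinstance(methods, dict):
            continue                      # skip templated paths in this sketch
        for method, operation in methods.items():
            if not isinstance(operation, dict):
                continue
            for example in operation.get('x-amples', []):
                wanted = example.get('response', {}).get('status', 200)
                req = urllib.request.Request(base_url + path, method=method.upper())
                try:
                    with urllib.request.urlopen(req, timeout=timeout) as r:
                        got = r.getcode()
                except urllib.error.HTTPError as err:
                    got = err.code
                if got != wanted:
                    failures.append((method.upper(), path, got, wanted))
    return failures

if __name__ == '__main__':
    # Example target taken from the conversation above.
    for failure in check_spec('https://en.wikipedia.org/api/rest_v1'):
        print('FAIL %s %s: got %s, wanted %s' % failure)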
I am still around for an hour and a half :-} [15:22:17] so it looked on neon like it doesnt work [15:22:19] see the level of trust I got ? ;) [15:22:25] heh :) [15:22:40] if puppet wasn't running that would explain it [15:23:00] well, i just kept grepping the puppet_services.cfg that gets generated and the 'sms' group would not go away [15:23:20] and now it is [15:23:38] !log upgrade cassandra on maps-test2003 [15:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:47] and i set the same for the other "labtest-*' and restbase-test but not pybal-test [15:23:54] great [15:33:19] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898391 (10Dzahn) 5Open>3Resolved after re-enabling puppet on labtestcontrol2001 and running it on neon, it now works as intended. the... [15:34:13] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 7Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898393 (10Dzahn) [15:39:38] akosiaris: actually, do you have an opinion on hashar's question? it's about where to put nrpe::monitor_service checks. when i look what we do elsewhere, unsurprisingly we have it all over the place, both in role classes and inside modules. i guess my opinion is that if it doesn't have parameters that change (usually) it's fine in the module, but if it has config that we might change like port numbers or strings to check for then in the rol [15:40:07] you have cut :D [15:40:36] i guess my opinion is that if it doesn't have parameters that change (usually) it's fine in the module, but if it has config that we might change like port numbers or strings to check for then in the role? [15:41:34] hashar: i guess it's correct in the role because the port 8888 is a config thing [15:41:45] but we certainly do it both ways [15:44:27] if port 8888 was a parameter of a class ci::master though and you'd use that variable in the check here, then it could be in the module [15:46:20] mutante: so the patch looks fine ? :-:} [15:46:31] did gerrit get upgraded? [15:46:58] yea, the command line works too [15:47:24] (03CR) 10Dzahn: [C: 031] "[gallium:~] $ /usr/lib/nagios/plugins/check_tcp -H 127.0.0.1 -p 8888 --timeout=2" [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) (owner: 10Hashar) [15:47:30] (03PS3) 10Dzahn: contint: monitor Jenkins has a ZMQ publisher [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) (owner: 10Hashar) [15:48:12] (03CR) 10Dzahn: [C: 032] contint: monitor Jenkins has a ZMQ publisher [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) (owner: 10Hashar) [15:48:20] \/ [15:51:47] it created the nrpe config on gallium.. now just the run on neon [15:52:01] meanwhile: restbase servers showing issues in icinga [15:52:24] restbase1003 - disk space, restbase1008 - disk space, restbase1004 - cassandra cql refused [15:52:37] !log restbase1003 - disk space, restbase1008 - disk space, restbase1004 - cassandra cql refused [15:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:07] (03PS1) 10DCausse: Really collect groups stats [puppet] - 10https://gerrit.wikimedia.org/r/260586 [15:53:13] !log kafka1001,1002 - crit - eventlogging not running (?) 
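A stand-in illustration of the port probe being reviewed above; the production check uses the stock /usr/lib/nagios/plugins/check_tcp plugin, not this script. Exit codes follow the usual Nagios convention (0 OK, 2 CRITICAL).

import socket
import sys

def check_tcp(host='127.0.0.1', port=8888, timeout=2):
    # Jenkins' ZMQ event publisher is expected to be listening on port 8888.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print('TCP OK - port %d is accepting connections' % port)
            return 0
    except OSError as exc:
        print('TCP CRITICAL - %s' % exc)
        return 2

if __name__ == '__main__':
    sys.exit(check_tcp())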
[15:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:03] (03PS2) 10Rush: Really collect groups stats [puppet] - 10https://gerrit.wikimedia.org/r/260586 (owner: 10DCausse) [15:59:28] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1898413 (10JAllemandou) Same issuer for me: I'm not root on AQS machines :( Either @akosiaris or @ottomata ? [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151222T1600). [16:00:22] ohi [16:00:37] nothing to swat [16:01:38] hashar: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gallium&service=jenkins_zmq_publisher [16:01:49] (03PS1) 10Andrew Bogott: Restart nslcd after ldap.conf or nsswitch get changed. [puppet] - 10https://gerrit.wikimedia.org/r/260588 [16:02:15] mutante: awesome thank you very much [16:02:46] (03CR) 10jenkins-bot: [V: 04-1] Restart nslcd after ldap.conf or nsswitch get changed. [puppet] - 10https://gerrit.wikimedia.org/r/260588 (owner: 10Andrew Bogott) [16:03:39] hashar: this ticket i clicked resolved but did not move on workboard, yesterday i moved on workboard but left it open for you.. heh. [16:04:08] sometimes feels like doing the same thing twice [16:04:09] (03PS2) 10Andrew Bogott: Restart nslcd after ldap.conf or nsswitch get changed. [puppet] - 10https://gerrit.wikimedia.org/r/260588 [16:04:47] but then it's good if i claim it's done and you confirm .. shrug [16:04:50] (03CR) 10coren: [C: 031] "Afaict, that should suffice." [puppet] - 10https://gerrit.wikimedia.org/r/260588 (owner: 10Andrew Bogott) [16:05:14] mutante: as long as a ticket ends up resolved. We are fine :-} [16:05:39] 'k :) [16:06:12] (03PS3) 10Andrew Bogott: Restart nslcd after ldap.conf or nsswitch get changed. [puppet] - 10https://gerrit.wikimedia.org/r/260588 [16:06:28] (03PS2) 10Dzahn: add piwik.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/260525 (https://phabricator.wikimedia.org/T103577) [16:06:34] (03CR) 10Dzahn: [C: 032] add piwik.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/260525 (https://phabricator.wikimedia.org/T103577) (owner: 10Dzahn) [16:08:44] do you know how hard it is to type a word that ends in "wik" but not type the final "i"... [16:09:05] piwiki a million times [16:09:15] (03CR) 10Andrew Bogott: [C: 032] Restart nslcd after ldap.conf or nsswitch get changed. [puppet] - 10https://gerrit.wikimedia.org/r/260588 (owner: 10Andrew Bogott) [16:09:18] ori: ^ [16:09:38] mutante: or how hard it is to spell "snack" correctly [16:09:42] * aude used to saying snak [16:09:45] :P [16:10:55] hehee [16:13:15] (03PS2) 10Dzahn: [WIP] Add piwik module and role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) (owner: 10Ori.livneh) [16:13:36] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1898439 (10Nuria) @Dzahn: thank you. Let us know when we have a box so we can deploy . [16:13:44] (03CR) 10Dzahn: "PS2: fixed lint errors (tabs)" [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) (owner: 10Ori.livneh) [16:14:10] jynus: hola. still busy or should we proceed with the tokudb changes? 
[16:14:14] (03PS1) 10Ottomata: Fix icinga proc warning for eventlogging-service on kafka100[12] [puppet] - 10https://gerrit.wikimedia.org/r/260592 [16:15:01] !log upgrade cassandra on maps-test2002 [16:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:40] !log upgrade cassandra on maps-test2001 [16:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:02] (03CR) 10Dzahn: [C: 031] Fix icinga proc warning for eventlogging-service on kafka100[12] [puppet] - 10https://gerrit.wikimedia.org/r/260592 (owner: 10Ottomata) [16:17:03] (03PS3) 10Rush: Really collect groups stats [puppet] - 10https://gerrit.wikimedia.org/r/260586 (owner: 10DCausse) [16:17:36] nuria, I can [16:17:50] (03PS1) 10Alex Monk: Update wikimania redirects to 2016 [puppet] - 10https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) [16:17:51] do you want me to start the alter table? [16:17:53] jynus: ok, i am going to blacklist that table so data doesn't flow into it [16:18:04] jynus: will sent you CR [16:18:10] lt's move to #wikimedia-databases [16:18:20] (03CR) 10Dzahn: "root@kafka1001:~# /usr/lib/nagios/plugins/check_procs -c 1:1 -C python -a '/srv/deployment/eventlogging/eventbus/bin/eventlogging-service " [puppet] - 10https://gerrit.wikimedia.org/r/260592 (owner: 10Ottomata) [16:19:18] ottomata: ^ works! thank you [16:19:21] (03CR) 10Ottomata: [C: 032] Fix icinga proc warning for eventlogging-service on kafka100[12] [puppet] - 10https://gerrit.wikimedia.org/r/260592 (owner: 10Ottomata) [16:19:34] (03PS4) 10Rush: Really collect groups stats [puppet] - 10https://gerrit.wikimedia.org/r/260586 (owner: 10DCausse) [16:20:11] hey ori just a thought: next quarter we plan on migrating all eventlogging stuff to jessie/systemd [16:20:17] do you think we should do python3 too? [16:20:25] everything *should just work* [16:20:46] RECOVERY - Check that eventlogging-service-eventbus is running on kafka1001 is OK: PROCS OK: 1 process with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [16:21:19] (03CR) 10Rush: [C: 032] Really collect groups stats [puppet] - 10https://gerrit.wikimedia.org/r/260586 (owner: 10DCausse) [16:21:25] /dev/mapper/restbase1003--vg-var 2.7T 2.7T 3.9G 100% /var [16:23:34] (03PS1) 10Nuria: Blacklisting (temporarily) MobileWikiAppShareAFact schema [puppet] - 10https://gerrit.wikimedia.org/r/260595 (https://phabricator.wikimedia.org/T120187) [16:23:38] mutante: yeah it should recover by itself when compactions finish, and in two days anyways when 1004 is fully in service [16:23:55] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1898461 (10akosiaris) I 've upgraded the cassandra on maps-test200{1,2,3,4}.codfw.wmnet and everything seems fine. Btw, I must point out that somehow the cassandra devs managed to break the cqlsh tool fu... [16:24:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:24:30] godog: ah, thank you. so i see that on 2 servers, 1003 and 1008. ok. what about 1004 saying that CQL refused connection though [16:24:54] ottomata: mysql driver doesn't work in python3 last time we checked [16:25:09] jynus: what is your gerrit user? 
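A rough Python equivalent of the check_procs call quoted in the review comment above: exactly one process whose command line matches the eventlogging-service invocation should be running. This is only an illustration of what the Nagios plugin asserts, not the plugin itself.

import subprocess
import sys

PATTERN = 'eventlogging-service @/etc/eventlogging.d/services/eventbus'

def count_matching(pattern):
    # pgrep -f matches against the full command line; no match means empty output
    result = subprocess.run(['pgrep', '-f', pattern],
                            capture_output=True, text=True)
    return len(result.stdout.split())

if __name__ == '__main__':
    found = count_matching(PATTERN)
    if found == 1:
        print('PROCS OK: 1 process matching %s' % PATTERN)
        sys.exit(0)
    print('PROCS CRITICAL: %d processes matching %s' % (found, PATTERN))
    sys.exit(2)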
[16:25:51] mm [16:26:03] mutante: mh I thought I acked that, anyways it is because 1004 is bootstrapping atm and doesn't listen on cql [16:26:10] Jcrespo [16:26:21] ACKNOWLEDGEMENT - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 1932 MB (0% inode=99%): Filippo Giunchedi 1004 bootstrapping [16:26:22] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.192:9042 on restbase1004 is CRITICAL: Connection refused Filippo Giunchedi 1004 bootstrapping [16:26:22] ACKNOWLEDGEMENT - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 31086 MB (1% inode=99%): Filippo Giunchedi 1004 bootstrapping [16:26:24] godog: gotcha, thank you [16:27:07] RECOVERY - Check that eventlogging-service-eventbus is running on kafka1002 is OK: PROCS OK: 1 process with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [16:27:12] mutante: np, I agree it is confusing [16:28:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:29:30] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1898469 (10fgiunchedi) [16:30:28] I am creating a new delayed slave on codfw, which would be the most effective way to transmit that? [16:30:48] (icinga already knows about that) [16:31:02] OK slave_sql_lag Slave_IO_Running: Yes, Slave_SQL_Running: No, (no error: intentional) [16:32:37] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1898476 (10Physikerwelt) See https://www.mediawiki.org/wiki/Extension:Math#Configuration you can use the https://api.formulasearchengine.com instead [16:33:48] PROBLEM - cassandra service on restbase1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:33:52] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1898478 (10Physikerwelt) [16:34:46] PROBLEM - cassandra CQL 10.64.32.159:9042 on restbase1003 is CRITICAL: Connection refused [16:35:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "technically correct, but needs some coordination with a new version of cxserver being deployed, hence a -1. Otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/260575 (owner: 10KartikMistry) [16:36:09] (03CR) 10Jcrespo: [C: 031] "Ok with it, although you may want to mention the table on the comment, just in case it gets forgorten?" [puppet] - 10https://gerrit.wikimedia.org/r/260595 (https://phabricator.wikimedia.org/T120187) (owner: 10Nuria) [16:38:56] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:27] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:57] RECOVERY - cassandra service on restbase1003 is OK: OK - cassandra is active [16:40:16] RECOVERY - Disk space on restbase1003 is OK: DISK OK [16:40:33] !log bounce cassandra on restbase1003 [16:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:16] PROBLEM - configured eth on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:23] checks mw1144 [16:41:46] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:41:47] PROBLEM - Check size of conntrack table on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:47] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:54] !log converting dbstore2001 (delayed slave) into an actual delayed slave, adding redundancy to dbstore1002 [16:41:57] PROBLEM - nutcracker process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:07] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:26] PROBLEM - Disk space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:27] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:27] PROBLEM - salt-minion processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:38] !log powercycling crashed mw1144 [16:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:45] mutante, just by the timout, that looks like a job runner's produced OOM (if not hw/network issue) [16:42:46] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:47] RECOVERY - cassandra CQL 10.64.32.159:9042 on restbase1003 is OK: TCP OK - 0.006 second response time on port 9042 [16:44:24] !log bounce cassandra on restbase1004, restart bootstrap [16:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:37] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [16:45:16] RECOVERY - configured eth on mw1144 is OK: OK - interfaces up [16:45:38] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures [16:45:47] RECOVERY - Check size of conntrack table on mw1144 is OK: OK: nf_conntrack is 0 % full [16:45:47] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed [16:45:56] RECOVERY - nutcracker process on mw1144 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:46:07] RECOVERY - DPKG on mw1144 is OK: All packages OK [16:46:17] RECOVERY - Disk space on mw1144 is OK: DISK OK [16:46:26] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:46:27] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:46:37] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [16:46:56] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 64653 bytes in 1.281 second response time [16:47:02] and I am wrong, # mw1120-1148 are api apaches [16:47:27] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.199 second response time [16:47:59] (03PS4) 10Dzahn: snapshot: mv wikidatadumps classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260186 [16:49:01] jynus: hhvm invoked oom-killer right before that [16:49:08] i did not have a usable console [16:49:22] well, there it is again [16:49:48] RECOVERY - Disk space on restbase1008 is OK: DISK OK [16:51:05] well, if it is willing more than job runners that is bad news [16:51:06] (03CR) 10Hoo man: snapshot: Deploy DCAT from operations/dumps/dcat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [16:51:09] *killing 
[16:51:14] (03PS2) 10Hoo man: snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) [16:52:13] (03CR) 10jenkins-bot: [V: 04-1] snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [16:53:13] wtf [16:53:49] hoo: wenn ein 'ensure' da ist dann muss es das erste sein :p [16:54:19] eh, ensure must be the first attribute [16:54:20] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1898506 (10Krenair) Great, next step (once the above commit is ready and the deployment calendar has space): ```{{#invoke:Deployment schedule|row |when=2016-01-11 ? |length=1 |... [16:54:54] hm… for consistency it should then as well be in the definition of the git module, I guess [16:55:08] ? [16:55:29] Well, the first parameter for git::clone is $directory [16:55:31] jenkins seems to have 2 separate things though [16:55:38] it doesn't really matter because they're named [16:55:40] but still [16:55:58] Could not parse for environment production: Syntax error at 'origin'; expected '}' at /mnt/jenkins-workspace/workspace/pplint-HEAD/modules/snapshot/manifests/wikidatadumps/common.pp:18 [16:56:00] hm [16:56:16] yea, there were 2 separate checks that failed [16:56:22] oh, doh... forgot a , after directory [16:56:32] missing , [16:56:35] yes [16:57:16] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1898509 (10fgiunchedi) thanks @physikerwelt that makes sense! [16:57:18] (03PS3) 10Hoo man: snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) [16:57:47] (03CR) 10ArielGlenn: [C: 031] snapshot: mv wikidatadumps classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260186 (owner: 10Dzahn) [16:57:55] (03CR) 10Hoo man: "Made jenkins happy and addressed the issue brought up by Daniel." [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [16:58:17] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: puppet fail [16:58:26] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1538/" [puppet] - 10https://gerrit.wikimedia.org/r/260186 (owner: 10Dzahn) [16:59:10] MaxSem, akosiaris ? [16:59:29] yessir [17:00:04] akosiaris RobH: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151222T1700). Please do the needful. [17:00:21] yes it is, and there are no patches =[ [17:00:47] :p but last week there were 9 [17:01:08] i bet everybody wanted to get in before the freeze [17:01:13] yea [17:01:31] this week would be minor stuff only, and i suppose the minor stuff was pushed in live when it was completed rather than wait ;] [17:01:51] yurik: yup ? [17:02:08] akosiaris, if MaxSem is available, can we do it? [17:02:11] seems like he is [17:02:11] yes [17:02:22] excellente ) [17:02:24] i'm ready [17:02:24] yes [17:02:30] so... 
[17:02:35] lemme upload the path [17:02:37] patch* [17:02:37] max will drive, akosiaris will implement, i will manage ) [17:02:43] nodes.bin should be moved to /srv/osmosis [17:02:43] lol [17:02:45] everyone is busy [17:03:07] and permissions adjusted to rw for osmupdater [17:04:05] state.txt: https://osm.mazdermind.de/replicate-sequences/?Y=2015&m=12&d=14&H=0&i=0&s=0&stream=minute# [17:04:28] MaxSem, where will the updated tiles file go? [17:05:11] to /srv/osm_expire [17:05:34] (03PS1) 10Alexandros Kosiaris: Ensure osm_sync present [puppet] - 10https://gerrit.wikimedia.org/r/260599 [17:05:51] (03CR) 10Alex Monk: add wikimania2017 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [17:06:05] 6operations, 6Analytics-Kanban, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1898546 (10Ottomata) [17:06:12] (03CR) 10Alex Monk: [C: 04-1] "logo has not been put through optipng, I'll do that in the next PS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [17:06:20] MaxSem: so, I switched for a while to day and once per day cron and let's work down from that. [17:06:28] 6operations, 6Analytics-Kanban, 5Patch-For-Review: Move misc/udp2log.pp to a module [3 pts] - https://phabricator.wikimedia.org/T122058#1898550 (10Ottomata) 5Open>3Resolved [17:06:53] (03CR) 10KartikMistry: "Let me know what we need to fix in cxserver. Adding Santhosh, so he will be in loop." [puppet] - 10https://gerrit.wikimedia.org/r/260575 (owner: 10KartikMistry) [17:07:04] (03CR) 10Alexandros Kosiaris: [C: 032] Ensure osm_sync present [puppet] - 10https://gerrit.wikimedia.org/r/260599 (owner: 10Alexandros Kosiaris) [17:07:48] (03PS7) 10Alex Monk: Add wikimania2017.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [17:09:23] akosiaris, daily is ok to catch up, however the more frequent are the updates the less spikey is the load [17:09:59] even minutely updates catch up ok [17:10:25] MaxSem: yup agreed [17:10:29] catching up is the plan [17:11:17] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:11:19] (03CR) 10Dzahn: [C: 031] snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [17:13:01] Krenair: when should the wikimania redirects change? now or Dec 31st or something. i remember this from last year, i think we also did it some time in December [17:13:13] so i'd be fine with now too [17:13:24] mutante, from 2015 -> 2016? [17:13:27] yes [17:13:28] akosiaris, did you move nodes.bin to /srv/osm_expire ? i thought it should be in /srv/osmosis [17:13:30] probably after 2015 finished [17:13:32] MaxSem, ^^ [17:13:36] (03PS4) 10ArielGlenn: snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [17:13:43] * Krenair shrugs [17:14:15] yurik: copying still. I decided it's better to not move it [17:14:19] (03CR) 10Dzahn: [C: 031] "to be merged Jan 1st ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) (owner: 10Alex Monk) [17:14:20] not done yet [17:14:26] Krenair: ok, +1 then :) [17:14:33] as in to have a backup in case everything goes south [17:15:08] (03CR) 10ArielGlenn: [C: 032] snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [17:15:18] akosiaris, yes, but its in ls /srv/osm_expire and in /srv/temp, not in ls /srv/osmosis [17:15:21] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#771227 (10Ireas) [17:15:31] yes I know [17:15:34] ok [17:16:30] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks good, one small nitpick otherwise it's ok to merge" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/260580 (owner: 10Ottomata) [17:17:06] (03CR) 10Nemo bis: [C: 031] "AFAICT jcrespo wants us to verify that the schema change worked, so let's enable Special:PageLanguage." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) (owner: 10TTO) [17:19:15] (03CR) 10Dzahn: [C: 031] "to be merged early January (before the 15th)" [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn) [17:20:15] yurik: MaxSem: ok I think we are ready, running it in a screen for the very first time [17:21:23] the most interesting stuff is in logs anyway [17:22:00] ERROR: permission denied for relation planet_osm_point [17:22:15] hmm what does it need to do ? alter ? [17:22:50] shouldn't need it, just inserts/updates [17:26:08] ah , I know [17:26:47] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:23] MaxSem: yeah, so osm2pgsql recreated all the tables when you guys imported the DB dropping all privileges [17:27:34] I 've reapplied the /usr/local/bin/maps-grants.sql file and we are ok [17:27:46] seems like it's reading file now [17:29:46] errrgggh [17:29:51] (03CR) 10Dzahn: [C: 031] "ok, so given it's a limitation of the compiler, i change it to +1 then for the change itself" [puppet] - 10https://gerrit.wikimedia.org/r/220085 (owner: 10Alexandros Kosiaris) [17:30:21] (03PS1) 10Hoo man: Re-add the dcat configuration [puppet] - 10https://gerrit.wikimedia.org/r/260604 [17:30:25] akosiaris: ^ should we just watch it on etherpad then? [17:30:26] BTW, unrelated, but this morning I increased a bit lab's postgres disk space, I did not investigate much more as there was plenty of free space, just FYI [17:31:58] mutante: no, it will unfortunately not work :-(. It's the same problem with the ::role:: namespace being defined by both the module and the role :-( [17:32:13] if we merge that, all puppet compilation will break [17:32:19] need to figure out how to handle this... [17:32:22] ouch, ok :p [17:32:28] I am starting to think role::etherpad::dummy [17:32:42] and when manifests/role is no more, remove the dummy [17:32:42] i was thinking to use ::main [17:32:47] for the same reason [17:32:51] but of course that breaks hiera lookups [17:33:00] did you see ottomata's suggestion? [17:33:01] (03CR) 10ArielGlenn: [C: 032] Re-add the dcat configuration [puppet] - 10https://gerrit.wikimedia.org/r/260604 (owner: 10Hoo man) [17:33:05] to add a second module path [17:33:20] what would that solve ? [17:33:34] oh you mean the eventlog/eventlog.pp thing ? 
[17:33:38] yes [17:33:39] yeah for new stuff it's fine [17:33:43] it will work [17:33:53] the hiera lookups are the biggest problem tb [17:33:56] tbh* [17:34:18] hmm.. ok.yes [17:34:31] MaxSem: yurik: so java (osmosis) is fetching data from what it seems [17:34:37] (03CR) 10Dzahn: "09:35 < akosiaris> if we merge that, all puppet compilation will break" [puppet] - 10https://gerrit.wikimedia.org/r/220085 (owner: 10Alexandros Kosiaris) [17:34:49] and replaying it to osm2pgsql [17:35:01] I put as a start date 2015-12-10 just to be on the safe side [17:35:22] which is admitedly a bit too much but anyway [17:35:23] akosiaris: if we had a secondary module_path, then no one could create roles and modules with the same name [17:35:32] as roles would just be other modules [17:35:45] ottomata: the problem is the "import roles/*.pp" in site.pp [17:35:51] that's the thing that breaks everything [17:35:56] 6operations, 10ops-codfw, 10fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1898639 (10Papaul) a:3Jgreen Disk replacement complete [17:36:20] the entire idea is to remove that import and move all roles into the role modules [17:36:24] module* [17:36:28] jynus: we are ready to go on the blacklisting [17:36:32] as the import statement is being deprecated [17:36:43] jynus: how long you think the table will be offline? [17:36:58] aye, i'm just nudging towards using a secondary module_path instead of a special role module [17:37:00] seems weird [17:37:11] like we are reproducing manifests/ just at a role:: sublevel [17:37:29] manifests/role/*.pp but yes, that is exactly what we are doing [17:37:38] not the best approach in retrospect of course [17:37:46] nuria, it is difficult to say- the online changes for similar tables took a few hours- less than 1 day [17:38:01] jynus: ok, ottomata , want to merge change? [17:38:03] offline should take less [17:38:24] but that mostly depends on the load of the rest of the database [17:38:54] I als started with the smallest "large size" one [17:39:29] akosiaris: we could just add ::main to every role class that is "role foo" instead of "role foo::bar", so they can all be modules/role/manifests/foo/main.pp ? [17:40:03] MaxSem, where does osmosis puts its logs, into /var? [17:40:08] aren't we almost done just killing all of manifest/role? [17:40:14] mutante: yes that is something that would work, albeit we need to take care [17:40:14] yurik, /tmp [17:40:16] RECOVERY - check_raid on bellatrix is OK: OK: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK] [17:40:19] when I last stopped moving them there were like 3-4 roles left [17:40:20] YuviPanda: no [17:40:29] that aren't in autolayout [17:40:34] yurik: it's a redirect from bash into /tmp/osmosis.log [17:40:38] before we can git mv manifest/role/* to modules/role/manifests/* [17:40:38] YuviPanda: more like 138 [17:40:41] and rm manifests-role [17:40:48] mutante: yeah, but they're all in autoload and can be git mv'd [17:40:54] except for a bunch of the labs* ones [17:41:02] I went through all of them and verified that [17:41:28] how can they be in autoload layout though if they are "role foo" and not "role foo::bar" [17:41:30] yurik: MaxSem seems like it's moving on fine. 
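A sketch of the "populate a state.txt file" step mentioned earlier, using the replicate-sequences URL pasted above, which generates a state.txt for a chosen date for osmosis to start from. It assumes the linked service returns the file body directly; the output path is a placeholder, and the date in the URL is the one from the conversation, not necessarily the start date that was finally used.

import urllib.request

# URL as pasted in the conversation (trailing fragment dropped).
URL = ('https://osm.mazdermind.de/replicate-sequences/'
       '?Y=2015&m=12&d=14&H=0&i=0&s=0&stream=minute')

with urllib.request.urlopen(URL, timeout=10) as resp:
    contents = resp.read()

# Placeholder path; osmosis expects state.txt inside its working directory.
with open('state.txt', 'wb') as fh:
    fh.write(contents)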
I 'll keep an eye on it, for now signing off [17:41:32] (03PS2) 10Ottomata: Blacklisting (temporarily) MobileWikiAppShareAFact schema [puppet] - 10https://gerrit.wikimedia.org/r/260595 (https://phabricator.wikimedia.org/T120187) (owner: 10Nuria) [17:41:38] (03CR) 10Ottomata: [C: 032 V: 032] Blacklisting (temporarily) MobileWikiAppShareAFact schema [puppet] - 10https://gerrit.wikimedia.org/r/260595 (https://phabricator.wikimedia.org/T120187) (owner: 10Nuria) [17:41:42] YuviPanda: did you mean to move them to "init.pp" ? [17:41:43] thanks akosiaris! [17:41:43] akosiaris, will it auto-pick it up afterwards? [17:41:51] mutante: they're 'role::x' and they are in 'modules/role/manifests/x.pp'? [17:42:18] yurik: it's in a screen, once it's finished I 'll just enable the normal minutely syncing process [17:42:35] akosiaris, Failed to open node cache file: No such file or directory [17:42:35] Error occurred, cleaning up [17:42:50] that's in osm2pgsql [17:42:52] MaxSem, ^ [17:43:05] heh [17:43:18] yurik: that's old [17:43:22] so no nodes.bin, no updates [17:43:26] YuviPanda: i thought exactly that doesnt work and it's a limitation of the role keyword [17:43:34] oh [17:43:40] haha, I give up then [17:43:41] again, that's old MaxSem yurik [17:43:45] YuviPanda: and that made me think they all have to be modules/role/manifest/foo/bar.pp [17:43:50] jynus: so we can get started at your convenience, change is merged [17:43:58] i might be wrong [17:44:00] I think that's something we should fix on the role keyword then. [17:44:04] akosiaris, yep, gotcha. thx! [17:44:08] but there isnt a single file like that yet [17:44:08] ::main sounds pretty ugly :) [17:44:16] no role/manifests/foo.pp at all [17:44:21] yeah, since that breaks [17:44:26] ok [17:44:28] you can't have it before you empty manifests/role [17:44:28] I just deleted /tmp/osm2pgsql.log just to avoid confusion [17:44:36] for now it logs in my screen session [17:44:38] ok, let me get out of bed for real, brb [17:44:43] starting from tomorrow it should log over there [17:44:56] akosiaris, excelente, thx! [17:44:58] Processing: Node(320k 0.9k/s) Way(0k 0.00k/s) Relation(0 0.00/s) [17:45:06] not too fast, but ok [17:45:08] c ya [17:45:10] nuria, can you confirm the change took effect, it usually takes ~30 minutes to do it? 
[17:45:35] jynus: it will take effect in next puppet run, let me see [17:45:37] akosiaris, next time we should switch maps2001 with 2002 -- its a much bigger machine [17:45:51] yep [17:47:36] YuviPanda: role foo::role :o [17:47:58] jynus: i think you need to query table cause i can only query slave and with lag is not easy to see whether events are coming in right now, this requires querying the master' [17:48:14] true, there is ~1 hour of delat [17:48:24] (03PS2) 10Ottomata: Add eventbus.svc IPs [dns] - 10https://gerrit.wikimedia.org/r/260580 [17:48:48] (03PS3) 10Ottomata: Add eventbus.svc IPs [dns] - 10https://gerrit.wikimedia.org/r/260580 [17:53:58] (03CR) 10Ottomata: [C: 032] Add eventbus.svc IPs [dns] - 10https://gerrit.wikimedia.org/r/260580 (owner: 10Ottomata) [17:54:14] (03PS1) 10Dzahn: mattermost: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260605 [17:55:13] jynus: we have restarted evenlogging [17:55:28] (03PS2) 10Dzahn: mattermost: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260605 [17:56:11] jynus: so changes should be taking effect, EL restarted [17:56:23] jynus: you can confirm by querying table [17:56:48] 6operations, 10ops-codfw: update ILO firmware on fdb2001 - https://phabricator.wikimedia.org/T84806#1898681 (10Papaul) 5Open>3Resolved a:3Papaul Before update Firmware version 1.5 After update firmware version 2.3 0 Closing this task . [17:57:57] nuria, 20151222175238 is the current max timestamp for that table [17:58:27] that is an addition a few minutes ago [17:58:43] (03PS1) 10Dzahn: bugzilla-static: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260606 [17:58:46] (03PS2) 10Ottomata: Add LVS/PyBal config for eventbus [puppet] - 10https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) [17:58:54] and now it says 20151222175313 [17:59:01] (03CR) 10Ottomata: Add LVS/PyBal config for eventbus (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [17:59:16] i re started just now 9:56 [17:59:31] jynus: I have restarted system at 9:56 [17:59:43] should we wait a few minutes more? [17:59:47] jynus: sorry, PST [17:59:54] jynus: no, change is in effect [18:00:07] https://www.irccloud.com/pastebin/SHp6ayIp/ [18:00:31] jynus: do you see data from 2 mins ago? [18:00:51] no, but 2 minutes ago I saw different data [18:00:57] 6operations, 10hardware-requests: One YubiHSM for the SF office - https://phabricator.wikimedia.org/T122120#1898700 (10RobH) Is there a specific person in OIT this should go to (who is aware of this?) They handle Yubikeys for everyone as well so I don't want this lost in that shuffle. [18:01:13] I can roll the change now, it doesn't affect me [18:01:23] it is just that write will fail [18:01:37] (03PS1) 10Dzahn: bastionhost: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260607 [18:01:46] ok with that? 
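Illustration only, not what was actually run: one way to confirm from the master that no new rows arrive once the schema is blacklisted is to poll the newest timestamp twice and compare. Host, credentials, database and table name below are all placeholders, and the PyMySQL package is assumed to be available.

import time
import pymysql  # assumption: PyMySQL is installed

def newest_timestamp(table='MobileWikiAppShareAFact_12345678'):
    # Placeholder connection details; a real check would target the master.
    conn = pymysql.connect(host='m4-master.example', user='research',
                           password='secret', database='log')
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT MAX(timestamp) FROM ' + table)
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == '__main__':
    before = newest_timestamp()
    time.sleep(120)
    after = newest_timestamp()
    print('writes stopped' if before == after else 'still receiving rows')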
[18:02:03] jynus: there should not be any writes though, let me know if you see any [18:02:38] ok, will roll it now, keep watching processlist if I see a write blocked [18:02:53] (03PS2) 10Dzahn: bastionhost: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260607 [18:02:58] jynus: thank you, cc ottomata [18:03:43] !log rolling schema change (ALTER TABLE ENGINE=TokuDB) on m4-master (db1046) log (eventlogging) [18:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:09] (03PS2) 10Dzahn: bugzilla-static: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260606 [18:05:52] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1898708 (10RobH) 5stalled>3Resolved All associated sub-tasks have been completed, all spares onsite. [18:06:13] nuria, Stage: 1 of 2 'Fetched about 67000 rows, loading data still remains' 0.414% of stage done [18:06:36] it seems white fast [18:06:41] *quite [18:07:00] (03PS1) 10Dzahn: ceilometer: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260608 [18:07:03] 6operations, 10DBA: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1898714 (10RobH) 5Open>3Resolved I think this task is resolved, as these are indeed being decommissioned per other discussions and refreshing the specification for eqiad. Adidt... [18:07:36] think about 1 hour [18:08:16] at the same time that it is compressing, it is defragmenting too, so that will gain some space already [18:09:25] what should I do when finished, should I rever that patch and thats all? [18:10:12] (03PS1) 10Dzahn: debdeploy: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260609 [18:10:20] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1898727 (10RobH) I've removed the patch for review, as it seems that all pending patches on this task have been applied. [18:10:28] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1898728 (10RobH) [18:10:49] !log disabling event scheduling on db1046 [18:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:52] everithing seems ok so far [18:15:47] (03PS1) 10Dzahn: swift: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260610 [18:16:08] ok, time to update wikitech cert woooooooo [18:16:18] andrewbogott: So just so you are aware of the steps I plan to take [18:16:42] 1: disable puppet on silver 2: merge in my certificate change in gerrit 3: merge in the private repo changes to update the private key [18:16:46] 4: run puppet on silver [18:17:01] that should update the certificate and chain for apache, which is typically not difficult these days. [18:17:04] seems reasonable [18:17:25] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [? pts] - https://phabricator.wikimedia.org/T101763#1898743 (10Nuria) The goal of this task is to do enough changes so we can deploy wikimetrics with ease,... 
[18:17:34] !log puppet disabled on silver, going to update wikitech.wikimedia.org certificate [18:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:42] andrewbogott: did you wanna blurb about this in labs channel that apache will rehup? [18:17:53] sure [18:17:57] Although in theory no one will notice [18:18:09] indeed, but if you dont then i'll break apache! ;] [18:18:15] damn murphy and his laws. [18:18:23] done :) [18:20:13] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.034 second response time [18:20:16] (03PS2) 10RobH: new wikitech.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259055 [18:20:19] ......... [18:20:22] i didnt do that. [18:20:25] ;p [18:20:38] (the tool labs thing) [18:20:47] unless stoppign puppet on silver does it [18:20:58] I suppose those are expected due to maintenance? [18:21:08] oh, then they should have suspended the checks no? =P [18:21:33] I would wait for confirmation [18:21:36] (03PS1) 10Dzahn: postgres: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260611 [18:21:38] (what im doing isnt tied to wmflabs.org) [18:21:40] Bad Gateway on tools [18:21:47] or it shouldnt be. [18:22:03] (03CR) 10RobH: [C: 032] new wikitech.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259055 (owner: 10RobH) [18:22:46] andrewbogott: so my changes are ready to roll in but i am not certain if we need to halt due to the toollabs issue [18:22:59] ie: once puppet is re-enabled, it'll go live on silver. [18:23:31] robh: 10:20 < YuviPanda> !log tools failed over active proxy to proxy-01 [18:23:42] i think that and totally unrelated to silver [18:24:04] DNS failure?! [18:24:25] no, working for me [18:24:29] private dns at least [18:24:39] see -labs [18:24:42] hmm [18:25:26] robh: yeah, let’s wait a second [18:26:10] yep, holding off (we are simply stalled though, we will ahve to roll it live or back out within the next hour ;) [18:26:19] i dont want to leave puppet suspended too long [18:26:35] !log silver puppet staying stalled during toollabs issue (we dont want to rehup silver web serivce) [18:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:30] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [? pts] - https://phabricator.wikimedia.org/T101763#1898780 (10Nuria) Things to do: Move alembic and pip install, db creation, any server initialization, a... [18:32:38] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [21 pts] - https://phabricator.wikimedia.org/T101763#1898781 (10Nuria) [18:32:53] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 964715 bytes in 9.638 second response time [18:33:10] woot [18:33:11] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics: Use fabric to deploy wikimetrics {dove} - https://phabricator.wikimedia.org/T122228#1898784 (10madhuvishy) 3NEW a:3madhuvishy [18:33:17] andrewbogott: i think that means we can continue right? 
=] [18:33:20] jynus: ping us via ticket when you are done, we will undo our changes [18:33:35] robh: yep, have at [18:33:44] ok, reenabling puppet now [18:33:53] ok, will do, it will take ~2 hours [18:34:08] silver is updating [18:36:08] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 73 failures [18:36:24] andrewbogott: done and working [18:36:38] !log silver returned to normal service, wikitech.w.o certificate renewed. [18:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:58] robh: yeah, looks good to me. and one less warning! [18:38:57] we're a far cry from our same state of ssl certificate tracking at the same time last year. [18:39:04] (its excellent now ;) [18:41:13] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:45] um [18:41:47] no bueno [18:41:48] works for me, icinga-wm [18:43:23] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 965045 bytes in 7.182 second response time [18:58:00] (03PS1) 10Yuvipanda: dynamicproxy: Add error graphite counts too [puppet] - 10https://gerrit.wikimedia.org/r/260625 [18:58:11] valhallasw`cloud: ^ will add error counts too [18:58:48] (03CR) 10Merlijn van Deen: [C: 031] dynamicproxy: Add error graphite counts too [puppet] - 10https://gerrit.wikimedia.org/r/260625 (owner: 10Yuvipanda) [18:59:02] (03CR) 10Yuvipanda: [C: 04-1] "Will clobber the access log request rate, since it's the same name. figure out how to change name of metric." [puppet] - 10https://gerrit.wikimedia.org/r/260625 (owner: 10Yuvipanda) [18:59:49] doh! [19:00:13] YuviPanda: --metric-prefix=${graphite_metric_prefix}, probably? [19:00:42] it's defined 10 lines up [19:01:00] valhallasw`cloud: no, the actual metric name seems to be $graphite_metric_prefix.line_rate [19:01:05] hmm [19:01:17] maybe the first should be reqstats.something [19:01:23] *nod* [19:01:23] and I should move the metrics first [19:01:26] so we don't lose 'em [19:01:34] or just keep the old one as reqstats [19:01:38] and the new one as errorstats? [19:01:50] dunno if anyone depends on them [19:01:54] nah, they should be reqstats.all and reqstats.error [19:02:03] doubt it, it isn't very well advertised :) [19:02:33] I'll do it once I reach office [19:03:49] (03PS1) 10Andrew Bogott: Add yubikey root key for andrew [puppet] - 10https://gerrit.wikimedia.org/r/260626 [19:09:39] (03CR) 10Florianschmidtwelzow: [C: 031] "Ah, ok :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260541 (https://phabricator.wikimedia.org/T69223) (owner: 10TTO) [19:10:51] (03CR) 10Andrew Bogott: [C: 032] Add yubikey root key for andrew [puppet] - 10https://gerrit.wikimedia.org/r/260626 (owner: 10Andrew Bogott) [19:26:20] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1898887 (10RobH) IRC Update: I synced up with @cwdent via PM yesterday and this can wait for the typical 3 day wait. If there are (still) no objections lodged by tomorrow, I'll merge this a... [19:31:17] 6operations, 6Labs, 10netops, 5Patch-For-Review: Create labs baremetal subnet? - https://phabricator.wikimedia.org/T121237#1898893 (10RobH) There seems to be no actual #patch-for-review associated with this task? (I don't see it linked or a quick grep of gerrit for bug:T121237 shows nothing. I'm going to... [19:31:23] 6operations, 6Labs, 10netops: Create labs baremetal subnet? 
- https://phabricator.wikimedia.org/T121237#1898894 (10RobH) [19:42:18] (03PS1) 10Nuria: Restoring MobileWikiAppShareAFact schema to eventlogging stream [puppet] - 10https://gerrit.wikimedia.org/r/260628 (https://phabricator.wikimedia.org/T120187) [20:10:01] (03CR) 10Ottomata: [C: 032] Restoring MobileWikiAppShareAFact schema to eventlogging stream [puppet] - 10https://gerrit.wikimedia.org/r/260628 (https://phabricator.wikimedia.org/T120187) (owner: 10Nuria) [20:23:16] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:23:39] is that networking equipment? [20:24:45] sounds like it based on https://phabricator.wikimedia.org/T113771 [20:27:21] oob is out of band [20:27:28] "serial consoles" [20:27:42] Krenair: management network (read: no prod traffic) [20:29:27] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 6.57 ms [20:33:53] (03Abandoned) 10Ottomata: logging: rename webrequest-multicast [puppet] - 10https://gerrit.wikimedia.org/r/260194 (owner: 10Dzahn) [20:43:46] (03CR) 10Ottomata: [C: 032 V: 032] Release 2.1.1-1 with support for librdkafka and python3 [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/259335 (owner: 10Ottomata) [21:11:10] !log restbase1008: restarting cassandra to clear up disk space from old stream [21:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:00] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1899247 (10Andrew) 5Open>3Resolved All patches are merged; process is documented at https://wikitech.wikimedia.org/wiki/Labs_Baremetal_Lifecycle [21:14:39] any of you opsen know of a good tool for network analysis? e.g. testing for max bandwidth, packet loss, reordering, latency [21:17:10] marxarelli: tons of answers to that but I often start with ttcp [21:17:21] since it's old school and there is a client for nearly every platform [21:18:48] sweet as! [21:20:15] i'm looking for both myself and the zero team [21:20:54] the zero team wants to build network profiles based off real-world conditions in the field [21:21:13] i'm looking to build profiles for a `tc` patch i wrote a while back for mw-vagrant [21:22:05] !log restbase1003: restarting cassandra to clear up disk space from old stream [21:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:23] marxarelli: ah interesing, i.e. mimicking low bw conditions or something [21:22:38] chasemp: yeah, exactly [21:22:40] there is no silver bullet but usually tcptraceroute/traceroute/mtr/ttcp is a start [21:22:47] chasemp: https://gerrit.wikimedia.org/r/#/c/221143/ [21:22:56] and then a million one-off tools that do it to various degrees of well [21:23:16] the reason I like ttcp is it's pretty simple and standard and I got used it as even cisco IOS has a client baked in [21:23:34] nice for throughput checking for vpn links etc [21:23:34] marxarelli: something like https://jagt.github.io/clumsy/index.html ? 
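A quick-and-dirty sampler in the spirit of the tooling discussion above; the tools actually named are ttcp, mtr and tcptraceroute, while this just times TCP connects to a single host:port to get a feel for latency and connection failures. Host and port are examples.

import socket
import statistics
import time

def sample_latency(host, port, count=10, timeout=2):
    times, failures = [], 0
    for _ in range(count):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                times.append((time.monotonic() - start) * 1000.0)
        except OSError:
            failures += 1
        time.sleep(0.2)
    return times, failures

if __name__ == '__main__':
    rtts, failed = sample_latency('en.wikipedia.org', 443)
    if rtts:
        print('connect min/avg/max ms: %.1f/%.1f/%.1f, failed: %d'
              % (min(rtts), statistics.mean(rtts), max(rtts), failed))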
[21:23:54] (I may be misunderstanding) [21:24:19] chasemp: awesome, thanks for the suggestions [21:24:51] valhallasw`cloud: i think the zero team would be super interested in tools like yeah for _setting_ conditions, yeah [21:25:00] i'm looking for something to record/analyze conditions [21:25:05] valhallasw`cloud: I think that's a kind of windows version of using tc to modify qdisc values [21:26:10] valhallasw`cloud: i'm actually trying to polish a mw-vagrant role that does the simulating with tc/netem, but i need real-world data to build profiles [21:26:30] aah, you want to record data so it can be tested later easily? Super cool. [21:26:41] exactly [21:27:09] yeah, i wrote it like a year ago but for some (dumb) reason decided to abandon it [21:27:30] I need a similar thing in another venue actually :) [21:27:39] tc mods [21:27:54] so now i'm resurrecting it in hopes that UX Research and Zero will find it useful [21:28:00] marxarelli: fyi there is some easier userland tooling like trickle [21:28:05] that lets you do this using preload tricks iirc [21:28:16] if that is a possibility [21:28:24] chasemp: preload tricks? [21:28:57] "It works by preloading its own socket library wrappers, that limit traffic by delaying data." [21:30:37] oh, whoa. that sounds magical :) [21:31:19] i really want to surface it to users in a dead simple way [21:31:55] (03CR) 10Dzahn: [C: 032] Major overhaul of Main Page [debs/wikistats] - 10https://gerrit.wikimedia.org/r/252249 (owner: 10Southparkfan) [21:32:05] (03CR) 10Dzahn: [V: 032] Major overhaul of Main Page [debs/wikistats] - 10https://gerrit.wikimedia.org/r/252249 (owner: 10Southparkfan) [21:32:20] the thought behind using tc was that i could easily build on top of it, easy ways to enable, disable, or otherwise surface profiles [21:32:38] no doubt it's just another less big hammer way to poke at the problem [21:33:01] awesome, i'll check it out [21:45:56] !log restbase1004: restarted bootstrap [21:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:42] !log restbase1004: tune2fs -m 0 /dev/mapper/restbase1004--vg-srv [21:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:08] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [22:03:03] (03Abandoned) 10GWicke: Share the RESTBase config template between production and labs [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [22:03:57] ah ok! glad I didn't merge :) [22:05:46] wrong chat :) [22:06:46] chasemp: Do you have a sec to help me with the Google Webmaster Tools password? :-) [22:10:29] Deskana: sure [22:10:55] chasemp: Great! Let's go to PM. [22:12:39] (03PS1) 10Dzahn: Revert "Major overhaul of Main Page" [debs/wikistats] - 10https://gerrit.wikimedia.org/r/260683 [22:23:04] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1899485 (10Dzahn) @Nuria I was just helping with the DNS, virtual machines are granted by Alex (https://wikitech.wikimedia.org... [22:25:46] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1899489 (10Dzahn) I can help but all there is to it on our side is to delete them once OIT tells us they have been created on their side. The list of aliases is already being emailed automatically to O... 
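A hedged sketch of applying a recorded network profile with tc/netem, in the spirit of the mediawiki-vagrant patch linked above; it is not that patch. It needs root, and the interface name and the delay/loss values are invented placeholders rather than measured profiles.

import subprocess

# Invented profile, roughly "slow mobile link"; real values would come from
# field measurements like the ones the Zero team wants to collect.
PROFILE = {'delay': '300ms 50ms', 'loss': '1%'}

def apply_profile(iface='eth0', profile=PROFILE):
    # All netem parameters are appended to a single qdisc replace command.
    cmd = ['tc', 'qdisc', 'replace', 'dev', iface, 'root', 'netem']
    for key, value in profile.items():
        cmd += [key] + value.split()
    subprocess.run(cmd, check=True)

def clear_profile(iface='eth0'):
    subprocess.run(['tc', 'qdisc', 'del', 'dev', iface, 'root'], check=True)

if __name__ == '__main__':
    apply_profile()     # degrade the link according to the profile
    clear_profile()     # restore normal networking afterwards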
[22:27:27] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:31:13] (03CR) 10Dzahn: [C: 032 V: 032] Revert "Major overhaul of Main Page" [debs/wikistats] - 10https://gerrit.wikimedia.org/r/260683 (owner: 10Dzahn) [22:32:21] (03PS1) 10Dzahn: add wikipedia.co.uk as parked domain [dns] - 10https://gerrit.wikimedia.org/r/260685 [22:32:37] (03PS2) 10Dzahn: add wikipedia.co.uk as parked domain [dns] - 10https://gerrit.wikimedia.org/r/260685 [22:33:20] (03PS3) 10Dzahn: add wikipedia.co.uk as parked domain [dns] - 10https://gerrit.wikimedia.org/r/260685 (https://phabricator.wikimedia.org/T107525) [22:34:37] (03CR) 10Dzahn: [C: 032] "until recently this was showing ads and stuff" [dns] - 10https://gerrit.wikimedia.org/r/260685 (https://phabricator.wikimedia.org/T107525) (owner: 10Dzahn) [22:41:15] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1899521 (10Dzahn) [22:44:51] 6operations: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#1899526 (10Dzahn) [22:44:52] 6operations: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#918630 (10Dzahn) @JKrauska ping? i saw the new ticket T122144 and linked this one [22:44:54] 6operations: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#1899539 (10Dzahn) [22:45:09] 6operations, 7Mail: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#918630 (10Dzahn) [22:47:56] (03PS1) 10Madhuvishy: [WIP] wikimetrics: Puppet module for wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/260687 [23:00:51] 6operations, 10MediaWiki-Watchlist, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1899574 (10Dzahn) [23:04:00] how to get unix timestamp inside puppet? [23:04:19] (besides exec date i guess) [23:04:39] call out to ruby? [23:05:03] $timestamp = generate('/bin/date', '+%Y%d%m_%H:%M:%S') [23:05:06] found this [23:05:21] on puppetlabs.com actually.. hmm..ok [23:09:49] (03CR) 10John Vandenberg: [C: 031] "re +1 after rebase." [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [23:12:36] 6operations, 10Incident-Labs-NFS-20151216: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1899607 (10chasemp) I wasn't looking for it on labstore1002 thinking whatever weird condition is in question here would show up on the effective server, I do... [23:16:54] milimetric: do we store sessions in memory for wikimetrics? [23:16:59] just curious [23:26:00] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1899658 (10RobH) As this is now a virtual machine request, I've removed the #hardware-request tag. [23:26:11] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1899659 (10RobH) [23:29:51] 6operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1899672 (10RobH) We have to have it open for both internal and external vlans for any and all wmf subnets. When any new site is deployed with more than a single server, we'll end up hosting those same tftp files locally t... 
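For comparison with the generate('/bin/date', ...) snippet found above, the same two flavours of timestamp from Python's standard library; puppet itself would still need generate() or an inline_template for this, so this is just an aside.

import time

print(int(time.time()))                      # unix epoch seconds
print(time.strftime('%Y%d%m_%H:%M:%S'))      # same format string as the snippet above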
[23:31:05] 6operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1899674 (10RobH) I should note that if Andrew says lab instance vlan doesnt need it, then it does not need it. (The support vlans will, since they get installed off our normal install systems.) Additionally, we have a sa... [23:42:12] (03PS1) 10Dzahn: ores: enhance monitoring of ores workers [puppet] - 10https://gerrit.wikimedia.org/r/260692 [23:43:20] (03CR) 10jenkins-bot: [V: 04-1] ores: enhance monitoring of ores workers [puppet] - 10https://gerrit.wikimedia.org/r/260692 (owner: 10Dzahn) [23:46:27] (03PS2) 10Dzahn: ores: enhance monitoring of ores workers [puppet] - 10https://gerrit.wikimedia.org/r/260692 [23:49:22] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1540/neon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/260692 (owner: 10Dzahn) [23:59:58] 6operations, 6RevisionScoringAsAService, 10ores, 7Monitoring: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899723 (10Dzahn)