[00:00:09] (03PS2) 10Alex Monk: Make references to tasks/bugs more consistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199574 (https://phabricator.wikimedia.org/T31902) [00:00:15] (03CR) 10Alex Monk: [C: 032] Make references to tasks/bugs more consistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199574 (https://phabricator.wikimedia.org/T31902) (owner: 10Alex Monk) [00:00:20] (03Merged) 10jenkins-bot: Make references to tasks/bugs more consistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199574 (https://phabricator.wikimedia.org/T31902) (owner: 10Alex Monk) [00:00:58] !log krenair Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/199574/ (duration: 00m 08s) [00:01:03] Logged the message, Master [00:01:03] Krenair, let me know when SWAT is over so I can run the FlowFixEditCount.php script as mentioned. [00:01:14] I'm almost done with my huge list of config changes [00:01:25] (03PS3) 10Alex Monk: Add a domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194913 (https://phabricator.wikimedia.org/T91630) (owner: 10Odder) [00:01:33] (03CR) 10Alex Monk: [C: 032] Add a domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194913 (https://phabricator.wikimedia.org/T91630) (owner: 10Odder) [00:01:38] (03Merged) 10jenkins-bot: Add a domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194913 (https://phabricator.wikimedia.org/T91630) (owner: 10Odder) [00:02:04] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/194913/ (duration: 00m 08s) [00:02:09] Logged the message, Master [00:02:45] (03PS2) 10Alex Monk: Enable wgUseRCPatrol for fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199692 (https://phabricator.wikimedia.org/T85381) (owner: 10Mjbmr) [00:02:51] (03CR) 10Alex Monk: [C: 032] Enable wgUseRCPatrol for fawiktionary [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/199692 (https://phabricator.wikimedia.org/T85381) (owner: 10Mjbmr) [00:02:56] (03Merged) 10jenkins-bot: Enable wgUseRCPatrol for fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199692 (https://phabricator.wikimedia.org/T85381) (owner: 10Mjbmr) [00:03:16] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/199692/ (duration: 00m 07s) [00:03:21] Logged the message, Master [00:03:32] superm401, hey! I'm done. [00:03:38] Krenair, thanks. [00:03:47] (03CR) 10Dzahn: [C: 031] Add "always" flag when add HSTS header in Apache [puppet] - 10https://gerrit.wikimedia.org/r/199319 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:03:53] About to run FlowFixEditCount.php on all production wikis. [00:04:15] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform/Ops April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1151749 (10bd808) [00:04:55] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform/Ops April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1096558 (10bd808) [00:06:04] (03CR) 10Dzahn: [C: 032] annual - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199087 (https://phabricator.wikimedia.org/T599) (owner: 10Chmarkine) [00:06:34] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/199087 (https://phabricator.wikimedia.org/T599) (owner: 10Chmarkine) [00:06:44] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [00:08:55] RECOVERY - Host curium is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [00:12:15] PROBLEM - DPKG on curium is CRITICAL: Connection refused by host [00:12:20] (03PS1) 10Dzahn: annual report: load mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/199801 [00:12:44] PROBLEM - Disk space on curium is CRITICAL: Connection refused by host 
[00:12:52] (03CR) 10Dzahn: [C: 032] annual report: load mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/199801 (owner: 10Dzahn) [00:12:55] PROBLEM - configured eth on curium is CRITICAL: Connection refused by host [00:13:07] PROBLEM - dhclient process on curium is CRITICAL: Connection refused by host [00:13:45] PROBLEM - salt-minion processes on curium is CRITICAL: Timeout while attempting connection [00:13:46] PROBLEM - RAID on curium is CRITICAL: Timeout while attempting connection [00:15:54] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [00:16:08] (03CR) 10Dzahn: [C: 031] doc - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/198819 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:16:55] RECOVERY - Host curium is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [00:17:31] (03CR) 10Dzahn: [C: 031] "it's already HTTPS-only, so yes" [puppet] - 10https://gerrit.wikimedia.org/r/199200 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:18:44] (03CR) 10Dzahn: [C: 031] "pybal is on config-master instead of noc nowadays, so i think it's also fine to redirect to https on noc" [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:20:09] (03CR) 10Dzahn: [C: 031] "already https-only, yes" [puppet] - 10https://gerrit.wikimedia.org/r/198469 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:20:26] RECOVERY - DPKG on curium is OK: All packages OK [00:20:55] RECOVERY - Disk space on curium is OK: DISK OK [00:21:05] RECOVERY - configured eth on curium is OK: NRPE: Unable to read output [00:26:43] 6operations, 10Deployment-Systems: Random one-off deployment failure for one host - https://phabricator.wikimedia.org/T93983#1151789 (10Krenair) 3NEW a:3bd808 [00:28:58] (03PS1) 10Alex Monk: Remove another useless wgMasterWaitTimeout reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199803 
(https://phabricator.wikimedia.org/T31902) [00:29:33] (03PS7) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [00:30:24] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [00:33:22] (03CR) 10Dzahn: "@hashar it makes the webserver tell all clients to always use https to connect to this site and never ever use http again. should be ok si" [puppet] - 10https://gerrit.wikimedia.org/r/198819 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:34:43] (03CR) 10Dzahn: [C: 032] "already https-only and behind a login" [puppet] - 10https://gerrit.wikimedia.org/r/199142 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:35:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/199142 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [00:36:49] (03PS8) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [00:37:40] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [00:38:03] 6operations, 10Deployment-Systems: Random one-off deployment failure for one host - https://phabricator.wikimedia.org/T93983#1151838 (10bd808) It appears that mw1010.eqiad.wmnet was running rsync from another source at the same time as mw1012.eqiad.wmnet was syncing from the rsync server on mw1010.eqiad.wmnet.... [00:39:12] robh: can we play with codfw tomorrow? My brain is tired tonight. 18:38 is too late for me to start debugging new stuff. 
[00:40:02] (03PS9) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [00:40:58] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [00:42:30] (03PS1) 10Dzahn: iegreview: load Apache mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/199808 [00:43:23] (03CR) 10Dzahn: [C: 032] iegreview: load Apache mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/199808 (owner: 10Dzahn) [00:43:54] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [00:44:25] RECOVERY - Host curium is UP: PING OK - Packet loss = 0%, RTA = 3.53 ms [00:45:28] (03CR) 10Dzahn: "I tend to agree we shouldn't use patented software when it doesn't make a difference for security." [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [00:48:19] (03PS3) 10Dzahn: puppetception: lint [puppet] - 10https://gerrit.wikimedia.org/r/195613 (owner: 10Matanya) [00:48:36] (03CR) 10Dzahn: [C: 032] puppetception: lint [puppet] - 10https://gerrit.wikimedia.org/r/195613 (owner: 10Matanya) [00:49:15] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1151893 (10Krenair) As long as it's the "Wiki**m**edia Store" [00:49:27] (03CR) 10Dzahn: "@matanya it's right, it's just not related to quoting and the linked bug anymore, technically" [puppet] - 10https://gerrit.wikimedia.org/r/195613 (owner: 10Matanya) [00:50:45] (03PS4) 10Dzahn: reprepro: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/195658 (owner: 10Matanya) [00:51:32] (03CR) 10Dzahn: [C: 032] reprepro: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/195658 (owner: 10Matanya) [00:51:53] (03CR) 10Dzahn: "same here, adjusted commit message and removed bug link. 
the changes are fine though" [puppet] - 10https://gerrit.wikimedia.org/r/195658 (owner: 10Matanya) [00:53:40] (03CR) 10Dzahn: [C: 031] ferm: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [00:55:36] (03PS4) 10Dzahn: haproxy: quoting / lint [puppet] - 10https://gerrit.wikimedia.org/r/195652 (owner: 10Matanya) [00:56:56] !log Done running FlowFixEditCount in production [00:57:05] Logged the message, Master [00:57:59] (03CR) 10Dzahn: [C: 032] "only quotes and only monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/195652 (owner: 10Matanya) [01:00:37] (03CR) 10Dzahn: "I think it makes sense to have the same on all bastions." [puppet] - 10https://gerrit.wikimedia.org/r/196964 (owner: 10Hoo man) [01:04:14] (03CR) 10Dzahn: "i think we are on the way to killing dsh entirely now. we already deleted most groups. should just be about mediawiki-installation at this" [puppet] - 10https://gerrit.wikimedia.org/r/179121 (owner: 10Giuseppe Lavagetto) [01:12:08] (03CR) 10Dzahn: [C: 031] "anyone up to deploy it?" [puppet] - 10https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: 10Glaisher) [01:14:24] (03PS8) 10Dzahn: Add apparmor profiles for avconv/ffmpeg2theora [puppet] - 10https://gerrit.wikimedia.org/r/38307 (https://phabricator.wikimedia.org/T42099) (owner: 10J) [01:15:37] (03PS9) 10Dzahn: Add apparmor profiles for avconv/ffmpeg2theora [puppet] - 10https://gerrit.wikimedia.org/r/38307 (https://phabricator.wikimedia.org/T42099) (owner: 10J) [01:19:56] (03CR) 10Dzahn: [C: 04-1] "@John i guess it's a general thing per https://phabricator.wikimedia.org/T87519 now. network.pp is supposed to be killed. also, my changed" [puppet] - 10https://gerrit.wikimedia.org/r/189196 (owner: 10John F. Lewis) [01:23:03] (03CR) 10Dzahn: "@_joe_ still think it should be blocked? 
adding ori too because of the "graphite moved behind misc-web"-comment on a related change" [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [01:25:36] (03CR) 10Dzahn: [C: 031] resolv: selector outside a resource [puppet] - 10https://gerrit.wikimedia.org/r/195516 (owner: 10Matanya) [01:32:02] (03CR) 10Dzahn: ganglia: autogenerate datasources from the list of clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/198721 (owner: 10Giuseppe Lavagetto) [01:36:07] 6operations, 7Graphite: logins on graphite - https://phabricator.wikimedia.org/T93158#1151995 (10Eevans) @Dzahn right, that describes what I am seeing. It was my hope that if logged in I could save graphs, but it sounds like that's not supported. [01:37:00] (03PS3) 10Dzahn: ganglia: DRY, use hiera [puppet] - 10https://gerrit.wikimedia.org/r/198566 (owner: 10Giuseppe Lavagetto) [01:37:45] (03CR) 10Dzahn: "PS3: fixed trailing whitespace and some indentation in common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/198566 (owner: 10Giuseppe Lavagetto) [01:40:15] 6operations, 7Graphite: logins on graphite - https://phabricator.wikimedia.org/T93158#1152005 (10Dzahn) @eevans i'm afraid it's not. wondering now if we should keep this bug open to support saving graphs and slightly repurpose it [01:47:14] 6operations, 7Graphite: logins on graphite - https://phabricator.wikimedia.org/T93158#1152019 (10Eevans) @Dzahn I can't really speak to how useful this would be, I've never tried it (that's what I was trying to do when I discovered you couldn't login). With http://grafana.wikimedia.org an option, perhaps it's... [01:48:00] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1152021 (10Dzahn) grepping the docroot over trying to figure out the undocumented db schema is a good idea. let's do that. would you have one specific number as an example where i ca... 
[01:49:44] PROBLEM - Host berkelium is DOWN: PING CRITICAL - Packet loss = 100% [01:50:15] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [02:09:24] PROBLEM - Host berkelium is DOWN: PING CRITICAL - Packet loss = 100% [02:09:35] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [02:26:21] !log l10nupdate Synchronized php-1.25wmf22/cache/l10n: (no message) (duration: 05m 03s) [02:26:32] Logged the message, Master [02:28:15] PROBLEM - DPKG on berkelium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:29:50] !log LocalisationUpdate completed (1.25wmf22) at 2015-03-26 02:28:47+00:00 [02:29:55] RECOVERY - DPKG on berkelium is OK: All packages OK [02:29:56] Logged the message, Master [02:45:13] !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 02m 57s) [02:45:20] Logged the message, Master [02:47:41] !log LocalisationUpdate completed (1.25wmf23) at 2015-03-26 02:46:37+00:00 [02:47:46] Logged the message, Master [03:16:36] 6operations, 10ops-eqiad: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1152075 (10faidon) @Cmjohnson, how much time do you think it would take you to unrack C4, rack a replacement and move all the connections? The steps for this are: http://www.juniper.net/techpubs/en_US/junos13.1/t... [03:24:18] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1152077 (10Krinkle) [03:24:52] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#272037 (10Krinkle) So.. where did that revision go? Can we scan the other clusters perhaps? [03:54:59] Is db67 (mentioned in wmf-config/db-secondary.php) still around? wouldn't that've been in pmtpa? 
[03:57:54] http://performance.wikimedia.org/xenon/svgs/daily/2015-03-26.svgz [03:58:08] _joe_: seems like a fair amount of time is spent getting redis connections [04:16:18] (03CR) 10KartikMistry: "Ping!" [puppet] - 10https://gerrit.wikimedia.org/r/198692 (owner: 10KartikMistry) [04:16:33] (03CR) 10KartikMistry: "Ping!" [puppet] - 10https://gerrit.wikimedia.org/r/198693 (owner: 10KartikMistry) [04:16:44] _joe_: for the job runners that is [04:23:50] (03PS1) 10KartikMistry: cx: Add Kannada (kn) in target wiki [puppet] - 10https://gerrit.wikimedia.org/r/199822 (https://phabricator.wikimedia.org/T93134) [04:26:48] (03PS2) 10KartikMistry: Use dblist for contenttranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199576 [04:29:13] (03PS1) 10KartikMistry: cx: Install ContentTranslation in Kannada (kn) wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 (https://phabricator.wikimedia.org/T93134) [04:35:42] (03CR) 10KartikMistry: "Alex, that's quite known issue and we can safely ignore it at moment. 
Whole locale stuff in debian/rules are actually added later when we " [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [04:40:48] 6operations, 10MediaWiki-Vagrant, 6WMF-Legal: RDoc puppet documentation should state license - https://phabricator.wikimedia.org/T93998#1152173 (10Mattflaschen) 3NEW [04:41:54] 6operations, 10MediaWiki-Vagrant, 6WMF-Legal: RDoc puppet documentation should state license - https://phabricator.wikimedia.org/T93998#1152189 (10Mattflaschen) [04:45:55] 6operations, 6Release-Engineering, 6WMF-Legal, 7Documentation: Sphinx generated documentation should state license in footer - https://phabricator.wikimedia.org/T94000#1152198 (10Mattflaschen) 3NEW [04:55:24] RECOVERY - Disk space on lanthanum is OK: DISK OK [05:07:05] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1152245 (10MZMcBride) It's still completely unclear to me why any rename makes sense here. Is there some problem with "Wikimedia Shop" and shop.wikimedia... [05:11:46] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [05:13:05] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0 [05:19:37] !log Ran cleanup script for T92775 [05:19:44] Logged the message, Master [05:35:25] (03CR) 10Dr0ptp4kt: [C: 04-1] "I gotta fix a regex." 
[puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [05:43:53] (03PS1) 10Yuvipanda: tools: Ensure that proxylistener service is running [puppet] - 10https://gerrit.wikimedia.org/r/199830 (https://phabricator.wikimedia.org/T93121) [05:46:01] (03PS2) 10Yuvipanda: tools: Ensure that proxylistener service is running [puppet] - 10https://gerrit.wikimedia.org/r/199830 (https://phabricator.wikimedia.org/T93121) [05:46:24] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Ensure that proxylistener service is running [puppet] - 10https://gerrit.wikimedia.org/r/199830 (https://phabricator.wikimedia.org/T93121) (owner: 10Yuvipanda) [05:48:15] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: /srv 12711 MB (3% inode=99%): [05:50:35] uh oh [05:51:02] man, this timezone is very badly represented, isn’t it? [05:51:13] time between SF leaves office and Europe comes online [05:51:26] hmm [05:51:29] there’s 13G left still [05:51:36] let me just poke on the old bug [05:52:30] 6operations, 10Analytics, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Disk space full on vanadium from logs in /var/log/upstart - https://phabricator.wikimedia.org/T93185#1152291 (10yuvipanda) 5Resolved>3Open > PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free sp... [06:21:15] PROBLEM - HHVM busy threads on mw1034 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [86.4] [06:21:16] (03PS5) 10Dr0ptp4kt: Do not fragment cache with provenance parameter [puppet] - 10https://gerrit.wikimedia.org/r/198805 [06:22:05] PROBLEM - HHVM queue size on mw1034 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [80.0] [06:23:28] (03CR) 10Dr0ptp4kt: "I think I fixed the regex now. When we're capturing the actual parameter value, we need to be sure to disinclude everything else from the " [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [06:24:37] <_joe_> YuviPanda: need help? 
[06:24:46] <_joe_> I'm on a train so my connection is so-so [06:24:53] _joe_: for vanadium? ’tis ok, I think - has enough space, and I’ve notified analytics [06:25:03] <_joe_> ok [06:27:53] <_joe_> who fucking removed the codfw appservers from dsh? [06:27:56] 6operations, 10Deployment-Systems: Use FQDNs for mediawiki-installation - https://phabricator.wikimedia.org/T93983#1152322 (10greg) [06:28:13] 6operations, 10Deployment-Systems: Use FQDNs for mediawiki-installation - https://phabricator.wikimedia.org/T93983#1151789 (10greg) p:5Triage>3Normal [06:29:27] _joe_: we did, because it broke scap [06:29:47] <_joe_> greg-g: how is that possible, they were there for 2 days [06:29:51] <_joe_> and they did not [06:29:51] I'm going to bed now, but it had to be done so scap could complete successfully [06:30:02] whatever, read the bugs and backscroll, I'm going to bed [06:31:05] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] <_joe_> "whatever" is not the right response when people didn't take the time to also remove the icinga check and I wake up to 216 criticals [06:31:55] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] <_joe_> but well, I'll remember this is the appropriate answer to you next time ;) [06:32:36] <_joe_> (not saying it's your fault, to be clear, this is mainly on ops that did not fix it, and gave +2 to the change) [06:33:06] thats me, and then i had a dinner appointment and i thought bd was fixing it (turns out he didnt) [06:33:15] * robh is on the last email check before bed, saw backscroll [06:33:25] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:27] _joe_: so yea, i suppose its my fault.
we disabled it and didnt realize it had the check [06:33:44] and then once it fired off all the criticals i think daniel went to icinga and stopped the check for them [06:33:50] and now i see in backscroll bd didnt fix [06:33:54] <_joe_> he didn't :) [06:33:55] (and i was afk) [06:34:05] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:10] yea, i had dinner plans, so i didnt follow up until now [06:34:15] (i got home a few minutes ago) [06:34:27] there is also a task for this.... lets see [06:34:54] <_joe_> ok, I'm disabling the check and I'll speak with bryan later [06:35:04] RECOVERY - Disk space on vanadium is OK: DISK OK [06:35:07] <_joe_> this is kinda strange as yesterday's scap were working fine [06:35:24] <_joe_> sorry, tuesday's [06:35:24] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:59] yea... the task is assigned to him, but im having trouble finding it [06:36:04] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:04] i dont have proper filtering for phab emails [06:36:05] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:06] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:10] here we go [06:38:10] https://gerrit.wikimedia.org/r/#/c/199763/ [06:38:18] https://phabricator.wikimedia.org/T93958 [06:38:45] ok, im passing out (was tired before i checked computer ;) [06:39:09] wasnt assigned to him, so was looking in wrong search. [06:39:12] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1152373 (10Joe) I see appendonly is manually disabled on the host and puppet is disabled again. I am pretty sure I fixed the replication glitch...
[06:39:16] * MaxSem is slightly proud of having written that check :) [06:40:02] (03PS2) 10Santhosh: Enable ContentTranslation in Kannada (kn) wiki as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 (https://phabricator.wikimedia.org/T93134) (owner: 10KartikMistry) [06:41:09] _joe_: you wake up to a bunch of criticals every day anyway :P [06:41:10] * YuviPanda runs [06:42:54] YuviPanda, well - that means that at least he sleeps from time to time, unlike someone ;) [06:42:56] <_joe_> TYYu not 220 [06:45:22] 6operations, 6Release-Engineering: Re-enable codfw scap targets - https://phabricator.wikimedia.org/T93958#1152382 (10Joe) So, just to understand. We merged that change /before/ morning swat the other day, we had 2 swats and a train deploy who worked just fine before you disabled this (also, not disabling the... [06:45:45] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:45:55] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:26] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:47:58] 6operations, 6Release-Engineering: Re-enable codfw scap targets - https://phabricator.wikimedia.org/T93958#1152386 (10MaxSem) Because eqiad hosts might've attempted to rsync from codfw or something? [06:50:37] (03PS1) 10Giuseppe Lavagetto: mediawiki: don't check dsh group in codfw [puppet] - 10https://gerrit.wikimedia.org/r/199834 [06:51:40] (03PS5) 1020after4: VarnishStatusCollector for diamond. 
[puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [06:53:46] (03PS1) 10Yuvipanda: tools: Ensure that portgranter is running [puppet] - 10https://gerrit.wikimedia.org/r/199835 (https://phabricator.wikimedia.org/T93120) [06:54:19] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: don't check dsh group in codfw [puppet] - 10https://gerrit.wikimedia.org/r/199834 (owner: 10Giuseppe Lavagetto) [06:54:46] (03PS2) 10Yuvipanda: tools: Ensure that portgranter is running [puppet] - 10https://gerrit.wikimedia.org/r/199835 (https://phabricator.wikimedia.org/T93120) [06:55:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 26 06:53:58 UTC 2015 (duration 53m 57s) [06:55:13] Logged the message, Master [06:55:34] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Ensure that portgranter is running [puppet] - 10https://gerrit.wikimedia.org/r/199835 (https://phabricator.wikimedia.org/T93120) (owner: 10Yuvipanda) [06:55:53] _joe_: I merged yours too [06:56:23] <_joe_> YuviPanda: uhm, I think I merged yours [06:56:39] _joe_: haha [06:56:43] <_joe_> on strontium only [06:56:44] _joe_: that scripts needs at least *some* locking [06:56:46] <_joe_> :P [06:56:51] I think both our changes got merged [06:56:53] <_joe_> shit that is /lame/ [06:56:53] TEAMWORK!!!!1 [06:57:02] <_joe_> gee [07:05:04] (03PS6) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [07:05:56] (03CR) 10jenkins-bot: [V: 04-1] VarnishStatusCollector for diamond. 
[puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [07:06:04] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:06:15] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:06:25] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:06:35] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:06:51] (03PS7) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [07:06:55] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:06:56] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:07:42] YuviPanda: mind merging one of mine ? [07:07:51] matanya: the lint ones? Probably not atm... [07:07:55] I’m not even working >_> [07:07:58] * YuviPanda is watching the match. [07:07:58] ok :) [07:08:10] and I don’t want to go back and readup on the consensus atm... [07:08:11] sorry [07:08:15] I can do some next week [07:14:45] (03PS8) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [07:30:10] (03PS9) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [07:30:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] contint: disable hhvm stacktraces / map (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195035 (https://phabricator.wikimedia.org/T64788) (owner: 10Hashar) [07:30:53] <_joe_> YuviPanda: what match? 
[07:31:15] _joe_: cricket world cup semifinal :) [07:31:23] india vs aus [07:31:43] <_joe_> oh, wow [07:31:56] <_joe_> so it's you vs springle [07:35:10] <_joe_> good news is the match will not be over until you're in SF ;) [07:42:07] labs vs. database [07:42:47] <_joe_> oh ffs neon config broken again [07:42:52] <_joe_> I can't believe it [07:45:50] (03PS10) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [07:52:02] (03PS1) 10Giuseppe Lavagetto: icinga: define labsnfs_codfw hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/199839 [07:53:58] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: define labsnfs_codfw hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/199839 (owner: 10Giuseppe Lavagetto) [07:59:17] <_joe_> ook, now "only" 10 criticals [08:06:04] 6operations, 6Labs: Puppet failure on labstore1001 - https://phabricator.wikimedia.org/T92615#1152413 (10yuvipanda) @Coren this has been fixed, hasn't it? [08:19:41] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:05] (03PS11) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [08:29:57] (03PS12) 1020after4: VarnishStatusCollector for diamond. 
[puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [08:31:00] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.087 second response time [08:35:51] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:07] <_joe_> mh I'll check virt1000 [08:41:11] (03PS1) 10Giuseppe Lavagetto: icinga: check its own configuration [puppet] - 10https://gerrit.wikimedia.org/r/199841 [08:41:22] <_joe_> MaxSem: ^^ [08:41:24] <_joe_> :) [08:41:53] kart_: https://gerrit.wikimedia.org/r/#/c/199822/ good to merge at will? [08:45:50] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.278 second response time [08:52:08] (03CR) 10Filippo Giunchedi: VarnishStatusCollector for diamond. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [08:54:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "a few nitpicks, but this is a lint commit, so..." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [08:56:55] (03CR) 1020after4: "finally getting close. tested on beta cluster, this is working! :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [08:58:09] _joe_: you confused me here :) in the first two comments you request to quote, and unquote in the last one for consistency ? [08:58:25] <_joe_> the ensure => 'present' [08:58:37] <_joe_> not the priority => '10' [08:59:08] <_joe_> but well, I'll be off in 5 minutes for about 30 minutes [08:59:38] _joe_: so just to make sure i got it: quote all the ensures, but not the 10's ? 
[08:59:59] <_joe_> or a bit more, depending on the status of public transport after yesterday's near-flooding [09:00:26] <_joe_> matanya: the opposite [09:00:28] (03CR) 10Filippo Giunchedi: "couple of comments, LGTM otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199841 (owner: 10Giuseppe Lavagetto) [09:00:39] <_joe_> unquote ensures, quote the 10's [09:00:57] ah, got it now, makes sense [09:01:10] * matanya should sleep from time to time [09:02:24] (03PS3) 10Matanya: ferm: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) [09:05:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga: check its own configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199841 (owner: 10Giuseppe Lavagetto) [09:10:03] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1152460 (10hashar) >>! In T93076#1151467, @RobH wrote: > These are two fairly low requirement systems, can they share a single host? To... [09:12:31] (03PS13) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [09:13:24] (03CR) 10jenkins-bot: [V: 04-1] VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [09:14:19] (03PS14) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [09:17:25] (03PS15) 1020after4: VarnishStatusCollector for diamond. 
[puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [09:18:15] (03CR) 1020after4: [C: 031] "ok I'm finally done testing this, it's ready for final review if there are no further issues :)" [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [09:19:52] (03PS16) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [09:21:34] (03CR) 1020after4: [C: 031] "spoke too soon. but now it's ready, really! :)" [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [09:29:14] is wikitech login broken for anyone else or just me? [09:39:01] it says: Incorrect password entered. Please try again. but I'm using the same password I always do. also, takes a long time to respond to the login form submit. [09:48:37] twentyafterfour: broken for me also [09:49:57] !log swift weight for ms-be101[678] to 2000 [09:50:21] Could someone please restart keystone? [09:50:27] Wikitech login is stuck again [09:51:47] I'll take a look [09:57:27] !log restart keystone on virt1000 [09:57:35] Logged the message, Master [09:57:38] try again? [09:58:04] works now :) [09:58:05] thanks [09:58:18] (03CR) 10Steinsplitter: "why schoult this affect site perforence? bjorsch aka anome pls explain. (see pagb)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [09:59:50] anomie: you was pinged in phab. Pls comment. [10:00:04] Not helpful blocking tasks witout commenting.... 
**Sigh** [10:05:25] 6operations, 10ops-codfw: ms-be2002.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T94014#1152528 (10fgiunchedi) 3NEW [10:06:52] ACKNOWLEDGEMENT - RAID on ms-be2002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi https://phabricator.wikimedia.org/T94014 [10:07:01] ACKNOWLEDGEMENT - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi https://phabricator.wikimedia.org/T94014 [10:11:55] (03CR) 10Filippo Giunchedi: [C: 031] "no problem, happy to help if I can :) LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [10:12:19] * _joe_ back [10:23:15] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1152563 (10mobrovac) [10:36:32] godog: sorry about the zuul package :( [10:36:46] godog: must be tedious to review since it is one huge change [10:37:05] hashar: nah it is fine, no worries [10:43:32] 6operations, 10MediaWiki-API: mw1135 has errors, depooled - https://phabricator.wikimedia.org/T93626#1152595 (10Joe) Upon inspection, no log file or system metrics suggested any system problem; so I restarted HHVM (which needed to be restarted anyway following some upgrade of shared libraries), and put the ser... [10:48:34] godog: nope. tonight only. 
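The ferm quoting convention _joe_ and matanya settled on at 09:00 ("unquote ensures, quote the 10's") would come out like this in a Puppet resource. A hypothetical fragment for illustration only; the resource and attribute names are made up and this is not the actual content of change 195858:

```puppet
# Illustrative only: bare-word value for ensure, quoted string for a
# number that is really a string (a priority), per the review comments
# discussed above. Names here are invented for the example.
ferm::rule { 'example_rule':
    ensure => present,                          # unquoted bare word
    prio   => '10',                             # quoted: a string, not an integer
    rule   => 'proto tcp dport ssh ACCEPT;',
}
```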
[10:48:41] godog: if you're here, https://gerrit.wikimedia.org/r/198693 [10:48:46] please merge it :) [10:49:00] godog: and https://gerrit.wikimedia.org/r/198692 [10:50:57] (03PS2) 10Filippo Giunchedi: Beta: Add missing 'af' and 'az' [puppet] - 10https://gerrit.wikimedia.org/r/198692 (owner: 10KartikMistry) [10:51:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: Add missing 'af' and 'az' [puppet] - 10https://gerrit.wikimedia.org/r/198692 (owner: 10KartikMistry) [10:51:12] (03PS2) 10Filippo Giunchedi: Apertium: Install needed language pairs for CX [puppet] - 10https://gerrit.wikimedia.org/r/198693 (owner: 10KartikMistry) [10:51:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Apertium: Install needed language pairs for CX [puppet] - 10https://gerrit.wikimedia.org/r/198693 (owner: 10KartikMistry) [10:51:30] Thanks! [10:51:39] hi kart_ [10:52:02] kart_: np, please -1 or -2 the other code review if it isn't supposed to be merged [10:52:11] godog: okay! [10:52:16] Nikerabbit: hola [10:53:39] godog: do we need to do anything to install those packages in cxserver? [10:53:52] or it will be install with next puppet run? [10:55:07] kart_: the latter, puppet will take care of those [10:55:40] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:51] PROBLEM - dhclient process on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:03] (03PS1) 10Hashar: zuul: update zuul-merger init info [puppet] - 10https://gerrit.wikimedia.org/r/199852 [10:56:20] PROBLEM - DPKG on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:40] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:57:00] PROBLEM - puppet last run on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:57:10] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:58:17] <_joe_> mmmh that doesn't look good at all [10:58:40] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [10:58:40] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:58:49] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 72 logical, 72 physical [10:59:00] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [10:59:41] RECOVERY - DPKG on labstore1001 is OK: All packages OK [10:59:52] (03PS2) 10KartikMistry: cx: Add Kannada (kn) and Ukranian (uk) in target wikis [puppet] - 10https://gerrit.wikimedia.org/r/199822 [11:00:00] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [11:01:31] (03PS3) 10KartikMistry: cx: Enable ContentTranslation in Kannada (kn) and Ukranian (uk) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 [11:01:39] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1152632 (10fgiunchedi) >>! In T78514#1150555, @GWicke wrote: > @fgiunchedi: Since we depend on this & are on track to add mor... [11:07:26] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1152634 (10fgiunchedi) ok so ignoring for now mediawiki and eventlogging meters these are the ones that will stop working: ``` graphite1001:~$ grep 'txstatsd meter' rename-all.log | grep -v -e /whisper/job... 
[11:09:18] (03PS2) 10Giuseppe Lavagetto: icinga: check its own configuration [puppet] - 10https://gerrit.wikimedia.org/r/199841 [11:09:31] (03CR) 10Giuseppe Lavagetto: icinga: check its own configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199841 (owner: 10Giuseppe Lavagetto) [11:24:51] (03PS1) 10Steinsplitter: Adding *.loc.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199854 (https://phabricator.wikimedia.org/T94017) [11:35:28] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1152660 (10hashar) [11:42:52] (03PS1) 10Filippo Giunchedi: scap: improve deploy2graphite [puppet] - 10https://gerrit.wikimedia.org/r/199857 (https://phabricator.wikimedia.org/T1387) [11:44:37] 6operations, 10Deployment-Systems, 6Release-Engineering, 5Patch-For-Review: /usr/local/bin/deploy2graphite broken on tin due to nc command syntax - https://phabricator.wikimedia.org/T1387#1152683 (10fgiunchedi) thanks Bryan for looking into this! I've improved the script at https://gerrit.wikimedia.org/r/1... [11:49:35] (03CR) 10Filippo Giunchedi: [C: 04-1] icinga: check its own configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199841 (owner: 10Giuseppe Lavagetto) [11:50:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: update zuul-merger init info [puppet] - 10https://gerrit.wikimedia.org/r/199852 (owner: 10Hashar) [11:50:57] godog: it is tedious to keep all of that in sync :D [11:56:23] hashar: indeedly [12:00:29] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [12:12:24] 7Puppet, 7Browser-Tests: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1152735 (10zeljkofilipin) [12:13:27] 6operations, 10ops-eqiad: asw-c4-eqiad hardware fault? 
- https://phabricator.wikimedia.org/T93730#1152738 (10Cmjohnson) The racking portion will only take 15 minutes...that's removing the old switch and replacing with a new switch. The current JunOS version on asw-c5 is JUNOS 11.4 R6.5. I went to the Juniper... [12:18:20] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:24:40] (03PS1) 10Hashar: zuul: provide sane defaults in init scripts [puppet] - 10https://gerrit.wikimedia.org/r/199861 [12:33:48] 6operations, 6Labs: Puppet failure on labstore1001 - https://phabricator.wikimedia.org/T92615#1152771 (10coren) 5Open>3Resolved a:3coren Puppet applies cleanly, and that error is a DNS fail resolving 'puppet' so I'm a little mystified on how that could even have happened in prod. [12:35:03] 7Puppet, 10Deployment-Systems, 6Release-Engineering: Puppet failure on deployment-sentry2 - https://phabricator.wikimedia.org/T78411#1152777 (10zeljkofilipin) [12:41:25] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1152789 (10Aklapper) >>! In T85141#1150129, @JohnLewis wrote: > * Get the bugid of all security bugs P440 if you want to hardcode (accessible by mutante, not sure how paranoid I shou... [12:46:38] (03PS1) 10KartikMistry: Beta: CX: Add Gujarati as target language [puppet] - 10https://gerrit.wikimedia.org/r/199866 (https://phabricator.wikimedia.org/T93999) [12:46:55] godog: ^^ This can go. [12:49:42] (03CR) 10KartikMistry: [C: 04-1] "Not to merge/deploy until deployment for config is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/199822 (owner: 10KartikMistry) [12:52:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Add Gujarati as target language [puppet] - 10https://gerrit.wikimedia.org/r/199866 (https://phabricator.wikimedia.org/T93999) (owner: 10KartikMistry) [12:54:46] kart_: yup, done [13:04:15] (03PS4) 10Filippo Giunchedi: contint: migrate to require_package() [puppet] - 10https://gerrit.wikimedia.org/r/188034 (owner: 10Hashar) [13:04:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: migrate to require_package() [puppet] - 10https://gerrit.wikimedia.org/r/188034 (owner: 10Hashar) [13:04:36] godog: thanks! [13:12:45] 6operations, 10ops-codfw: Set up pdu's - https://phabricator.wikimedia.org/T84416#1152860 (10mark) eqiad row D and some of row C is missing in LibreNMS too, so those should be setup as well. [13:13:04] 6operations, 10ops-codfw: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1152862 (10mark) [13:13:47] * Steinsplitter send anomie a ":-)" token [13:16:49] 6operations, 10ops-codfw: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1152872 (10mark) p:5Normal>3High [13:20:23] 6operations, 10Wikimedia-Labs-Other, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1152877 (10coren) [13:23:38] (03CR) 10Jgreen: [C: 031] donate - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199200 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [13:43:54] 6operations, 10Beta-Cluster, 6Labs: Core dumps fill up /var on labs instances - https://phabricator.wikimedia.org/T1259#1152920 (10coren) a:5coren>3None The new partitioning scheme has more room in /var for stray core dumps; though this does not address the necessity of cleaning/collecting them as apropr... [13:48:47] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) 
names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1152934 (10coren) 5Open>3Resolved This has been worked around in beta, and the new DNS server (see T... [13:49:56] 6operations, 7network: determine networking for ganeti2001-2006 - https://phabricator.wikimedia.org/T93932#1152939 (10akosiaris) Hello, We got unfortunately a row restriction at this point as with the current networking setup we can not have VMs span rows. That might not be the case in the future either becau... [14:00:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Just so there is confirmation from ops for what Antoine and Timo already said, no pip in production please. It's unfortunately a no go. I " [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [14:03:26] jouncebot, next [14:03:26] In 0 hour(s) and 56 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T1500) [14:10:30] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1152976 (10Krenair) I suspect that if we had this blob laying around before, it may have been lost during https://wikitech.wikimedia.... [14:18:28] (03CR) 10Alexandros Kosiaris: Package builder module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [14:27:55] (03PS3) 10KartikMistry: cx: Add Kannada (kn) and Ukrainian (uk) in target wikis [puppet] - 10https://gerrit.wikimedia.org/r/199822 [14:28:17] (03PS4) 10KartikMistry: cx: Enable ContentTranslation in Kannada (kn) and Ukrainian (uk) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 [14:35:10] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1153031 (10chasemp) >>! 
In T92591#1152373, @Joe wrote: > I see appendonly is manually disabled on the host and puppet is disabled again. I am pr... [14:42:33] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1153055 (10GWicke) > sadly I think the machine is already maxed with 4x SSD, how many metrics you had in mind? orders of magn... [14:45:44] (03CR) 10JanZerebecki: "That argument is like "We should not use FooSoftware as it is copyrighted software, although it is available under a free software licence" [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [14:46:00] paravoid: Can you check https://gerrit.wikimedia.org/r/#/c/199297/4 ? Has your suggested numbers. [14:46:17] 6operations, 7network: determine networking for ganeti2001-2006 - https://phabricator.wikimedia.org/T93932#1153072 (10Papaul) I think the best place to put those server will be Rack in position 5 A5U27 A5U28 B5U15 B5U32 C5U21 C5U27 @Rob what do you think? [14:48:45] 6operations, 7Graphite: logins on graphite - https://phabricator.wikimedia.org/T93158#1153091 (10Dzahn) I think grafana is the answer for that indeed. Let me close this then. We can always reopen. [14:49:03] 6operations, 7Graphite: logins on graphite - https://phabricator.wikimedia.org/T93158#1153092 (10Dzahn) 5Open>3Invalid a:3Dzahn [14:50:54] ^d, thcipriani, marktraceur, Krenair: Who wants to SWAT this morning? [14:51:07] kart_: Ping for SWAT in about 9 minutes [14:51:28] bit busy today [14:51:36] Hrm. 
[14:51:44] I'm going to bow out too, things are exciting [14:51:49] I'll probably be more helpful next month [14:52:14] <^d> jouncebot: next [14:52:14] In 0 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T1500) [14:55:45] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1153127 (10Dzahn) Thank you very much Andre, that was helpful and what we wanted. [14:56:58] here now. [14:57:18] my patch is quite simple :) [14:58:02] anomie: I think I can do this one. As long as whatever was going on with scap yesterday is resolved... [14:58:21] thcipriani: Ok, I'll be around if anything goes wrong. What was going on with scap yesterday? [14:58:58] some of the rsync proxies weren't syncing, I guess. twentyafterfour probably has more detail than that... [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T1500). [15:00:44] (03CR) 10KartikMistry: [C: 031] "Good to go anytime now." 
[puppet] - 10https://gerrit.wikimedia.org/r/199822 (owner: 10KartikMistry) [15:00:58] anomie: sync-common was running real slow and eventually we had to kill some of the ssh tasks ...it seemed to be caused by all the new codfw servers so they were temporarily removed and then everything sync'd quickly [15:01:22] godog: you can merge https://gerrit.wikimedia.org/r/199822 now+5m [15:01:49] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:50] RECOVERY - Host mw2031 is UP: PING WARNING - Packet loss = 93%, RTA = 42.85 ms [15:02:56] kart_: +2ing now [15:03:43] (03CR) 10Thcipriani: [C: 032] cx: Enable ContentTranslation in Kannada (kn) and Ukrainian (uk) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 (owner: 10KartikMistry) [15:04:40] akosiaris: godog can you merge https://gerrit.wikimedia.org/r/199822? Thanks :) [15:05:54] (03CR) 10Alexandros Kosiaris: [C: 032] cx: Add Kannada (kn) and Ukrainian (uk) in target wikis [puppet] - 10https://gerrit.wikimedia.org/r/199822 (owner: 10KartikMistry) [15:07:11] thanks, akosiaris ! [15:07:32] * godog highfives akosiaris [15:08:05] That Grafana thing RULES! [15:09:11] :-) [15:09:38] Do you know where the dashboards are saved? Browser-local storage? [15:09:52] (03Merged) 10jenkins-bot: cx: Enable ContentTranslation in Kannada (kn) and Ukrainian (uk) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199823 (owner: 10KartikMistry) [15:12:55] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:199823]] (duration: 00m 06s) [15:13:03] thcipriani: thanks! [15:13:05] Logged the message, Master [15:13:12] kart_: yup! [15:13:32] (03PS1) 10Dzahn: add check_iostat nagios plugin [puppet] - 10https://gerrit.wikimedia.org/r/199915 (https://phabricator.wikimedia.org/T93783) [15:14:08] kart_: lmk if everything looks correct. 
If so, thus concludes my first solo SWAT [15:16:01] wish my first swat was that simple >_> [15:16:36] Krenair: thankful my first was so simple :) [15:21:25] (03PS1) 10Cmjohnson: Adding init.pp & power.xml entries for ps1-d1 to d8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/199918 [15:24:09] (03PS2) 10Cmjohnson: (fixing white space) adding init.pp & power.xml entries for ps1-d1 to d8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/199918 [15:28:02] robh: mind reviewing https://gerrit.wikimedia.org/r/#/c/199918/1? [15:28:21] ya [15:29:10] (03CR) 10RobH: [C: 031] (fixing white space) adding init.pp & power.xml entries for ps1-d1 to d8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/199918 (owner: 10Cmjohnson) [15:29:36] thx [15:30:02] (03CR) 10Cmjohnson: [C: 032] (fixing white space) adding init.pp & power.xml entries for ps1-d1 to d8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/199918 (owner: 10Cmjohnson) [15:34:58] (03PS9) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [15:42:36] (03PS10) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [15:43:02] (03PS2) 10Dzahn: add check_iostat nagios plugin [puppet] - 10https://gerrit.wikimedia.org/r/199915 (https://phabricator.wikimedia.org/T93783) [15:50:18] akosiaris: hello! I love your package_builder module :) [15:50:46] hashar: :-) [15:50:53] akosiaris: I am confused by your comment on https://gerrit.wikimedia.org/r/#/c/194471/8/modules/package_builder/manifests/init.pp [15:51:00] about the apt.wm.o / thirdpary component [15:51:03] thanks for the comments. I am actually implementing them right now [15:51:14] I think we had some custom/backport packages uploaded under 'main' [15:51:21] (03CR) 10Chmarkine: "I think so because patent was one of the reasons that Wikimedia Commons decided not to support MP4 video format." 
[puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [15:51:26] oh, so the idea is to be able to build packages that rely as little as possible on apt.w.o [15:51:33] unless that is impossible [15:51:57] that is why I proposed to have pristine 'jessie' and a custom 'jessie-wikimedia' [15:52:00] in which case, hooks should be fine to add that functionality [15:52:13] this way when we build for 'jessie' we are sure there is nothing from our repo coming in :D [15:52:46] (03PS1) 10Gerardduenas: Create 'Portal' and 'Portal_Discussió' namespaces at cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199923 (https://phabricator.wikimedia.org/T93811) [15:52:50] on my machine I end up with precise-wikimedia trusty-wikimedia jessie-wikimedia and jessie [15:52:56] yeah the more I am working on it, the more I see -wikimedia basepaths under /var/cache/pbuilder becoming true [15:53:32] so I think I am slowly converging to what you have in mind [15:53:41] last night, at looked at jenkins debian glue (the shell script that make it trivial to build package in Jenkins). 
It would be able to reuse those pbuilder images [15:53:49] so in Jenkins we would configure a repo with something like: [15:53:52] name: varnish [15:54:02] - varnish-debian-glue: [15:54:07] jessie-wikimedia [15:54:45] and if we ever maintain a package that has to be for both upstream jessie and our we can pass both distributions (and thus trigger two jobs) [15:54:55] - our-supersoftware: [15:55:02] - jessie-wikimedia [15:55:06] - jessie [15:55:08] - sid [15:55:09] etc [15:55:22] yeah sounds like the most flexible approach [15:55:28] and quite nice indeed :-) [15:55:55] but you will have to detect the $DIST is suffixed with -wikimedia to (a) inject the apt.wikimedia.org (b) add the 'thirdparty' component [15:55:59] maybe it can be done in the hook [15:58:13] (03CR) 10Hoo man: [C: 031] "Approving from an AbuseFilter perspective (no idea what the community things, this can be disruptive on their side): This should not cause" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [15:58:21] ^d: I just created a gerrit project with a typo in the name. Can I delete it from the web interface or do I need to appeal to a higher power for cleanup? [15:59:13] <^d> `ssh -p 29418 gerrit.wikimedia.org deleteproject delete --yes-really-delete path/to/bogus/repo` [15:59:26] <^d> Then I'll have to clean up replication targets in Gitblit and Github [15:59:32] <^d> What's the repo name? [15:59:46] sink_nova_mixed_multi [16:00:00] (03PS6) 10BBlack: Do not fragment cache with provenance parameter [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [16:00:03] It should’ve been sink_nova_fixed_multi (which I have since also created and am now using) [16:00:05] kart_, ^d: Dear anthropoid, the time has come. Please deploy Content Translation/cxserver (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T1600). [16:00:25] ^d: ah, sounds like you have more urgent things to attend to. 
Should I just make you a phab task to cleanup at your leisure? [16:00:48] <^d> I just nuked the thing from Github [16:01:20] <^d> You can do the ssh invocation and gitblit (antimony:/var/lib/git/*.git) [16:01:23] <^d> Just rm -R [16:01:30] ok, doing now. [16:01:31] Thanks [16:01:34] akosiaris: and the reason I asked for $arch to be added to the basepath, it is because that is jenkins debian glue default behavior. Can be overriden anyway [16:02:10] hashar: yeah, I think I am gonna add it anyway. Makes sense [16:02:45] akosiaris: from there, it will be trivial to add a bunch of Jenkins jobs :) [16:03:34] Running little late for deployment today. [16:03:55] ^d ^^ [16:04:00] (03CR) 10BBlack: "There was a bunch of excess complexity from (a) unnecessary use of grouping parentheses where no grouping was needed and (b) using groupin" [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [16:06:08] (03PS1) 10Gerardduenas: Add import sources for cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199927 (https://phabricator.wikimedia.org/T93750) [16:06:17] (03CR) 10jenkins-bot: [V: 04-1] Add import sources for cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199927 (https://phabricator.wikimedia.org/T93750) (owner: 10Gerardduenas) [16:09:33] 6operations, 10ops-codfw: Receive 6 new misc virt cluster nodes - https://phabricator.wikimedia.org/T91977#1153423 (10RobH) [16:09:34] 6operations, 7network: determine networking for ganeti2001-2006 - https://phabricator.wikimedia.org/T93932#1153421 (10RobH) 5Open>3Resolved @papaul, Your proposed layout is directly in contradiction with @akosiaris statement that they have to all remain in the same row. If we didn't have that restriction... 
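The pbuilder hook hashar floats at 15:55 — detect a $DIST suffixed with "-wikimedia", then inject apt.wikimedia.org with the 'thirdparty' component, while pristine distributions get no extra repository at all — could be sketched roughly as below. This is a guess at the shape, not the package_builder module's actual hook: the repository URL and component layout are assumptions, and a real hook would append the line to the chroot's sources.list and run apt-get update, which is elided here.

```shell
# Sketch of the hook logic discussed above (hypothetical names/URLs).
wikimedia_apt_line() {
    dist="$1"
    case "$dist" in
        *-wikimedia)
            # e.g. jessie-wikimedia: add the Wikimedia repo, thirdparty included
            echo "deb http://apt.wikimedia.org/wikimedia $dist main thirdparty"
            ;;
        *)
            # pristine distribution (jessie, sid, ...): emit nothing, so
            # builds cannot accidentally pull from apt.wikimedia.org
            ;;
    esac
}

wikimedia_apt_line jessie-wikimedia
wikimedia_apt_line jessie
```

Running it prints the extra apt line only for the -wikimedia variant, which matches the "pristine 'jessie' vs custom 'jessie-wikimedia'" split described in the conversation.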
[16:11:23] (03PS2) 10Gerardduenas: Add import sources for cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199927 (https://phabricator.wikimedia.org/T93750) [16:15:58] 6operations, 10ops-codfw: rack/onsite setup of ganeti2001-2006 - https://phabricator.wikimedia.org/T91977#1153445 (10RobH) [16:17:35] ok. back. [16:18:56] (03PS1) 10RobH: setting ganeti2001-2006 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/199929 (https://phabricator.wikimedia.org/T91977) [16:19:02] (03PS1) 10Gerardduenas: Enable GeoData at cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199930 (https://phabricator.wikimedia.org/T93637) [16:19:52] 6operations, 10ops-codfw: set asset tag mgmt dns entries - https://phabricator.wikimedia.org/T94041#1153461 (10RobH) 3NEW a:3Papaul [16:20:03] 6operations, 10ops-codfw, 5Patch-For-Review: rack/onsite setup of ganeti2001-2006 - https://phabricator.wikimedia.org/T91977#1153468 (10RobH) p:5Triage>3High [16:20:44] (03CR) 10RobH: [C: 032] setting ganeti2001-2006 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/199929 (https://phabricator.wikimedia.org/T91977) (owner: 10RobH) [16:20:54] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: puppet fail [16:21:13] (03PS1) 10KartikMistry: CX: Add missing source languages [puppet] - 10https://gerrit.wikimedia.org/r/199931 [16:21:24] 6operations, 10ops-codfw: mw2088 has a faulty RAM - https://phabricator.wikimedia.org/T93370#1153471 (10Papaul) 5Open>3Resolved memory replacement complete. [16:22:50] <^d> kart_: You shouldn't need me anymore with the key fixes we've done in prod :) [16:23:00] <^d> scap & co don't use forwarded keys, and the repos are all https [16:23:29] ^d: yay! [16:23:34] <^d> (the latter having been your problem) [16:23:38] ^d: yes. I did it since last two deployment [16:24:53] <^d> akosiaris: Is there actually a way in sshd_config or somesuch to disable logging in by users who have their agent forwarded? 
[16:25:01] <^d> (because that'd be cool :p) [16:29:31] akosiaris: can you merge https://gerrit.wikimedia.org/r/#/c/199931/ if not sleepy? [16:29:34] Thanks :) [16:29:44] ^d: ForwardAgent no [16:30:06] kart_: meeting [16:30:20] kart_: will do later today [16:30:20] ^d: oh, i got that wrong, that's client config [16:30:42] akosiaris: no worry. [16:31:28] <^d> mutante: Yeah, I've got that on my client and everyone else should too :) [16:31:39] <^d> But I was wondering if the daemon could detect the case and yell [16:32:59] ^d: yea, gotcha. i'm trying to find out now:) [16:33:08] 6operations: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1153480 (10RobH) 3NEW a:3RobH [16:33:46] <^d> mutante: AllowAgentForwarding? [16:34:42] <^d> Specifies whether ssh-agent(1) forwarding is permitted. The default is "yes". Note that disabling agent forwarding does not improve security unless users are also denied shell access, as they can always install their own forwarders. [16:35:46] ^d: yes!:) [16:36:53] (03CR) 10Santhosh: [C: 031] CX: Add missing source languages [puppet] - 10https://gerrit.wikimedia.org/r/199931 (owner: 10KartikMistry) [16:37:04] ^d: let's make a patch for that? did you want to or should i? [16:37:47] <^d> Installing your own forwarder would be a no-no and probably violate the server agreement [16:38:20] yea [16:38:53] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:39:11] <^d> This is mainly to keep people from accidentally doing unsafe things :) [16:39:18] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1153498 (10Steinsplitter) [16:39:23] even if it doesn't help against malicious users it.. 
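The daemon-side knob ^d found is a one-line sshd_config setting; per the sshd_config(5) text quoted at 16:34, it only genuinely improves security when combined with restricted shell access. A minimal fragment (the actual contents of change 199936 are not shown in this log, so this is an assumption about its intent):

```
# /etc/ssh/sshd_config
AllowAgentForwarding no
```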
[16:40:17] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1153510 (10RobH) I'm not sure what part I have not addressed. We are currently planning to allow #wmf-nda access to the proposed #procurement project tasks. We do NOT want to have to start maintain... [16:42:13] (03PS3) 10Steinsplitter: cleanup: upload has been disabled on outrechwiki, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196885 [16:42:34] FYI: I've just told Jenkins to restart safely, which means it's going to wait on some currently running jobs and then restart. Sorry for the inconvenience. [16:44:42] greg-g: geeee! :) [16:44:54] Hope that I won't miss my window [16:45:16] sorry :( [16:45:26] it's to fix the beta code update lock [16:45:55] (03PS1) 10Chad: [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 [16:46:00] <^d> mutante: Feel free to throw a dozen folks on that ^ [16:46:02] <^d> Or 2 [16:46:10] (03CR) 10JanZerebecki: "Yes the argumentation in that RFC is about free/open standards/formats, but not about mere existence of a patent. With patents like with c" [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [16:46:44] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 6Scrum-of-Scrums: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1153531 (10Anomie) [16:47:21] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 6Scrum-of-Scrums: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1020616 (10Anomie) (updating bug title since I see a newer version in [[ht... 
[16:47:52] ^d: ok, i was talking to a guy in #openssh about it :) [16:48:05] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3§ Wikidata-Sprint-2015-03-24: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1153537 (10hoo) [16:49:50] <^d> mutante: Ah? [16:51:29] ^d: well, yea. what we already knew mostly. < BasketCase> note that agent forwarding is a security risk for them not you" [16:51:29] !log kartik Started scap: Update ContentTranslation [16:51:36] Logged the message, Master [16:51:55] withing window greg-g :) [16:52:33] <^d> mutante: For the "user" [16:52:39] <^d> If the user has the wrong key in their chain ;-) [16:52:41] ^d: well, also: < BasketCase> it may not be completely safe but it is very useful sometimes and can be done in safer than default ways < BasketCase> it can still be useful. If I forward an agent to a system that I am not the only root person on I require confirmation for access. [16:52:59] (because i mentioned having labs and prod etc) [16:53:46] <^d> Which it's not a bad idea, but "it's useful for servers I have the only root on" isn't really compelling to me ;-) [16:54:39] <^d> imho "people accidentally exposing themselves" trumps "this is easy!" [16:56:15] <^d> Anyway, worth having a discussion about :) [16:57:41] ^d: isn't it good that we discovered this after so many years? (agent fwd) :) [16:58:09] <^d> Well, nobody ever thought it was /safe/ :p [16:58:23] <^d> We just knew it was hella convenient and hadn't bothered getting it out of the deployment process [16:58:25] that's why different keys for labs and prod [16:58:58] <^d> Yes, but people accidentally put both keys into the same agent when switching between the two [16:59:04] <^d> And boom your prod key is exposed to labs [16:59:34] PROBLEM - puppet last run on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
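[Editor's note: Chad's patch (Gerrit change 199936) is not quoted in the log; server-side, disabling agent forwarding presumably comes down to a single sshd_config directive, sketched here, not copied from the actual patch:]

```
# /etc/ssh/sshd_config (sketch, not the actual Gerrit 199936 diff)
# Per the sshd_config man page quoted above: this only deters
# accidental forwarding; users with shell access can always
# install their own forwarders, so it protects users from
# themselves rather than protecting the server.
AllowAgentForwarding no
```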
[16:59:42] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:44] they would forward their agent to labs? [16:59:58] <^d> Again, best practices already discourage the behaviors we're talking about here, I'm mainly talking about people accidentalling themselves. [17:00:02] PROBLEM - dhclient process on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:03] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:23] PROBLEM - DPKG on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:42] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:01] <^d> Krenair: My line of reasoning is "nobody should forward anyway via best practices" -> "let's make it harder to accidentally do so then" [17:04:06] !log Jenkins is stuck trying to restart (it's having problems stopping). We're on it over in -releng [17:04:44] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [17:04:45] can I get a root, actually, to help kill Jenkins on gallium if what bd808 tries now doesn't work? [17:04:53] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 72 logical, 72 physical [17:05:12] RECOVERY - DPKG on labstore1001 is OK: All packages OK [17:05:23] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [17:05:30] mutante: robh help with kill Jenkins plz? [17:06:04] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [17:06:05] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:06:21] Stopping with the init script on gallium didn't work so it needs a kill -9 from a root [17:06:57] what needs killing? 
[17:07:10] https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart_all_of_Jenkins [17:07:13] hardcore way [17:07:25] Jenkins process on gallium. PID 1597 [17:07:26] doing now [17:07:44] Krinkle: don't look now, but Jenkins is dead [17:07:48] ok, done [17:07:54] kill -9 and restarted per instrucitons [17:08:02] Infallible way to stop anything: "racadm serveraction hardreset" :-) [17:08:07] starting now [17:08:08] greg-g: thx for link =] [17:08:51] oh, morebots is gone [17:09:12] (but not qa-morebots) [17:09:31] heh. it's trying to join #wikimedia-qa [17:09:35] jenkins is [17:10:03] lots of gearman things happening in the log [17:11:03] hmm, I wonder where that's set (joining -qa) [17:11:27] probably the default for the irc bot config [17:12:26] it should have started by now :( [17:12:43] "Please wait while Jenkins is getting ready to work" [17:13:36] 5-20 minutes [17:15:25] ok. it's only been about 6 minutes [17:16:27] * Krinkle updates documentation and merges mw.org/Jenkins#restart and wikitech Jenkins#restart [17:16:45] Krinkle: ty [17:22:46] https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart [17:22:50] greg-g: yw [17:22:58] (03PS1) 10Thcipriani: Add silver dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199940 [17:23:31] greg-g: eh. we found issue while deployment is going on. Should I redo extension update once scap is over? [17:24:43] !log kartik Finished scap: Update ContentTranslation (duration: 33m 14s) [17:24:51] Logged the message, Master [17:24:54] kart_: sure [17:26:29] greg-g: thanks! 
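[Editor's note: the graceful-then-forceful stop applied to Jenkins on gallium can be sketched as a small shell helper. The PID and host above are from the log; this demo uses a throwaway SIGTERM-ignoring process instead of Jenkins:]

```shell
# Try SIGTERM first, wait briefly, then fall back to kill -9 --
# mirroring the "init script didn't work, needs kill -9" sequence.
stop_hard() {
    pid=$1
    kill "$pid" 2>/dev/null
    for _ in 1 2 3; do
        kill -0 "$pid" 2>/dev/null || { status_msg="stopped gracefully"; return 0; }
        sleep 1
    done
    kill -9 "$pid"
    status_msg="killed with SIGKILL"
}

# Demo victim: a shell that ignores SIGTERM, like a hung Jenkins.
sh -c 'trap "" TERM; sleep 60' &
victim=$!
stop_hard "$victim"
wait "$victim" 2>/dev/null   # reap the child so the PID is really gone
echo "$status_msg"
```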
[17:26:34] (03CR) 10Dzahn: [C: 031] [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [17:28:32] (03PS1) 10Dzahn: move jobqueue monitoring out of ganglia.pp [puppet] - 10https://gerrit.wikimedia.org/r/199942 [17:28:34] (03PS1) 10Dzahn: ganglia: remove class ganglia::logtailer [puppet] - 10https://gerrit.wikimedia.org/r/199943 [17:29:22] (03PS2) 10Dzahn: move jobqueue monitoring out of ganglia.pp [puppet] - 10https://gerrit.wikimedia.org/r/199942 (https://phabricator.wikimedia.org/T93776) [17:29:27] (03PS2) 10Dzahn: ganglia: remove class ganglia::logtailer [puppet] - 10https://gerrit.wikimedia.org/r/199943 (https://phabricator.wikimedia.org/T93776) [17:30:45] (03PS4) 10Dzahn: ganglia: DRY, use hiera [puppet] - 10https://gerrit.wikimedia.org/r/198566 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [17:31:02] (03PS2) 10Dzahn: ganglia: autogenerate datasources from the list of clusters [puppet] - 10https://gerrit.wikimedia.org/r/198721 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [17:31:06] (03PS2) 10Dzahn: ganglia: remove unused configs from ganglia::collector::config [puppet] - 10https://gerrit.wikimedia.org/r/198720 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [17:32:15] 6operations, 10Analytics, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Disk space full on vanadium from logs in /var/log/upstart - https://phabricator.wikimedia.org/T93185#1153713 (10Nuria) Given that eventlogging does not write anything to /srv (other than code source, which is tiny)... 
[17:33:32] (03CR) 10Gerardduenas: [C: 031] Remove another useless wgMasterWaitTimeout reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199803 (https://phabricator.wikimedia.org/T31902) (owner: 10Alex Monk) [17:33:43] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:16] (03CR) 10KartikMistry: [C: 031] [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [17:36:28] (03CR) 10Thcipriani: [C: 032] Add silver dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199940 (owner: 10Thcipriani) [17:36:33] (03Merged) 10jenkins-bot: Add silver dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199940 (owner: 10Thcipriani) [17:37:33] (03PS3) 10Dzahn: ganglia: autogenerate datasources from the list of clusters [puppet] - 10https://gerrit.wikimedia.org/r/198721 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [17:37:47] (03CR) 10Gerardduenas: [C: 031] cleanup: upload has been disabled on outrechwiki, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196885 (owner: 10Steinsplitter) [17:38:06] (03CR) 10Dzahn: "changed "octect" to "octet", fixed minor whitespace" [puppet] - 10https://gerrit.wikimedia.org/r/198721 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [17:42:49] 6operations, 10Analytics, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Disk space full on vanadium from logs in /var/log/upstart - https://phabricator.wikimedia.org/T93185#1153740 (10Nuria) Issues with getting a larger infklow of events and thus disk getting filled up on /srv (as event... [17:43:35] (03CR) 10Dzahn: [C: 031] "confirmed. File uploads are disabled." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/196885 (owner: 10Steinsplitter) [17:45:09] 6operations, 10Analytics, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Disk space full on vanadium from logs in /var/log/upstart - https://phabricator.wikimedia.org/T93185#1153751 (10Nuria) Pending ticket to sap vanadium box: https://phabricator.wikimedia.org/T90363 [17:50:25] (03PS1) 10MaxSem: Deploy Gather in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199950 [17:50:44] (03PS1) 10Ori.livneh: Configure OCG to report counters rather than meters [puppet] - 10https://gerrit.wikimedia.org/r/199952 [17:50:47] godog: ^ [17:51:23] (03PS1) 10MaxSem: Enable Gather on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199953 [17:54:27] (03CR) 10Filippo Giunchedi: [C: 031] Configure OCG to report counters rather than meters [puppet] - 10https://gerrit.wikimedia.org/r/199952 (owner: 10Ori.livneh) [17:55:12] ori: LGTM, thanks! [17:56:08] ok. question. [17:56:25] I just re updated extensions/CX after full scap run. [17:56:39] Do I need to run scap again or anything quick? [17:57:22] godog: https://github.com/wikimedia/service-runner/pull/27 should take care of citoid and parsoid [17:57:50] ori: are you moving away from txstatsd? [17:57:58] gwicke: no, running away [17:58:10] or "fleeing from" [17:58:13] what to? [17:58:20] statsite [17:58:36] did you find a solution for the output differences? [17:58:36] ori: sync-dir php-1.20wmf12/extensions/Whatever - still recommaded for small change in extension? [17:58:45] * Coren adopts grafana. [17:58:49] kart_: yes. [17:58:55] ori: thanks! [17:59:21] (03PS11) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [17:59:23] (03CR) 10Alex Monk: "Looks like this deploys to all of the beta cluster wikis, but does not enable it in production..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/199950 (owner: 10MaxSem) [17:59:29] godog: can you fill gwicke in? [18:00:41] I will name it George, and I will hug it and pet it and squeeze it. [18:01:10] ori, godog: is there info on the task? [18:01:49] * gwicke looks at https://phabricator.wikimedia.org/T90111 [18:01:58] (03PS1) 10Nuria: Vanadium to keep logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/199957 (https://phabricator.wikimedia.org/T93185) [18:02:07] (03CR) 10MaxSem: "Yep, there's a followup;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199950 (owner: 10MaxSem) [18:02:55] gwicke: yep that's the one [18:03:11] does it compute the same derived metrics like rate, percentiles etc? [18:04:10] those are quite popular in dashboards [18:04:46] PROBLEM - uWSGI web apps on graphite1002 is CRITICAL: NRPE: Command check_uwsgi not defined [18:05:02] gwicke: so for counters the value is pushed, see the difference here https://phabricator.wikimedia.org/T90111#1058631 [18:05:12] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1153802 (10GWicke) Which changes do you expect in computed metrics like rate, percentiles, mean / median etc? Will the keys be the same, so that dashboards keep working? [18:05:27] PROBLEM - Graphite Carbon on graphite1002 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [18:05:58] godog: ah, thx [18:06:43] godog: is it possible to configure the names of the derived metrics to match those of (tx)statsd? 
[18:07:22] gwicke: not out of the box, no [18:07:42] !log kartik Synchronized php-1.25wmf22/extensions/ContentTranslation: (no message) (duration: 00m 08s) [18:07:49] Logged the message, Master [18:08:06] !log kartik Synchronized php-1.25wmf23/extensions/ContentTranslation: (no message) (duration: 00m 08s) [18:08:12] Logged the message, Master [18:09:30] (03PS12) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [18:12:34] godog: would adapting the metric names require monkey-patching? [18:12:58] (03PS3) 10Alexandros Kosiaris: Ganeti module/role introduced [puppet] - 10https://gerrit.wikimedia.org/r/198794 (https://phabricator.wikimedia.org/T87258) [18:13:18] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1153830 (10fgiunchedi) the basic steps of the transition including renaming are here: https://phabricator.wikimedia.org/T90111#1065553 essentially: * rename metrics from txstatsd names to statsite names **... 
[18:13:58] gwicke: it is all C, so patching alright but monkeys won't be involved I think [18:14:45] godog: no monkeys were harmed in the production of this patch ;) [18:15:34] hehe indeed [18:15:45] godog: more seriously, changing the names in one place (statsite source) would probably be less work than fixing all the dashboards [18:16:28] carrying an upstream patch forever is not a good idea, though [18:16:33] dashboards is something we control internally [18:16:59] & if there is an ultra-fast successor of statsite at some point written in assembly it would be more likely to use the same names as statsd / txstatsd I think [18:17:57] paravoid: my main worry is that we'll only get more manually created dashboards over time [18:18:10] and updating all those is rather expensive compared to patching the source [18:18:23] we can revisit "over time" [18:18:29] not a real issue right now aiui [18:19:05] we need to rename metrics again down the line too anyway [18:19:08] gwicke: assembly? we are going to make a RISC architecture that does just statsd [18:19:09] I know that changing just the dashboards I use will take maybe an hour [18:19:16] right now the hierarchy is in still a bit of a mess [18:19:33] and we are about to create a more of them for cassandra [18:19:43] gdash? [18:19:50] grafana [18:19:53] (03PS13) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [18:20:11] well ok dunno in that case [18:20:18] grafana is in a bit of an experimental state anyway [18:20:31] proper clustering never worked (because of graphite metric naming/assumptions grafana makes) [18:20:56] can we rename in grafana programmatically? e.g. with sed? [18:21:06] there is also alerts [18:21:13] godog: will it take less than an hour? 
:) [18:21:14] we are just about to set up a couple for RB/C* [18:21:16] doubt it [18:21:29] paravoid: yeah sed is pretty fast [18:21:51] afaik the dashboards are stored in elasticsearch [18:22:08] that's my understanding as well [18:22:40] I do think that changing now is doable [18:23:05] but if we ever move away from statsite it'll be more expensive [18:23:53] yep, every statsd implementation basically made up names AFAICT [18:24:54] actually wonder if statsd / txstatsd use the same [18:25:59] actually, statsd uses upper/lower instead of min/max [18:26:10] which matches statsite [18:27:08] (03PS14) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [18:29:20] godog: from what I see there is no agreement on percentiles at all, so whatever we do will probably break dashboards [18:29:55] heh, yeah it is a sad ecosystem [18:32:47] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1153922 (10GWicke) For the record, it looks like statsite and statsd actually agree on lower/upper (txstatsd is the odd one out with min/max), but none of those agree on percentiles. So, +1 for going with n... [18:32:54] (03CR) 10Dzahn: [C: 04-1] "eh, we don't want the 2009 version, we want the 2014 version :p" [puppet] - 10https://gerrit.wikimedia.org/r/199915 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [18:34:24] (03CR) 10Chmarkine: "I see. Thanks for your explanation, JanZerebecki!" [puppet] - 10https://gerrit.wikimedia.org/r/199582 (owner: 10BBlack) [18:45:09] (03CR) 10Ottomata: [C: 032] Vanadium to keep logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/199957 (https://phabricator.wikimedia.org/T93185) (owner: 10Nuria) [18:48:27] !log replacing pfw1-codfw/pfw2/codfw [18:48:35] Logged the message, Master [18:50:23] ... why did my email to engineering-l get reposted? [18:50:27] * Coren confuses a bit. 
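[Editor's note: the sed-based rename paravoid floats above (rewriting txstatsd-style suffixes to statsite-style in exported dashboard JSON) could look like the sketch below. The suffix mapping is illustrative; the real list is tracked on T90111. Requires GNU sed for `\b`:]

```shell
# txstatsd uses .min/.max where statsd/statsite use .lower/.upper,
# and no two implementations agree on percentile names; map a few.
rename_metrics() {
    sed -e 's/\.min\b/.lower/g' \
        -e 's/\.max\b/.upper/g' \
        -e 's/\.99percentile\b/.p99/g'
}

renamed=$(printf '%s\n' 'ocg.render_time.max ocg.render_time.99percentile' | rename_metrics)
echo "$renamed"
```

In practice this would be piped over the dashboard JSON exported from elasticsearch, where grafana stores its dashboards, then re-imported.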
[18:51:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw1-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [18:52:53] (03PS3) 10Dzahn: add check_iostat nagios plugin [puppet] - 10https://gerrit.wikimedia.org/r/199915 (https://phabricator.wikimedia.org/T93783) [18:53:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw2-codfw:xe-6/0/0 {#10901} [10Gbps DF]BR [18:53:11] mutante: you rock :) [18:56:20] (03PS1) 10Negative24: Update uninstalled applications IDs [puppet] - 10https://gerrit.wikimedia.org/r/199966 [18:56:57] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [18:58:17] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:58:48] paravoid: oh:) [18:59:23] FYI, https://svn.wikimedia.org has an old cert. "The certificate expired on 01/31/2015 05:53 AM. The current time is 03/26/2015 02:51 PM." [19:00:04] bd808: Respected human, time to deploy Grant review (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T1900). Please do the needful. [19:00:44] Pennth: thanks for reporting. we have a ticket for it at https://phabricator.wikimedia.org/T88731 but we are also doing https://phabricator.wikimedia.org/T86655 [19:01:19] ok, ty. [19:02:36] tgr: ping [19:03:34] (03CR) 10Dzahn: [C: 032] add check_iostat nagios plugin [puppet] - 10https://gerrit.wikimedia.org/r/199915 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [19:06:08] PROBLEM - puppet last run on holmium is CRITICAL: Timeout while attempting connection [19:06:22] Do ops here also work on mediawiki.org? 
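[Editor's note: the expired-certificate report for svn.wikimedia.org can be verified with openssl; against the live server it would be `echo | openssl s_client -connect svn.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates`. The same inspection is shown here offline against a throwaway self-signed certificate:]

```shell
# Generate a short-lived self-signed cert, then print its validity
# window the same way you'd check a server's cert for expiry.
cert=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -keyout /dev/null \
    -subj "/CN=example.test" -days 1 -out "$cert" 2>/dev/null
dates=$(openssl x509 -noout -dates -in "$cert")
echo "$dates"
rm -f "$cert"
```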
The manual there has several links to svg, e.g.http://www.mediawiki.org/wiki/Manual:Hooks/UserLoadFromSession [19:06:57] PROBLEM - Host holmium is DOWN: PING CRITICAL - Packet loss = 100% [19:07:25] Pennth: mediawiki itself is a bit more in #wikimedia-dev but there is always overlap of people in channels [19:08:07] PROBLEM - Host labs-ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:08:50] andrewbogott: is holmium down = labs work? [19:09:03] mutante: yes, and it won’t break anything [19:09:06] I’m just working on it [19:09:08] i see it's designate. ok. thanks [19:13:37] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms [19:13:56] RECOVERY - Host labs-ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [19:14:13] urandom gwicke the additional cassandra metrics from the test cluster are going to fill up graphite's disk btw [19:14:17] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:14:58] can you please not send those? [19:15:52] godog: it's a staging cluster, so the goal is to have everything identical to prod [19:16:46] removing the metrics there would require some puppet changes [19:16:59] we should reclaim those boxes actually [19:17:20] it was a hack to begin with, it's starting to become a problem [19:17:24] paravoid: fine with me as long as we have some other staging environment [19:17:26] you should really use labs for that kind of testing :) [19:17:34] labs isn't staging [19:17:46] well that's what the rest of the org uses for staging [19:18:02] maybe the new staging cluster could fill that gap, we'll see [19:18:04] there is even a project called "staging" in it [19:18:16] depends on how close it actually is [19:18:23] what's missing from labs/ [19:18:24] it is certainly useless for perf testing [19:19:09] ok, do we actually need perf testing right now? 
[19:19:29] we can't use prod for that any more [19:19:45] as stressing prod does affect client latency [19:20:10] how/when do you actually stress test? [19:20:34] for example, we are looking into using lzma compression in cassandra [19:21:03] and we are also looking into the effect of compression block sizes on compression ratios [19:21:35] vs. read latencies [19:22:09] ori: pong [19:22:16] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [19:22:42] gwicke: I've already pointed out that's too many metrics, can we reduce it? [19:23:15] well the current status quo can't continue, we should discuss it with the Labs team [19:23:17] godog: maybe, urandom knows more about which we absolutely need & which are optional [19:23:23] maybe keep it as hardware but move it under Labs [19:24:03] "CRITICAL: Puppet last ran 15 days ago" is really a big problem [19:24:28] is that currently the case? [19:24:35] last I knew puppet was enabled there [19:24:49] ah no, that's ruthenium [19:24:57] :/ [19:25:21] that would be a good candidate for Labs as well [19:25:40] so you guys can have root and your own puppetmaster and everything [19:25:51] and so that we don't get exposed to all that [19:25:57] I'll file a ticket in a bit [19:25:58] own puppetmaster is actually a pain [19:26:17] doable, but not much fun to set up [19:26:21] with salt etc in the mix [19:26:48] if we can get hardware in labs then that would be fine too [19:27:10] especially if somebody else manages a prod-like puppet / salt master [19:27:17] what do you mean someone else? [19:27:22] this is all puppetized & documented [19:27:25] (and optional, obviously) [19:27:34] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [19:27:48] tgr: I'm going to update your patch to be much smaller than it is currently. If you don't like it, you can revert the change to the previous change-set. is that ok? 
[19:27:52] right, I documented the full process at https://wikitech.wikimedia.org/wiki/Labs_node_setup [19:28:11] you can keep relying on the labs puppetmaster (i.e. production) to which you can't commit [19:28:19] or you can set up a new puppetmaster to experiment with [19:28:30] and then push changes into prod [19:28:36] ori, sure [19:28:41] paravoid: issue is hardware primarily [19:28:43] I thought the latter would be preferrable, but I don't mind either way :) [19:28:52] although I'm not sure what's left in it to omit apart from doc/tests [19:28:53] gwicke, godog: I'm not sure how to answer that question when it is framed as "need". [19:29:11] yeah what I'd like to see happen would be to have a hardware node in the Labs domain [19:29:17] not sure how trivial it would be [19:29:32] that said, what you /can/ do without hardware, you should [19:29:43] if there are any metrics that are truly extraneous, it's so few it won't make a difference [19:29:49] marko didn't seem to know about the Labs clusters you were talking about the other day [19:29:52] (03PS1) 10Dzahn: have sysstat and bc installed on list servers [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) [19:30:06] urandom: e.g. 
the system CF stats, basically what's in https://phabricator.wikimedia.org/T78514 [19:30:11] paravoid: he actually updated the deployment-prep instances [19:30:23] (03PS2) 10Dzahn: ensure sysstat and bc installed on list servers [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) [19:30:39] not talking about deployment-prepe [19:30:48] just a separate labs cluster with puppetmasters and such [19:30:52] puppetmaster* [19:30:54] godog: yeah, i wouldn't label those as extraneous either [19:31:04] paravoid: to us those two are basically the same thing [19:31:14] both provide kvm instances [19:31:27] godog: but it sounds like there is only room to have so many, so we should get rid of some, whether they are needed or not [19:31:30] useful for testing functionality, not so much for perf / scale [19:31:56] the issue with running our own project is that we also have to run our own logstash, statsd, graphite etc [19:32:07] so deployment-prep is easier [19:32:29] urandom: yeah there's some stats on how much space they use on the ticket, anyways I'm off [19:33:14] !log Applied schema changes to iegreview@m2-master.eqiad.wmnet for T92391 [19:33:20] Logged the message, Master [19:33:31] (03CR) 10Dzahn: "this is for https://gerrit.wikimedia.org/r/#/c/199915/" [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [19:33:31] godog, ttyl! [19:33:39] (03CR) 10Gage: [C: 031] ensure sysstat and bc installed on list servers [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [19:34:00] ori: have you looked at https://gerrit.wikimedia.org/r/#/c/199789 ? 
[19:34:01] !log Updated iegreview to 7797bfc (Change email address used for sending out grants-related mails) for T92391 [19:34:05] Logged the message, Master [19:34:13] I'd really rather not cram all that stuff into mediawiki.js [19:34:24] which is already a monster [19:35:14] plus I am still not sure some of the async stuff won't need to be readded later [19:35:42] greg-g: hey! we missed the Bouncehandler deploy window yesterday ( crossed ~2am here ). Can we have that one shifted to next available day ? [19:36:06] tonythomas: when ya want it? [19:36:21] the next available day. hope Jeff_Green would be around ! [19:36:36] having a terrible exam week over here :\ [19:36:37] That would be Monday [19:37:04] wow. that would be great. Jeff_Green : you will be around, right ? [19:37:24] yep monday is fine [19:37:46] yay ! [19:38:32] tonythomas: Jeff_Green what support do you need from my team? can you two JFDI on Monday? [19:39:11] greg-g: of course. hope Jeff_Green has his test mail server at trouser.org ready :) [19:39:54] i'm not familiar with the mechanics of deploys but I can do log squinting support etc [19:41:12] ya'll need a config change sync'd at least, any code updates tonythomas ? [19:41:37] nope. the only one we had was backported and synced last-last week [19:42:12] coolio, then as long as the patch is written then I bet Jeff_Green can figure out how to sync-file :) [19:42:25] its just a +2 away ! [19:42:40] greg-g: possibly maybe :-P [19:42:48] :) ping us/me if you need help :) [19:43:05] ok [19:43:20] tgr: I don't think we need https://gerrit.wikimedia.org/r/#/c/199789/, to be honest. Errors shouldn't happen, and if they happen we should fix them fast. Issuing the user a "case number" for the error is silly, IMO. [19:44:31] for one thing, what should happen and what actually will are quite different in this case [19:44:43] or are you volunteering to fix all the gadets? 
:) [19:45:39] but error ids are not meant to support or justify keeping bugs around for long [19:46:06] they are simply tools to connect error reports made by users to exceptions [19:46:52] there are going to be errors that are nontrivial to reproduce and impossible to fix just based on an exception log [19:47:06] which, for many browsers, won't even include stack traces [19:47:17] IMO, the correct way to do this is to ship something minimal and safe, and extend it as appropriate based on the experience we acquire [19:47:42] the patch in its current state brings us from operating blindly to having an ability to track errors [19:47:55] let's not pile things on it that we think will be useful later [19:47:58] agreed, I just don't see how stripping all the tests and documentation and inlining everything makes it either more minimal or more safe :) [19:48:16] we have machines for minifying code [19:48:31] i kept the doc-block describing the event [19:48:49] the rest was just sprawl; i don't think the patchset as it exists now needs tests [19:50:44] ori: why not? [19:50:53] what would you test? [19:50:58] !log praseodymium - log in via mgmt, run puppet to restore flushed iptables rules [19:51:06] it's nontrivial functionality, it has at least three separate contracts [19:51:06] Logged the message, Master [19:51:43] calling mw.track, calling the old handler and preserving its return value, functioning correctly when there is no old handler [19:53:15] OK, I think that's not necessary, but it wouldn't harm to test that [19:53:37] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [19:54:20] !log praseodymium - fix firewalling [19:54:25] Logged the message, Master [19:56:03] tgr: it's your patch, and your call -- amend it as you see fit (and reject my changes if you like). 
The gist of my advice here is to seek reliability by zealously avoiding complexity, slashing away anything that is even a little bit iffy, rather than by trying to solve all problems comprehensively. [19:56:16] 6operations, 6Zero, 6Zero-Team: mdot/zerodot webroot Accept-Language redirects for zero-rated access - https://phabricator.wikimedia.org/T1229#1154304 (10Yurik) [19:57:30] gwicke: it's back, i opened it up so it could talk to puppet again, started ferm service, ran puppet again to restore rules [19:57:57] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [19:59:17] ori: most of the slashing makes sense [19:59:41] do you have strong feelings about adding a separate file, though? [19:59:42] tgr: i do agree with you that tests for the things you mentioned would be better [20:00:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw1-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [20:00:22] it needs a public method for the tests, and I really don't want to add another chunk of independent functionality to the already-huge mediawiki.js [20:00:57] fwiw it's not separate module, so there is not much overhead apart from the self-executing function wrapper [20:01:36] window.onerror is already public, no? :) [20:01:39] gwicke: please no manual iptables because it conflicts with ferm. are you at office? 
you have a meeting apparently [20:01:45] and an extra request in debug mode, but as I understand it debug mode is goiing to go away anyway [20:01:48] separate file is fine [20:01:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [20:02:04] it is, the self-executing function you are using is not, though [20:02:48] besides, messing with window.onerror in a test is probably unhealthy since the test framework relies on it to detect errors [20:02:56] (03CR) 10Dzahn: [C: 032] ensure sysstat and bc installed on list servers [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [20:02:59] (03CR) 10CSteipp: [C: 031] [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [20:05:35] ^d: how are we supposed to push to gerrit from tin then? [20:06:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw1-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [20:07:07] <^d> legoktm: https! 
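[Editor's note: the three contracts tgr lists above — calling mw.track, invoking any pre-existing handler and preserving its return value, and working when no old handler exists — can be sketched as a wrapper like this. The wiring is a guess at Gerrit change 199789, not the actual mediawiki.js code; `track` stands in for `mw.track`:]

```javascript
// Wrap window.onerror: report the error via track(), then defer to
// whatever handler was installed before. The old handler's return
// value matters: returning true suppresses the browser's default
// error reporting, so it must be passed through unchanged.
function wrapOnError( global, track ) {
	var oldHandler = global.onerror;
	global.onerror = function ( message, url, line ) {
		track( 'global.error', { errorMessage: message, url: url, lineNumber: line } );
		if ( oldHandler ) {
			return oldHandler.apply( this, arguments );
		}
		return false; // no previous handler: keep default reporting
	};
}
```

Whether the wrapper should ever suppress errors itself was part of the open design discussion; this sketch stays conservative and never returns true on its own.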
[20:07:18] <^d> (which should also be rare) [20:08:03] alright [20:09:18] (03CR) 10Dzahn: "sodium already had these, so noop, but puppetized it for future servers" [puppet] - 10https://gerrit.wikimedia.org/r/199979 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [20:09:27] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [20:09:35] (ignore those) [20:14:27] PROBLEM - Host mw2166 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:45] (03PS1) 10Thcipriani: Trebuchet group wikidev; mw-staging owner mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/199988 (https://phabricator.wikimedia.org/T94054) [20:15:24] !log set email for User:ThistleDew172@enwiki and attached to global [20:15:30] Logged the message, Master [20:15:47] RECOVERY - Host mw2166 is UP: PING OK - Packet loss = 0%, RTA = 43.17 ms [20:16:03] <^d> legoktm: So if you're doing those one by one how long will it take you to finish SUL finalization? ;-) [20:18:25] re: SUL all mutante users are me, except the one on es. and pt. wp and those are very inactive [20:18:27] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:18:40] ^d: these are just the special ones, 99% of the people just need to go to Special:MergeAccount, type in their password and everything will be ok. [20:19:26] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:19:43] <^d> legoktm: Yes, I'm making fun of you :p [20:20:17] :P [20:20:20] oh, the mutante on es.wp has been usurped i see.. maybe i can finish it ? 
checking [20:21:24] omg, i'm unified already :) [20:21:44] (03PS1) 10BBlack: strongswan: divert sysvinit script, order pkg inst deps [puppet] - 10https://gerrit.wikimedia.org/r/199990 [20:23:06] (03CR) 10BryanDavis: [C: 031] "I'd suggest trying it out as a cherry-pick on deployment-salt to make sure things work as expected. You'll need a manual migration step to" [puppet] - 10https://gerrit.wikimedia.org/r/199988 (https://phabricator.wikimedia.org/T94054) (owner: 10Thcipriani) [20:23:08] (03PS2) 10BBlack: strongswan: divert sysvinit script, order pkg inst deps [puppet] - 10https://gerrit.wikimedia.org/r/199990 [20:23:10] (03PS1) 10Dzahn: install check_iostat on list server [puppet] - 10https://gerrit.wikimedia.org/r/199991 (https://phabricator.wikimedia.org/T93783) [20:23:25] (03CR) 10jenkins-bot: [V: 04-1] install check_iostat on list server [puppet] - 10https://gerrit.wikimedia.org/r/199991 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [20:23:55] (03PS2) 10Dzahn: install check_iostat on list server [puppet] - 10https://gerrit.wikimedia.org/r/199991 (https://phabricator.wikimedia.org/T93783) [20:24:54] (03CR) 10BBlack: [C: 032 V: 032] strongswan: divert sysvinit script, order pkg inst deps [puppet] - 10https://gerrit.wikimedia.org/r/199990 (owner: 10BBlack) [20:26:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:26:08] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero-Team, 7Varnish: Some traffic is not identified as Zero in Varnish - https://phabricator.wikimedia.org/T88366#1154404 (10Yurik) [20:26:47] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:27:10] (03PS1) 10Andrew Bogott: Use the custom nova_fixed_multi handler for sink. 
[puppet] - 10https://gerrit.wikimedia.org/r/199995 (https://phabricator.wikimedia.org/T93928) [20:27:13] 6operations, 6Collaboration-Team, 6Editing, 6Engineering-Community, and 15 others: Create team projects for all teams participating in scrum of scrums - https://phabricator.wikimedia.org/T1211#1154422 (10Yurik) [20:27:36] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw2-codfw:xe-6/0/0 {#10901} [10Gbps DF]BR [20:28:01] 6operations, 6Collaboration-Team, 6Editing, 6Engineering-Community, and 14 others: Create team projects for all teams participating in scrum of scrums - https://phabricator.wikimedia.org/T1211#20962 (10Yurik) [20:28:04] (03CR) 10jenkins-bot: [V: 04-1] Use the custom nova_fixed_multi handler for sink. [puppet] - 10https://gerrit.wikimedia.org/r/199995 (https://phabricator.wikimedia.org/T93928) (owner: 10Andrew Bogott) [20:28:36] (03CR) 10Dzahn: [C: 032] install check_iostat on list server [puppet] - 10https://gerrit.wikimedia.org/r/199991 (https://phabricator.wikimedia.org/T93783) (owner: 10Dzahn) [20:29:26] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [20:29:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [20:30:22] (03PS2) 10Andrew Bogott: Use the custom nova_fixed_multi handler for sink. [puppet] - 10https://gerrit.wikimedia.org/r/199995 (https://phabricator.wikimedia.org/T93928) [20:30:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [20:31:37] RECOVERY - Host curium is UP: PING OK - Packet loss = 0%, RTA = 3.11 ms [20:31:43] (03CR) 10Andrew Bogott: [C: 032] Use the custom nova_fixed_multi handler for sink. 
[puppet] - 10https://gerrit.wikimedia.org/r/199995 (https://phabricator.wikimedia.org/T93928) (owner: 10Andrew Bogott) [20:33:15] bblack: would you mind taking a look at the update i made last night on the vcl? it's https://gerrit.wikimedia.org/r/198805 [20:33:27] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: puppet fail [20:33:28] yurik: ^ any chance we could try this on the beta cluster? [20:33:31] ^^ [20:33:48] dr0ptp4kt, sec [20:33:49] yurik: not introduction of a critical puppet fail, the vcl...the *working* vcl :) [20:34:35] dr0ptp4kt: already did and updated it further [20:34:43] (03PS1) 10Eevans: trim list of Cassandra metrics [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/199998 (https://phabricator.wikimedia.org/T78514) [20:34:50] (03PS1) 10Dzahn: lists: fix duplicate definition of systat package [puppet] - 10https://gerrit.wikimedia.org/r/199999 [20:35:01] bblack: i shoulda known [20:35:07] PROBLEM - DPKG on curium is CRITICAL: Connection refused by host [20:35:16] PROBLEM - RAID on curium is CRITICAL: Connection refused by host [20:35:16] I think it will work now, just needs testing to confirm [20:35:27] PROBLEM - Disk space on curium is CRITICAL: Connection refused by host [20:35:33] (03CR) 10Dzahn: [C: 032] lists: fix duplicate definition of systat package [puppet] - 10https://gerrit.wikimedia.org/r/199999 (owner: 10Dzahn) [20:35:50] heh why is curium even in icinga alerts? ignore that [20:35:57] PROBLEM - configured eth on curium is CRITICAL: Timeout while attempting connection [20:36:07] PROBLEM - dhclient process on curium is CRITICAL: Timeout while attempting connection [20:36:16] 6operations, 10ops-codfw, 5Patch-For-Review: rack/onsite setup of ganeti2001-2006 - https://phabricator.wikimedia.org/T91977#1154511 (10Papaul) Rack table update physical label in place mgmt and BIOS settings complete test complete ganeti2001 10.193.2.165 ge-1/0/7 B1 ganeti2002 10.193.2.166 ge-1/0/8 B1 g...
[20:36:17] PROBLEM - salt-minion processes on curium is CRITICAL: Timeout while attempting connection [20:37:57] bblack: that's awesome! i'm looking [20:38:27] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:41:10] (03PS1) 10BBlack: fix exec: s/cmd/command/ [puppet] - 10https://gerrit.wikimedia.org/r/200003 [20:41:28] 6operations, 10ops-codfw, 5Patch-For-Review: rack/onsite setup of ganeti2001-2006 - https://phabricator.wikimedia.org/T91977#1154671 (10Papaul) 5Open>3Resolved complete [20:41:29] 6operations: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1154673 (10Papaul) [20:41:30] (03CR) 10BBlack: [C: 032 V: 032] fix exec: s/cmd/command/ [puppet] - 10https://gerrit.wikimedia.org/r/200003 (owner: 10BBlack) [20:44:22] bblack: that looks great! and i prefer non-escaped question marks in character classes like you did [20:45:59] (03CR) 10Dr0ptp4kt: [C: 031] "This is looking really good. Chatting with @yurik in order to try this on the beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [20:46:17] RECOVERY - configured eth on curium is OK: NRPE: Unable to read output [20:46:26] RECOVERY - dhclient process on curium is OK: PROCS OK: 0 processes with command name dhclient [20:46:26] RECOVERY - salt-minion processes on curium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:46:51] (03CR) 10RobH: [C: 04-1] "You have a string of typos in the 10. file. You meant to put WMF6162, etc... and you have wtp6162, etc...."
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/199275 (owner: 10Papaul) [20:47:29] (03CR) 10RobH: add mgmt asset tag info for wtp200(1-20) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/199275 (owner: 10Papaul) [20:47:36] RECOVERY - Disk space on curium is OK: DISK OK [20:47:37] RECOVERY - DPKG on curium is OK: All packages OK [20:47:46] RECOVERY - RAID on curium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:48:38] dr0ptp4kt, sorry, had to step away for a sec [20:48:41] updating it now [20:48:53] dr0ptp4kt, do you want me to show you how? [20:50:01] yurik: on another call, so can't do a video call right now. would you be able to do a screen recording thing? [20:50:51] yurik: or even just the set of commands if you email them to me. i can go looking around at paths and that sort of thing to piece together /why/ it works. been a while since i setup a varnish server! (the interview task!) [20:51:05] but i can figure it out given the hosts and commands [20:53:21] dr0ptp4kt, mobile or text? [20:54:06] dr0ptp4kt, changed and restarted mobile, try it [20:54:08] yurik: both possible? then i can check the header in both places [20:54:15] yurik: i'll try mobile now [20:55:59] dr0ptp4kt, text-* is not even on that server - deployment-cache-mobile03 [20:56:26] yurik: that's okay. i do think the test on mobile technically is sufficient [20:56:34] yurik: it's working splendidly cc bblack [20:56:57] bblack, want to make that patch as part of the new global tagging file? :D [20:57:55] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1154884 (10Eevans) To summarize the conversation on #wikimedia-operations earlier today, there isn't enough Graphite storage... 
[20:58:32] (03PS2) 10RobH: raid10-gpt.cfg partman fixed [puppet] - 10https://gerrit.wikimedia.org/r/199647 (https://phabricator.wikimedia.org/T93113) [20:58:40] yurik: :P [20:58:46] (03CR) 10RobH: [C: 032] raid10-gpt.cfg partman fixed [puppet] - 10https://gerrit.wikimedia.org/r/199647 (https://phabricator.wikimedia.org/T93113) (owner: 10RobH) [20:59:12] dr0ptp4kt, let me know when i can reenable autosync [20:59:26] dr0ptp4kt: ok so we're tested and good, merge to prod? [20:59:26] (03CR) 10Dr0ptp4kt: "It worked on the beta cluster for mobile as well." [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [20:59:29] ok [20:59:33] yurik: bblack i think we're good to go [20:59:40] yurik: you can re-enable autosync [20:59:40] restoring [20:59:54] (03PS7) 10BBlack: Do not fragment cache with provenance parameter [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [21:00:01] (03CR) 10BBlack: [C: 032 V: 032] Do not fragment cache with provenance parameter [puppet] - 10https://gerrit.wikimedia.org/r/198805 (owner: 10Dr0ptp4kt) [21:00:03] bblack: yurik, as always, very thankful to get to work with you guys :) [21:00:08] no worries [21:00:17] what's francium? [21:00:35] (03PS1) 10Dzahn: lists: activate I/O monitoring on sodium [puppet] - 10https://gerrit.wikimedia.org/r/200013 [21:00:38] an element? :) [21:00:48] beat me to it [21:00:52] :) [21:01:17] the chemical element of atomic number 87, a radioactive member of the alkali metal group. Francium occurs naturally as a decay product in uranium and thorium ores.
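[The "Do not fragment cache with provenance parameter" change merged above is Varnish VCL (gerrit 198805); a JS analogue of what such a normalization does is sketched below. It strips a wprov=... query parameter before the URL would be hashed, so every provenance variant maps to one cache object. The exact regex in the deployed VCL may differ; this one is an assumption for illustration.]

```javascript
// Hedged sketch: normalize away the wprov query parameter so it does
// not fragment the cache. Not the deployed VCL, just the same idea.
function stripWprov(url) {
  return url
    .replace(/([?&])wprov=[^&]*&?/, '$1') // drop wprov=..., keep separator
    .replace(/[?&]$/, '');                // tidy a dangling ? or &
}
```

All variants of the same page then share one cache entry, e.g. `/wiki/Foo?wprov=sfla1` and `/wiki/Foo` hash identically.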
[21:02:11] a server that was once a blog server and is now a spare [21:02:13] paravoid: bblack and yurik are HEAVY metal [21:02:13] https://phabricator.wikimedia.org/rODNSc4539738dac7bca75b96c2475e055180ec206d26 [21:02:22] [21:02:47] 6operations: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155008 (10RobH) a:5RobH>3akosiaris I'm not really certain of a few things for this, as such, I'm going to assign this to Alex for input. - What will the first(primary) NIC route to, internal or external vlan? - These have... [21:02:49] PROBLEM - Auth DNS for labs pdns on labs-ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:02:56] well robh is setting it up as far as I can see [21:03:47] paravoid: ah, he allocated it for the HTML / zim dumps [21:03:55] https://phabricator.wikimedia.org/T91853#1129530 [21:05:30] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:01] (03PS1) 10RobH: setting ganeti2001-2006 dhcp lease entries [puppet] - 10https://gerrit.wikimedia.org/r/200014 (https://phabricator.wikimedia.org/T94042) [21:06:06] mutante: whats the problem? 
paravoid: francium was setup by me and is in a state for service implementation [21:06:32] (03PS1) 10Andrew Bogott: Fresh start: new domain ids for sink [puppet] - 10https://gerrit.wikimedia.org/r/200015 [21:06:32] its kind of up for grabs between gabriel or ariel to work out [21:06:39] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:39] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:40] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:50] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:51] i missed the pings cuz i was doing phab tasking and assumed it was that, sorry ;D [21:07:20] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [21:07:29] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: Puppet has 1 failures [21:07:29] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [21:07:30] (03CR) 10Andrew Bogott: [C: 032] Fresh start: new domain ids for sink [puppet] - 10https://gerrit.wikimedia.org/r/200015 (owner: 10Andrew Bogott) [21:07:32] robh: just answering the question what francium is [21:08:18] its in the post puppet run but no service implementation yet iirc [21:08:29] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [21:08:39] and it turns out gwicke says they need to have sudo to do things on it, so i advised then we need an access request ticket to do that [21:08:43] and it has to wait for monday meeting for review [21:09:09] RECOVERY - Auth DNS for labs pdns on labs-ns2.wikimedia.org is OK: DNS OK: 0.057 seconds response time. nagiostest.eqiad.wmflabs returns [21:09:09] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet has 1 failures [21:09:18] though im not sure why they need sudo, there isnt a ticket in yet for the request.
[21:09:49] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures [21:09:59] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:19] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:20] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155064 (10RobH) Feel free to assign back to me for the rest of the setup after questions are addressed. [21:10:29] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:38] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [21:10:59] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures [21:11:10] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 1 failures [21:11:18] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1155065 (10RobH) 5Open>3Resolved [21:11:18] 6operations, 10Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1155067 (10RobH) [21:11:49] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [21:11:51] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1097489 (10RobH) hardware has arrived, and linked T93113 is the deployment. the #hardware-request is resolved. 
[21:11:59] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [21:12:19] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [21:12:19] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [21:12:38] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [21:12:40] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures [21:13:19] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [21:13:27] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1155073 (10RobH) So this is testing and not production ready? (Not sure why sudo would be needed otherwise.) [21:13:39] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures [21:14:09] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [21:14:17] 6operations, 5Patch-For-Review: Access to francium - https://phabricator.wikimedia.org/T94093#1155079 (10GWicke) 3NEW a:3GWicke [21:14:26] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to francium - https://phabricator.wikimedia.org/T94093#1155079 (10GWicke) [21:14:29] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 1 failures [21:14:39] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [21:15:14] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1155098 (10GWicke) @RobH, created T94093 for the access. [21:15:30] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet has 1 failures [21:15:52] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155108 (10RobH) I'll note that T93932 states private IPs. I'd assume to just put them all into private1-b-codfw together. 
I just wanted to confirm that subnet is correct & get the partition/raid info. [21:16:08] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [21:16:30] gwicke: why is there a patch for review tag on that? [21:16:39] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [21:16:41] the access request [21:16:50] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures [21:16:50] 10Ops-Access-Requests, 6operations: Access to francium - https://phabricator.wikimedia.org/T94093#1155121 (10RobH) [21:17:06] robh: I think it's the helpful sub-ticket inheritance [21:17:13] yea, cool [21:17:24] who is 'us'? [21:17:32] in the we need to have shell access [21:17:36] sorry, who is 'we' just you? [21:17:46] 10Ops-Access-Requests, 6operations: Access to francium - https://phabricator.wikimedia.org/T94093#1155134 (10GWicke) [21:17:55] clarified: it's services [21:18:03] dude, i dunno who is in services... you cannot just list shell names for me? [21:18:08] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:10] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:21] i can look it up but meh... i rather folks provide this info ;D [21:18:26] 10Ops-Access-Requests, 6operations: Access to francium - https://phabricator.wikimedia.org/T94093#1155079 (10GWicke) [21:18:26] robh: done [21:18:30] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:36] cool, i'll note it on the next opsmeeting for sudo review [21:18:42] but i dunno how well it will fly [21:18:54] it didnt sound like this was in testing when we ordered the hardware [21:18:59] robh: if somebody from ops can do the setup then that's fine with me too [21:19:04] well, not so much that production would require sudo [21:19:11] is the setup puppetized? 
[21:19:20] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:19:26] not right now, but could be if somebody wants to do that [21:19:36] the nginx install is temporary though [21:19:37] so this wasnt tested in labs at all? [21:19:39] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [21:19:47] until there is disk space on the actual download boxes [21:19:49] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [21:19:57] i thought we had this tested and ready to go when you guys asked for hardware, was a bad assumption on my part [21:20:11] the htmldumper script has been tested a lot [21:20:19] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:19] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:19] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:20] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:25] we have not however set up a real dump server with a lot of disk space before [21:20:29] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:41] gwicke: yes, but couldnt the setup have been puppetized and tested in labs before now? [21:20:49] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:53] ignore all the amssq/cp puppet failures, it's the wprov patch and it's not critical [21:21:06] robh: could have, sure [21:21:08] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:11] and wasnt... 
dr0ptp4kt: I missed a tiny detail in reviewing that patch: it contains the VCL changes and a new VCL file, but not the puppet code to actually deploy the new file :) [21:21:19] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:26] this is a problem, and it makes me regret not having asked more in depth questions before we spent money ordering hardware... [21:21:29] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:29] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:39] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:39] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:44] we arent supposed to just give out sudo to live hack on bare metal in lieu of labs testing [21:21:50] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:09] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:11] robh: this is dumps, not something hit by every request [21:22:19] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:19] (03CR) 10Thcipriani: "Cherry pick works on deployment-salt as expected. No manual changes needed, all file owners/groups updated as expected." [puppet] - 10https://gerrit.wikimedia.org/r/199988 (https://phabricator.wikimedia.org/T94054) (owner: 10Thcipriani) [21:22:22] gwicke: so?
[21:22:23] and html dumps are something we are just starting to do [21:22:24] bblack: oh, ha [21:22:28] That doesnt mean we should skip labs testing [21:22:38] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:41] and spend money on things rather than taking the proper steps [21:22:45] test and puppetize a service in labs [21:22:47] then put on bare metal [21:22:51] it's not a service [21:22:56] it's just a cron script at this point [21:22:59] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Puppet has 1 failures [21:23:02] and an nginx to make those available [21:23:07] that doesnt change anything of what i said [21:23:14] if it can be puppetized and tested in labs, it should [21:23:19] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [21:23:20] it can't [21:23:28] you just said it could be puppetized in labs! [21:23:29] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 1 failures [21:23:34] at least not at the scale required to find the scale issues [21:23:48] yes but the puppetization in labs would eliminate a large reason for your need of sudo, right? [21:24:02] an ops person investing an hour can eliminate that too [21:24:09] (03PS1) 10BBlack: actually deployed the provenance VCL [puppet] - 10https://gerrit.wikimedia.org/r/200029 [21:24:35] (03PS2) 10Dzahn: lists: activate I/O monitoring on sodium [puppet] - 10https://gerrit.wikimedia.org/r/200013 (https://phabricator.wikimedia.org/T93783) [21:24:37] gwicke: and we do, but the point of labs is so you can do it as wel [21:24:40] (03CR) 10BBlack: [C: 032 V: 032] actually deployed the provenance VCL [puppet] - 10https://gerrit.wikimedia.org/r/200029 (owner: 10BBlack) [21:24:45] as such, im telling you now i'll be objecting to sudo for this. 
[21:24:49] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [21:24:58] robh: make it a no-sudo request then [21:25:13] but be prepared for us asking for help setting this up [21:25:22] then you'll need to point to a puppetized items to include in site.pp [21:25:31] or work with mark to get an ops person assigned to this [21:26:00] (03PS3) 10Dzahn: lists: activate I/O monitoring on sodium [puppet] - 10https://gerrit.wikimedia.org/r/200013 (https://phabricator.wikimedia.org/T93783) [21:26:09] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:26:45] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1155195 (10RobH) It seems that this service could have been tested in labs, but wasn't. So now services either needs to puppetize this for their use, request ops do so, or have sudo on the box. I ele... [21:27:18] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:27:55] i dunno the answer, dumps is ariel [21:28:09] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:28:14] gwicke: i think the answer is you, ariel, and i figure out who is going to do the puppetization work [21:28:19] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:28:20] though if you guys wanted sudo to do it on the host [21:28:25] not sure why you guys wouldnt do the same in labs [21:28:39] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:28:40] otherwise what else would you do with sudo? 
[21:28:49] robh: there are no labs nodes large enough to do dumps [21:29:09] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:29:10] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:10] I thought you stated you could puppetize the service using labs? [21:29:15] I'm not saying full load testing [21:29:18] I didn't [21:29:20] you said that [21:29:26] but it is of course possible [21:29:40] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:41] but at this point there isn't much to puppetize [21:29:58] gwicke: yes, but couldnt the setup have been puppetized and tested in labs before now? [21:30:00] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:30:01] then you answer robh: could have, sure [21:30:02] the specifics depend on partition layout etc [21:30:04] which we don't know [21:30:19] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:30:29] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:30:30] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:30:49] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:30:52] I'm also not sure if it's worth installing nginx via puppet if we are going to remove it a month from now or so [21:30:58] but that's up to you guys to decide [21:31:07] awlrite, Gather be upon us [21:31:13] i think we've made it clear that everything is always installed by puppet no? 
[21:32:00] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155224 (10akosiaris) >>! In T94042#1155008, @RobH wrote: > I'm not really certain of a few things for this, as such, I'm going to assign this to Alex for input. > > - What will the first(primary) NIC route... [21:32:23] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155225 (10akosiaris) a:5akosiaris>3RobH [21:32:28] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:32:34] akosiaris: cool, so uh, / only for the full disk? [21:32:39] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:32:44] robh: no [21:32:52] seemed odd, glad i asked [21:32:53] a small / (15G?) [21:33:07] but the rest in the VG [21:33:10] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:33:12] aka all in LVM [21:33:22] and the / in a small LV 15Gs [21:33:30] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:33:34] the rest of the LVM will be managed by ganeti [21:33:48] cool [21:33:48] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155228 (10RobH) all in lvm, small / (root) lvm, rest in lvm container to be managed by ganeti [21:33:56] i think i have the info i need for those then [21:33:58] thx! [21:33:59] btw IIRC there is some autonaming of the VG [21:34:10] like hostname-vg ? [21:34:18] yea i have not written shit for lvm in partman recipes... 
[21:34:19] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:34:22] it'll be an adventure ;D [21:34:29] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:34:29] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:34:36] yeah, I 'll probably have to research that part too a bit [21:34:38] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:34:38] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:34:44] (i got gpt and raid10 to work so this is just more info on top of that) [21:34:45] cause I need the same VG name for all machines [21:34:54] ahh, so we should set it manually... hrmm [21:34:55] but don't worry about that [21:35:00] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:35:00] duly noted, i'll look into it some too [21:35:09] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:35:11] but i'll get the rest of it all ready as well, so that'll be last thing i work on [21:35:14] since it'll take the longest [21:35:26] ok, sounds like a deal. I 'll work on it tomorrow as well [21:35:29] (that way if i get to slow and dont figure it out, you arent blocked by other shit) [21:35:40] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:35:40] that's what I wanted to ask for. Great [21:35:43] thanks! 
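[The ganeti host layout akosiaris and robh settle on above (everything in LVM, a small ~15G root LV, the rest of the VG left free for ganeti, with one fixed VG name across all machines instead of the hostname-vg autonaming) could be preseeded roughly as follows. This is an untested sketch, not the recipe robh ended up writing: the recipe name and the VG name `ganeti` are assumptions, and a real recipe would also need a /boot (and, on GPT, a bios_grub) partition handled outside the VG.]

```
# Hedged d-i preseed sketch: all-in-LVM with a fixed VG name and a
# small root LV, leaving the remaining VG space unallocated for ganeti.
d-i partman-auto/method string lvm
d-i partman-auto-lvm/new_vg_name string ganeti
d-i partman-auto-lvm/guided_size string 15GB
d-i partman-auto/expert_recipe string \
    ganeti-root ::                        \
        15000 15000 15000 ext4            \
            $lvmok{ } lv_name{ root }     \
            method{ format } format{ }    \
            use_filesystem{ } filesystem{ ext4 } \
            mountpoint{ / }               \
        .
```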
[21:35:49] you too, have a nice evening =] [21:35:50] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:36:19] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:36:27] gwicke: so yea, i guess we need to figure out the puppetization and who does that. it'll get discussed between you ariel and i hopefully before monday. if not though then monday meeting with sudo requests will generate this discussion then =] [21:36:29] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:36:30] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:36:30] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:36:39] but the ticket is now on my list for the meeting review [21:37:00] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:37:19] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:37:40] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:37:49] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:38:09] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:38:10] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:38:29] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:38:29] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 50
seconds ago with 0 failures [21:39:09] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:39:15] !log Created Gather tables on test and test2 [21:39:20] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:39:21] Logged the message, Master [21:39:29] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:40:29] (03CR) 10MaxSem: [C: 032] Deploy Gather in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199950 (owner: 10MaxSem) [21:40:30] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:41:10] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:41:29] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:41:53] (03CR) 10RobH: [C: 032] setting ganeti2001-2006 dhcp lease entries [puppet] - 10https://gerrit.wikimedia.org/r/200014 (https://phabricator.wikimedia.org/T94042) (owner: 10RobH) [21:42:29] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:42:30] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:43:09] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:04] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Add missing source languages [puppet] - 10https://gerrit.wikimedia.org/r/199931 (owner: 10KartikMistry) [21:44:08] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:44:29] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [21:44:29] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:49] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:45:00] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:45:11] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:45:11] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [21:45:18] (03PS1) 10RobH: setting ganeti2001-2006 dns entries [dns] - 10https://gerrit.wikimedia.org/r/200033 (https://phabricator.wikimedia.org/T94042) [21:45:49] (03CR) 10jenkins-bot: [V: 04-1] setting ganeti2001-2006 dns entries [dns] - 10https://gerrit.wikimedia.org/r/200033 (https://phabricator.wikimedia.org/T94042) (owner: 10RobH) [21:45:59] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:46:19] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:47:25] (03PS2) 10RobH: setting ganeti2001-2006 dns entries [dns] - 10https://gerrit.wikimedia.org/r/200033 (https://phabricator.wikimedia.org/T94042) [21:47:59] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:00] (03CR) 10MaxSem: [V: 032] "Oh come on, I see all tests passed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199950 (owner: 10MaxSem) [21:49:46] (03CR) 10RobH: [C: 032] setting ganeti2001-2006 dns entries [dns] - 10https://gerrit.wikimedia.org/r/200033 (https://phabricator.wikimedia.org/T94042) (owner: 10RobH) [21:50:28] !log maxsem Started scap: Enable Gather [21:50:33] Logged the message, Master [21:51:10] oh, we haz Ganeti boxes already? 
:) [21:52:48] bblack, i ran puppet update on betalabs, and got this... any idea? [21:52:49] Error: Could not start Service[nginx]: Execution of '/etc/init.d/nginx start' returned 1: [21:52:49] Error: /Stage[main]/Nginx/Service[nginx]/ensure: change from stopped to running failed: Could not start Service[nginx]: Execution of '/etc/init.d/nginx start' returned 1: [21:53:12] (03PS15) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [21:53:21] bd808, ^^ [21:53:30] yurik: yeah it's been like that since longer than betalabs syslogs go back, and nobody noticed till I pointed it out yesterday. [21:53:42] lol [21:53:56] i noticed )) [21:54:01] undoubtedly it's related to the cert/nginx/etc refactors of the past month or two, but without a start date to point at it's going to be funner finding which change broke it [21:54:11] yurik: and pinged me for fun? [21:54:12] I haven't really dug into it yet [21:54:35] bd808, always wanted to say hi to you :) plus you knew about puppets yesterday, thought you might have an idea [21:54:36] Oh the ssl terminators in beta have been broken since we moved labs to eqiad [21:54:52] They have no certs [21:54:59] they seem to be failing on something to do with deploying the fake private key for star.wmflabs.org [21:55:10] well s/fake/locally-generated-and-not-really-secret/ [21:55:16] I think? [21:55:18] There were certs that ops had setup in pmtpa but that was never replicated in eqiad [21:55:34] so there is no ssl for the beta cluster [21:55:42] ok [21:56:06] It would be nice to have a self-signed cert for things but nobody has ever cared enough to make it happen [21:56:30] at one point the plan was to get real certs but that got blown up too [21:56:42] the ugly history is in bug...phab [21:56:52] https://phabricator.wikimedia.org/T50501 [21:56:54] that [21:57:00] yeah. too long and boring to read [21:57:09] tl;dr no ssl for beat cluster [21:57:12] *beta [21:57:17] robh: ok, thx.
We don't have a lot of time to spare for dumps right now so would appreciate help from ops on puppetization / setup. [21:57:19] Blocks T53494: Use Beta cluster as a true canary for code deployments (tracking) [21:58:01] it actually claims that A) We have self-signed certs in place on Beta [21:58:13] that was about getting not self signed [21:58:21] we did in pmtpa [21:58:30] but they didn't move to eqiad [21:58:37] and only about money it seems [21:59:01] We've opened up root in beta cluster now too which would blow up the idea again [21:59:29] even a locally-generated self-signed cert would be better than just "borked config, borked puppet run, no SSL at all" [21:59:49] well, we do it for *.wmflabs.org for the proxy too, right [21:59:57] (putting key on instance, just locked down instance) [22:00:04] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T2200). [22:02:39] i wonder how much [22:02:53] for real certs [22:03:11] hmm, sync-proxies looks slowwwishhhhh [22:03:22] (to say the very least) [22:03:37] (03PS1) 10Cscott: Collection: Remove deprecated $wgCollectionHierarchyDelimiter configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200037 [22:03:38] (03PS1) 10Cscott: Collection/OCG: Turn on plain text output format in Book Creator. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 [22:03:45] also, we have a Labs CA [22:04:11] or do we [22:07:44] (03PS1) 10BBlack: repool cp1047 in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/200041 [22:07:47] (03PS1) 10BBlack: add cp1008 to ipsec test host set [puppet] - 10https://gerrit.wikimedia.org/r/200042 [22:07:58] (03CR) 10BBlack: [C: 032 V: 032] repool cp1047 in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/200041 (owner: 10BBlack) [22:08:07] (03CR) 10BBlack: [C: 032 V: 032] add cp1008 to ipsec test host set [puppet] - 10https://gerrit.wikimedia.org/r/200042 (owner: 10BBlack) [22:09:05] ugh [22:09:34] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155361 (10RobH) [22:10:11] 6operations, 5Patch-For-Review: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1153480 (10RobH) [22:10:30] 6operations: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1153480 (10RobH) [22:10:50] (03PS1) 10BBlack: fix yaml syntax, no trailing commas! [puppet] - 10https://gerrit.wikimedia.org/r/200045 [22:11:06] (03CR) 10BBlack: [C: 032 V: 032] fix yaml syntax, no trailing commas! [puppet] - 10https://gerrit.wikimedia.org/r/200045 (owner: 10BBlack) [22:15:08] 6operations: setup/deploy ganeti2001-2006 - https://phabricator.wikimedia.org/T94042#1155396 (10RobH) [22:15:42] bblack: that's probably going to be common because in puppet we are supposed to add them all the time [22:17:53] (03CR) 10Dzahn: [C: 032] "OK - I/O stats: Transfers/Sec=29.40 Read Requests/Sec=0.50 Write Requests/Sec=18.70 KBytes Read/Sec=3.20 KBytes_Written/Sec=219.30" [puppet] - 10https://gerrit.wikimedia.org/r/200013 (owner: 10Dzahn) [22:21:30] !log maxsem Finished scap: Enable Gather (duration: 31m 02s) [22:21:39] Logged the message, Master [22:22:06] MaxSem: what was the slow part of that scap? l10n? [22:22:11] or syncing out to the hosts? 
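The stopgap bblack suggests above ("even a locally-generated self-signed cert would be better than just borked config ... no SSL at all") can be sketched with a single openssl invocation. This is purely illustrative: the filenames, key size, and lifetime are assumptions, and nothing here corresponds to an actual puppet role or the real star.wmflabs.org deployment.

```shell
# Hypothetical sketch of a locally-generated, self-signed cert for
# *.wmflabs.org. -nodes leaves the key unencrypted so nginx can load
# it unattended; -x509 makes a self-signed cert instead of a CSR.
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout star.wmflabs.org.key \
    -out star.wmflabs.org.crt \
    -days 365 \
    -subj '/CN=*.wmflabs.org'

# Inspect the subject of what was generated
openssl x509 -in star.wmflabs.org.crt -noout -subject
```

A cert like this still trips browser warnings, of course; it only restores encrypted transport for the beta cluster, not trust.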
[22:22:32] greg-g, actually, it seems to work ok mostly [22:23:05] cool [22:26:11] (03CR) 10MaxSem: [C: 032] Enable Gather on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199953 (owner: 10MaxSem) [22:27:06] 6operations, 7Mail, 7Monitoring, 5Patch-For-Review: Mailing lists alerts - https://phabricator.wikimedia.org/T93783#1155445 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=sodium https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sodium&service=mailman+I%2FO... [22:35:44] !log maxsem Synchronized php-1.25wmf22/extensions/Gather/: (no message) (duration: 00m 08s) [22:35:49] Logged the message, Master [22:35:58] !log maxsem Synchronized php-1.25wmf23/extensions/Gather/: (no message) (duration: 00m 07s) [22:36:03] Logged the message, Master [22:36:19] 6operations, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1155506 (10Spage) http://git.wikimedia.org/ is giving me a lightly-styled Apache HTML error page Internal error Return to home page It's HTTP status 500 (see below), not status 503 like task d... [22:37:08] ^ git.wikimedia.org is dead, but e.g. https://git.wikimedia.org/summary/mediawiki%2Fcore.git is OK [22:38:00] just use github [22:38:09] (03Merged) 10jenkins-bot: Enable Gather on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199953 (owner: 10MaxSem) [22:38:18] just use phabricator [22:38:20] :P [22:39:40] !log maxsem Synchronized wmf-config/InitialiseSettings.php: Gather on test & test2 (duration: 00m 07s) [22:39:46] Logged the message, Master [22:43:00] 6operations, 10ops-codfw: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1155515 (10Cmjohnson) All of codfw and eqiad pdus have been added to librenms [22:43:02] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1155516 (10Mattflaschen) >>!
In T93708#1146917, @Springle wrote: > Actually, I think the `research` grant might be slightly wrong too for X1 as it uses `%wiki%` rather than `%wik%` (no `i... [22:44:30] greg-g, MaxSem: the only reason I visited is our Gerrit wiki page says "To make an anonymous git clone of core MediaWiki you can clone from https://gerrit.wikimedia.org/r/p/mediawiki/core.git or https://git.wikimedia.org/git/mediawiki/core.git". I'm removing the second one. [22:47:03] yup, gitblit must die [22:47:30] MaxSem: I thought ^demon|away installed gitblit because "gitweb must die" :) [22:47:36] also, we seem to be done, greg-g [22:47:57] and then phabricator because... [22:48:12] the Gerrit page says 'To simply browse & fork our code you can use the GitHub mirror'. Where's marktraceur to fight the proprietary... [22:48:15] next thing you know we're installing gitlab [22:49:08] * ^demon|away ignores all the git talk [22:49:30] git'r'done [22:51:23] * greg-g puts on his camo hat [22:51:42] <^demon|away> grrrit-wm: We could've had our offsite in the woods somewhere [22:51:45] <^demon|away> greg-g: ^ [22:51:50] <^demon|away> grrrit-wm don't care [22:52:20] ^demon|away: next year, we'll call it our cost saving measures [22:53:03] ^demon|away: FYI I updated valhallasw's https://www.mediawiki.org/wiki/Gerrit/GitHub page to explain we mirror, but don't support pull requests (until YuviPanda|flight touches down :). Does that sound right? [22:54:44] <^demon|away> spagewmf: without clicking yes :p [22:55:21] thx [22:58:28] jouncebot, next [22:58:30] In 0 hour(s) and 1 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T2300) [22:58:44] I won't be able to help tonight [22:59:43] i can push em out [23:00:04] RoanKattouw, ^d, Krenair, ebernhardson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150326T2300). [23:00:18] Krinkle: ready for swat?
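The x1-analytics-slave grant fix quoted above (T93708) hinges on MySQL treating a backtick-quoted database name in a GRANT as a pattern, where `%` is a wildcard: `%wiki%` matches enwiki or dewiki but not fawiktionary (which contains "wikt", not "wiki"), while `%wik%` matches both families. A minimal sketch; the account name and host pattern below are invented for illustration, not the real `research` grant:

```sql
-- Hypothetical sketch of the pattern discussed in T93708 above.
-- In GRANT, % inside a backtick-quoted db name is a wildcard:
--   `%wiki%`.*  matches enwiki, dewiki, ...  but NOT fawiktionary
--   `%wik%`.*   matches both *wiki* and *wiktionary* databases
GRANT SELECT ON `%wik%`.* TO 'research'@'10.64.%';
```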
[23:09:28] Krinkle: around? I'll need you to test the RL changes in swat [23:09:38] ebernhardson: Testing [23:09:49] its not out yet, just making sure your here before i go merging :) [23:09:52] merging now [23:09:55] OK :) [23:19:54] !log ebernhardson Synchronized php-1.25wmf23/extensions/Flow/: Bump flow submodule for 1.25wmf23 (duration: 00m 09s) [23:20:01] Logged the message, Master [23:21:21] PROBLEM - Host labs-ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:21:37] Krinkle, ^ [23:21:54] !log ebernhardson Synchronized php-1.25wmf23/extensions/Flow/: Bump flow submodule for 1.25wmf23 (duration: 00m 09s) [23:21:59] * ebernhardson forgot to git submodule update ... i guess its been a little while :) [23:22:01] PROBLEM - Host holmium is DOWN: PING CRITICAL - Packet loss = 100% [23:22:40] og [23:23:01] PROBLEM - puppet last run on xenon is CRITICAL: Connection refused by host [23:23:14] !log ebernhardson Synchronized php-1.25wmf22/extensions/Flow: Bump flow submodule in 1.25wmf22 for swat (duration: 00m 08s) [23:23:20] Logged the message, Master [23:23:31] PROBLEM - configured eth on xenon is CRITICAL: Connection refused by host [23:23:40] PROBLEM - configured eth on cerium is CRITICAL: Connection refused by host [23:23:41] PROBLEM - DPKG on xenon is CRITICAL: Connection refused by host [23:23:51] PROBLEM - dhclient process on cerium is CRITICAL: Connection refused by host [23:23:51] PROBLEM - configured eth on praseodymium is CRITICAL: Connection refused by host [23:23:51] PROBLEM - dhclient process on praseodymium is CRITICAL: Connection refused by host [23:23:52] PROBLEM - puppet last run on cerium is CRITICAL: Connection refused by host [23:23:52] PROBLEM - DPKG on praseodymium is CRITICAL: Connection refused by host [23:24:01] PROBLEM - RAID on cerium is CRITICAL: Connection refused by host [23:24:02] PROBLEM - salt-minion processes on praseodymium is CRITICAL: Connection refused by host [23:24:02] PROBLEM - Disk space on praseodymium is 
CRITICAL: Connection refused by host [23:24:02] PROBLEM - Disk space on xenon is CRITICAL: Connection refused by host [23:24:11] PROBLEM - salt-minion processes on xenon is CRITICAL: Connection refused by host [23:24:12] PROBLEM - RAID on xenon is CRITICAL: Connection refused by host [23:24:20] PROBLEM - Cassandra database on cerium is CRITICAL: Connection refused by host [23:24:21] PROBLEM - SSH on xenon is CRITICAL: Connection refused [23:24:32] PROBLEM - SSH on cerium is CRITICAL: Connection refused [23:24:32] PROBLEM - Cassandra database on xenon is CRITICAL: Connection refused by host [23:24:32] PROBLEM - dhclient process on xenon is CRITICAL: Connection refused by host [23:24:32] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 4.74 ms [23:24:33] PROBLEM - salt-minion processes on cerium is CRITICAL: Connection refused by host [23:24:33] PROBLEM - DPKG on cerium is CRITICAL: Connection refused by host [23:24:41] PROBLEM - puppet last run on praseodymium is CRITICAL: Connection refused by host [23:24:51] PROBLEM - Disk space on cerium is CRITICAL: Connection refused by host [23:24:51] PROBLEM - SSH on praseodymium is CRITICAL: Connection refused [23:25:01] PROBLEM - Cassandra database on praseodymium is CRITICAL: Connection refused by host [23:25:07] Holey sheets [23:25:09] eh? [23:25:11] PROBLEM - RAID on praseodymium is CRITICAL: Connection refused by host [23:25:22] I think we just lost a switch. [23:25:25] 404 wiki not found [23:25:31] RECOVERY - Host labs-ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [23:25:32] j/k [23:25:40] It's still there [23:25:46] Bsadowski1: not cool [23:26:08] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1155635 (10scfc) 5Resolved>3declined That may be, but this is certainly not //resolved//, and whethe... 
[23:26:11] Coren: I rebooted those hosts [23:26:17] these are cassandra test hosts [23:26:19] working with mutante on the firewall [23:26:27] greg-g: I would have said something else if it were true [23:26:33] ;) [23:26:37] Like the actual message [23:26:46] * Coren breathes again. [23:26:50] !log rebooted xenon, cerium, praseodymium to reload the firewall from scratch [23:26:55] Logged the message, Master [23:26:58] Coren: sorry, should have logged earlier [23:27:11] mutante: I have a firewall question as well, if you aren’t already hip-deep in other stuff [23:27:18] i'm trying to get on the hosts but i'm on a train [23:27:25] gwicke: No worry - that looked like we had just lost a whole rack. :-) [23:27:28] it might be the same issue — I did some iptables stuff on the commandline, and now ferm doesn’t seem to be applied [23:27:39] I've been doing stuff in Linux a lot lately. I'm a bit new to it but yeah. Just using a basic distro of Ubuntu [23:27:48] :P [23:27:56] Teaching myself things [23:28:30] eh, it looks like they just didn't reboot now [23:28:41] no output on console rather than firewall [23:28:58] 6operations, 7HTTPS, 7Performance, 7notice, 7user-notice: Support SPDY - https://phabricator.wikimedia.org/T35890#1155650 (10gpaumier) [23:29:08] resets the drac on xenon [23:29:55] 6operations, 7HTTPS, 7Performance, 7notice, 7user-notice: Support SPDY - https://phabricator.wikimedia.org/T35890#370227 (10gpaumier) (Added a link to the English Wikipedia article for people who come here from [[ https://meta.wikimedia.org/wiki/Tech/News/2015/14 | Tech News ]] and don't know what SPDY is.) [23:29:56] mutante: k, thx [23:30:02] arr, i have real problems with the connection though [23:30:07] because i'm moving [23:30:25] mutante is riding a bullet.. train [23:30:30] lol, yes [23:30:38] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?)
names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1155653 (10coren) That may still be an option, once we have at least //one// that actually works right.... [23:30:39] too fast for internet [23:31:02] it's really the BabyBullet [23:32:26] Continue to wait, or Press S to skip mounting or M for manual recovery [23:33:00] ebernhardson: Got it. [23:33:02] The disk drive for /mnt/data is not ready yet or not present. [23:33:12] presses S [23:33:19] and sees cerium come back [23:33:21] RECOVERY - configured eth on cerium is OK: NRPE: Unable to read output [23:33:32] RECOVERY - dhclient process on cerium is OK: PROCS OK: 0 processes with command name dhclient [23:33:41] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [23:33:41] RECOVERY - RAID on cerium is OK: OK: no disks configured for RAID [23:33:53] Krenair: Why ping? I'm not Flow. [23:34:11] RECOVERY - SSH on cerium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:34:12] RECOVERY - salt-minion processes on cerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:34:21] RECOVERY - DPKG on cerium is OK: All packages OK [23:34:22] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 36 minutes ago with 0 failures [23:34:25] mutante: good thing that these are only test runs [23:34:30] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures [23:34:31] RECOVERY - Disk space on cerium is OK: DISK OK [23:34:31] RECOVERY - SSH on praseodymium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:34:32] / hosts [23:34:38] !log cerium, xenon, praseodymium - stuck at boot because /mnt/data not ready, skipped mounting to reboot [23:34:43] Logged the message, Master [23:34:51] RECOVERY - configured eth on xenon is OK: NRPE: Unable to read output [23:35:01]
yes [23:35:01] RECOVERY - RAID on praseodymium is OK: OK: no disks configured for RAID [23:35:01] RECOVERY - DPKG on xenon is OK: All packages OK [23:35:10] RECOVERY - configured eth on praseodymium is OK: NRPE: Unable to read output [23:35:11] RECOVERY - dhclient process on praseodymium is OK: PROCS OK: 0 processes with command name dhclient [23:35:20] RECOVERY - DPKG on praseodymium is OK: All packages OK [23:35:21] RECOVERY - salt-minion processes on praseodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:35:21] RECOVERY - Disk space on praseodymium is OK: DISK OK [23:35:30] RECOVERY - Disk space on xenon is OK: DISK OK [23:35:31] RECOVERY - salt-minion processes on xenon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:35:40] RECOVERY - RAID on xenon is OK: OK: no disks configured for RAID [23:35:50] RECOVERY - SSH on xenon is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:35:51] RECOVERY - dhclient process on xenon is OK: PROCS OK: 0 processes with command name dhclient [23:36:00] Krinkle, oops, sorry [23:36:07] * Krenair should probably go to bed soon [23:36:16] Arr. these Zend tests are getting really slow now aren't they [23:36:33] Time to slash away some tests using database for no reason. [23:36:48] andrewbogott: which host was yours on? 
[23:36:57] mutante: holmium [23:37:47] mutante: for example, I think that 9001 should be closed except to select hosts [23:38:03] mutante: looks like the ssd array changed names to md127 [23:38:14] from md2 [23:38:31] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 1 failures [23:38:43] andrewbogott: ssh is blocked now on holmium and mgmt console doesn't work :/ [23:38:49] andrewbogott: also resetting drac ther [23:39:07] it’s not blocked from iron [23:39:17] I have a login — it’s fine [23:39:20] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures [23:39:21] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures [23:39:30] sigh ... this RL stuff has been merging for >20 minutes... [23:39:50] i kinda wish we could find a way to run all the tests ahead of time so we can just merge and go [23:40:04] andrewbogott: my bad, i used wmnet , not wikimedia.org [23:40:32] but console on mgmt wasn't working [23:41:18] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1155671 (10Springle) 5Open>3Resolved I've changed the above to `%wik%.*`. There is already a second glob for `flowdb.*` [23:42:19] !log starting ferm service on holmium [23:42:23] Logged the message, Master [23:42:44] Krinkle: i'm not sure what the 'Got it' was about? [23:42:47] andrewbogott: i started the ferm service and ran puppet [23:42:53] mutante: why wasn’t puppet starting it? [23:43:04] good question [23:43:22] ebernhardson: I was afk for a bit someone asked me irl. [23:43:32] well, you said port 9001? [23:43:42] mutante: yeah [23:43:42] Krinkle: ahh, well your patchstill not out, its almost through zuul though :) [23:43:49] mutante: but, your change worked — it’s blocked properly now [23:43:58] actually it just merged. 
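The boot hang logged above ("The disk drive for /mnt/data is not ready yet or not present", cleared by pressing S), combined with andrewbogott's note that the ssd array came back as md127 instead of md2, is the classic pairing: /etc/fstab names a device that no longer exists after the md rename, and Ubuntu's mountall blocks boot waiting for it. A hedged fstab sketch (filesystem type assumed, the UUID deliberately a placeholder rather than a real value):

```
# Hypothetical /etc/fstab entry. Referring to the array by UUID survives
# the md2 -> md127 rename, and nofail (plus nobootwait on mountall-era
# Ubuntu) lets boot proceed if the device never appears, instead of
# hanging at "Press S to skip mounting".
UUID=<array-uuid>  /mnt/data  ext4  defaults,nofail,nobootwait  0  2
```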
[23:44:43] !log ebernhardson Synchronized php-1.25wmf23/extensions/EventLogging/: Bump EventLogging in 1.25wmf23 for SWAT (duration: 00m 08s) [23:44:45] Krinkle: pushed to wmf23, please test [23:44:50] Logged the message, Master [23:44:51] andrewbogott: good :) i was about to say iptables -L | grep 9001 [23:45:00] ebernhardson: Doing. [23:45:11] mutante: the failure of ferm to start will have to wait for another day, I’m out for now. Thank you for restoring my firewall! [23:45:43] andrewbogott: gwicke: in both cases the ferm service was stopped for some reason [23:45:50] ebernhardson: as for merge duration, as a temporary workaround I think it'd be okay if we +2 cherry-picks shortly before the swat window. Other people shouldn't be deploying and they shouldn't overlap anyway. [23:46:00] you're welcome [23:46:02] mutante: ok, that means that our base firewall class isn’t working properly :( [23:46:30] andrewbogott: it works on a whole bunch of others though it seems [23:46:40] ebernhardson: confirmed on mw.org for wmf23 [23:46:50] i think manual iptables somehow does it [23:46:59] yeah, seems likely. [23:47:05] But really puppet should clobber local changes [23:47:28] !log ebernhardson Synchronized php-1.25wmf22/extensions/EventLogging/: Bump EventLogging in 1.25wmf22 for SWAT (duration: 00m 07s) [23:47:33] Logged the message, Master [23:47:36] Krinkle: hmm, sounds reasonable. I'll push them through ahead of time, next time [23:47:39] Krinkle: and 22 is out now [23:48:10] mutante: that's a bit worrying if we are going to rely on firewalling [23:48:54] ebernhardson: confirmed on enwiki in wmf22 [23:49:01] well, we shouldn't manually use iptables when we also apply them with puppet [23:49:01] Krinkle: excellent thanks [23:49:08] that seems to conflict really bad [23:49:09] and that wraps up SWAT, [23:54:21] in both cases the word "test" was involved but it was on hosts that we monitor
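andrewbogott's closing point above, that "really puppet should clobber local changes", is what a declared service resource in the base firewall class would buy: a manually stopped ferm gets restarted on the next agent run instead of staying down, as it did on holmium and the cassandra test hosts. A hedged sketch; the resource below is illustrative and is not WMF's actual base::firewall code:

```puppet
# Hypothetical Puppet sketch: keeping ferm under service management
# means command-line iptables fiddling or a stopped daemon is corrected
# automatically on the next puppet run.
service { 'ferm':
  ensure  => running,
  enable  => true,
  require => Package['ferm'],
}
```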