[00:00:29] (CR) Gehel: [C: -1] "No, we still want to deploy it, but only after we have the manual deploy. If we need to reinstall a node (or add new nodes) we can do it f" [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[00:01:47] (CR) Gehel: "Sorry, I mixed up patches, I thought this was the patch to update reprepro (seems my brain stops working correctly after midnight)." [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[00:02:36] (Abandoned) EBernhardson: elasticsearch: Pin elasticsearch package to specific version [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[00:14:23] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:41:14] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:54:21] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#1717673 (AlexMonk-WMF) ```krenair@bastion-01:~$ host 10.68.16.66 ;; Truncated, retrying in TCP mode. 66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikime...
[01:49:17] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[01:54:13] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2229722 (AlexMonk-WMF) So far there's been one instance in testlabs and quite a lot in contintcloud @hashar: please point me to the code that sets these contintcl...
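The `host 10.68.16.66` lookup quoted in T115194 above queries the reverse (PTR) zone; the query name is simply the IPv4 octets reversed under `in-addr.arpa`, and multiple PTR records on that one name is the bug being tracked. A quick sketch of the mapping (the helper name is mine, not from any tool in the log):

```python
def reverse_name(ipv4: str) -> str:
    """Build the in-addr.arpa name whose PTR records name an IPv4 host."""
    return ".".join(reversed(ipv4.split("."))) + ".in-addr.arpa"

# The address from the task above:
print(reverse_name("10.68.16.66"))  # 66.16.68.10.in-addr.arpa
```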
[02:18:07] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:22:30] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 10m 12s)
[02:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Apr 22 02:31:45 UTC 2016 (duration 9m 15s)
[02:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:17:15] (PS2) Alex Monk: servermon: require_package python-ldap instead [puppet] - https://gerrit.wikimedia.org/r/284731
[03:17:17] (PS1) Alex Monk: Try to separate trebuchet stuff from role::deployment::server [puppet] - https://gerrit.wikimedia.org/r/284851
[03:26:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[03:27:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/1: down - Core: cr1-eqiad:xe-4/2/0 (Telia, IC-307235, 34ms) {#10693} [10Gbps wave]
[03:43:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[03:44:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[04:10:58] (CR) KartikMistry: "Marko, related changes in cxserver are, https://gerrit.wikimedia.org/r/284634 and https://gerrit.wikimedia.org/r/284686 This is fine in Be" [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[04:12:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[04:16:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[04:21:17] (PS7) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[04:22:25] (CR) jenkins-bot: [V: -1] WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[04:28:48] Operations: Access to wikitech-static server? - https://phabricator.wikimedia.org/T133372#2229765 (BBlack)
[04:39:22] (PS8) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[04:40:38] (CR) jenkins-bot: [V: -1] WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[04:40:40] (PS1) 20after4: Fix race in puppet::self (puppet.conf compilation) [puppet] - https://gerrit.wikimedia.org/r/284852 (https://phabricator.wikimedia.org/T132689)
[04:40:58] Puppet, Beta-Cluster-reproducible: deployment-prep puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2229781 (Krenair) I think it's something to do with running puppet over sa...
[04:41:09] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2229783 (Krenair)
[04:41:20] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2183851 (Krenair)
[04:41:22] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2229784 (Krenair)
[04:41:53] Puppet, Beta-Cluster-Infrastructure, Revision-Scoring-As-A-Service, Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2229786 (Krenair)
[04:41:58] Puppet, Beta-Cluster-Infrastructure: deployment-cache-parsoid05 puppet failures due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2229787 (Krenair)
[04:45:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[04:46:17] (PS9) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[04:46:46] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2229791 (Krenair) a: mmodell (I went and found the code in puppet...
[04:49:28] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2229793 (mmodell) I think I found the race condition: The order of o...
[04:50:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[04:50:13] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2229794 (mmodell) I'm gonna cherry pick the patch on beta. We'll see...
[04:50:53] akosiaris: ping me when around. I guess patch is ready. Addressed Marco's comment and also default registry will be read if it is beta.
[04:52:28] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2229795 (mmodell) Actually, come to think of it, I'm not sure if it's...
[04:58:02] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2183851 (mmodell) I'd like to open the broader discussion of accountability for breaking b...
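T132689 above describes /etc/puppet/puppet.conf ending up with doubled content because two writers race. Whatever the actual fix in the Gerrit patch turned out to be (the comment is truncated here), the generic way to keep a reader from ever observing partial or concatenated output is write-to-temp-then-rename; a sketch of that pattern, not the puppet::self code itself:

```python
import os
import tempfile

def write_atomically(path: str, content: str) -> None:
    # Write the full file to a temporary name in the same directory,
    # then rename over the target. rename() is atomic on POSIX, so a
    # concurrent reader sees either the old file or the new one whole,
    # never a half-written or doubled mix.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```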
[05:02:12] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (yuvipanda) See also T120159
[05:21:27] (PS10) KartikMistry: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[06:15:33] mobrovac: do you know if "cxserver::restbase_url": http://deployment-restbase02.deployment-prep.eqiad.wmflabs:7231/@lang.wikipedia.beta.wmflabs.org/v1/page/html/@title is correct?
[06:15:52] beta seems not loading any page for CX.
[06:24:08] kart__: why not test it?
[06:28:25] kart__: works for me
[06:30:06] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: puppet fail
[06:31:27] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:34] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:54] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:42:07] Nikerabbit: sure?
[06:43:04] kart__: yes I tested wget deployment-restbase02.deployment-prep.eqiad.wmflabs:7231/en.wikipedia.beta.wmflabs.org/v1/page/html/Main_Page
[06:43:33] okay.
[06:43:44] So, something else is wrong with config.
[06:49:22] kart__: nothing in beta logstash?
[06:52:12] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:53:23] Nikerabbit: nothing I can find so far..
[06:55:44] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:57:12] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:53] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:17:43] Nikerabbit: okay. Issue was source for en is enwiki in beta, not from Production. Bad for CX, but that explains errors.
[07:19:12] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:30:22] kart_: yes, the RB URL is correct
[07:47:41] (CR) Yuvipanda: [C: 1] Update font package dependency in toollabs::exec_environ [puppet] - https://gerrit.wikimedia.org/r/284653 (owner: Muehlenhoff)
[07:52:45] (PS2) Muehlenhoff: Update font package dependency in toollabs::exec_environ [puppet] - https://gerrit.wikimedia.org/r/284653
[07:53:41] (CR) Muehlenhoff: [C: 2 V: 2] Update font package dependency in toollabs::exec_environ [puppet] - https://gerrit.wikimedia.org/r/284653 (owner: Muehlenhoff)
[07:54:47] (PS2) Muehlenhoff: Add additional Gujarati fonts (Rekha) (fonts-gujr-extra) [puppet] - https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500)
[07:55:03] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail
[08:00:49] (CR) Muehlenhoff: [C: 2 V: 2] Add additional Gujarati fonts (Rekha) (fonts-gujr-extra) [puppet] - https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500) (owner: Muehlenhoff)
[08:01:29] (CR) Mobrovac: [C: -1] "There is a bug in I5a3b87141da586955b5749e96f683262561c1bbd (hence -1 here): if app.conf.registry is not defined, then it will not be foun" [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[08:01:34] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2229924 (hashar) The system is Nodepool which request creation and deletion of instances via the OpenStack API end point. It creates hundreds of instances per day...
[08:05:33] Operations, Puppet, Commons, Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2229926 (MoritzMuehlenhoff) Open>Resolved fonts-gujr-extra has been added to the package list and deployed.
[08:06:45] (PS2) Muehlenhoff: Blacklist usbip kernel modules [puppet] - https://gerrit.wikimedia.org/r/284138
[08:09:50] Operations, OCG-General, codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2229930 (Joe)
[08:16:20] (CR) Muehlenhoff: [C: 2 V: 2] Blacklist usbip kernel modules [puppet] - https://gerrit.wikimedia.org/r/284138 (owner: Muehlenhoff)
[08:22:51] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[08:27:09] Operations, Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#2229946 (hashar) On `integration-slave-jessie1001` (which has the puppet class `mediawiki::packages`) puppet is happy: ``` Notice: /S...
[08:36:30] (PS1) Raimond Spekking: Add namespace translation 'Portal' for diq [mediawiki-config] - https://gerrit.wikimedia.org/r/284866
[08:50:12] PROBLEM - LVS HTTP IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[08:50:23] <_joe_> wtf?
[08:50:24] uh
[08:50:33] PROBLEM - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[08:50:37] <_joe_> shit
[08:50:38] akosiaris perhaps ^ ?
[08:50:41] <_joe_> akosiaris: ^^
[08:50:45] uh?
[08:50:49] doubtfull
[08:50:54] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[08:50:58] only IPv6 ?
[08:51:06] <_joe_> only ipv6 it seems
[08:51:09] so far
[08:51:26] <_joe_> ipv4 pings
[08:51:42] cps reporting IPsec problems across the fleet
[08:52:03] <_joe_> so, let's put codfw out of the geoip map?
[08:52:12] well icinga reporting IPsec for CPs across the fleet
[08:52:23] <_joe_> and it's ipv6
[08:53:07] <_joe_> I can't get to cp2011 at
[08:53:09] <_joe_> *atm
[08:53:13] <_joe_> ssh into it I mean
[08:54:24] RECOVERY - LVS HTTP IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 555 bytes in 0.073 second response time
[08:55:58] (PS1) Giuseppe Lavagetto: Set codfw globally down [dns] - https://gerrit.wikimedia.org/r/284867
[08:55:59] I see nothing on cr{1,2}-codfw that would explain what happened yet
[08:56:11] <_joe_> should we? ^^
[08:56:23] sorry, local distraction
[08:56:26] <_joe_> paravoid akosiaris what do you think?
[08:56:43] we already got on recovery
[08:56:45] one*
[08:57:10] salt works on cp2011.codfw.wmnet (and it worked when joe could not ssh)
[08:57:14] <_joe_> keep in mind that it will anyways take up to 10 minutes for traffic to drain
[08:57:36] <_joe_> jynus: yeah it's strange, I can ssh into all the non-cp hosts I tried
[08:57:54] grrrr my laptop crashed
[08:58:08] but a traceroute6 from eqiad to cp2011 fails. stuck at xe-5-0-1.cr2-codfw.wikimedia.org
[08:58:09] <_joe_> paravoid: should I depool codfw?
[08:58:15] do it
[08:58:19] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:19] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:19] Is there a way to detect errors before the LVS?
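The "Set codfw globally down" change being discussed is an edit to the geo-DNS map; conceptually it removes codfw as a candidate so every region falls through to its next-preferred datacenter. A toy illustration of that fall-through (the datacenter names are real, but the map, weights and function here are hypothetical, not the actual gdnsd configuration):

```python
# Marking a datacenter down means it is skipped wherever it appears in
# a region's preference list; traffic drains to the next choice.
DC_UP = {"eqiad": True, "codfw": False, "esams": True, "ulsfo": True}

GEO_MAP = {  # hypothetical region -> ordered datacenter preference
    "north-america": ["codfw", "eqiad"],
    "europe": ["esams", "eqiad"],
    "asia": ["ulsfo", "eqiad"],
}

def pick_datacenter(region: str) -> str:
    for dc in GEO_MAP.get(region, ["eqiad"]):
        if DC_UP[dc]:
            return dc
    return "eqiad"  # last-resort fallback

print(pick_datacenter("north-america"))  # codfw is down, falls back to eqiad
```

Note the "up to 10 minutes for traffic to drain" caveat in the log: DNS TTLs mean resolvers keep returning the old answer until their caches expire.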
[08:58:19] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:20] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:58:20] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:58:22] I'm investigating in the meantime
[08:58:29] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:58:32] <_joe_> akosiaris: can you look at the change?
[08:58:38] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:58:38] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:58:41] I'll look
[08:58:49] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2012_v6
[08:58:49] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:49] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:49] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:58:49] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:58:49] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:58:49] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 18 connecting: (unnamed), cp1045_v6, cp1051_v6, cp1058_v6, cp1061_v6, cp3007_v6, cp3008_v6, cp3009_v6, cp3010_v6, cp4001_v6, cp4002_v6 not-conn: cp4003_v6, cp4004_v6,kafka1012_v6,kafka1013_v6,kafka1014_v6,kafka1018_v6,kafka1020_v6,kafka1022_v6
[08:58:50] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:58:50] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:58:51] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:58:51] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:58:52] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:58:56] looks fine
[08:58:59] (CR) Faidon Liambotis: [C: 2] Set codfw globally down [dns] - https://gerrit.wikimedia.org/r/284867 (owner: Giuseppe Lavagetto)
[08:59:09] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 119 connecting: cp2007_v6, cp2008_v6, cp2010_v6, cp2011_v6, cp2012_v6
[08:59:09] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 119 connecting: cp2007_v6, cp2008_v6, cp2010_v6, cp2011_v6, cp2012_v6
[08:59:09] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 119 connecting: cp2007_v6, cp2008_v6, cp2010_v6, cp2011_v6, cp2012_v6
[08:59:10] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:59:10] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:18] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: (unnamed), cp1052_v6, cp1053_v6, cp1054_v6, cp1055_v6, cp1065_v6, cp1066_v6, cp1067_v6, cp1068_v6 not-conn: cp3030_v6, cp3031_v6, cp3032_v6, cp3033_v6, cp3040_v6, cp3041_v6, cp3042_v6, cp3043_v6, cp4008_v6, cp4009_v6, cp4010_v6, cp4016_v6, cp4017_v6, cp4018_v6,kafka1012_v6,kafka1013_v6,kafka1014_v6,kafka1018_v6,kafka1020_v6,kafka1022_v6
[08:59:18] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:19] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:19] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:59:19] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:59:20] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:59:20] _joe_: merged/pushed
[08:59:29] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:30] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:59:30] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[08:59:30] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:59:30] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:59:36] <_joe_> paravoid: ack
[08:59:39] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:59:39] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2012_v6
[08:59:48] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2012_v6
[08:59:48] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:59:48] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2012_v6
[08:59:48] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[08:59:49] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:59:49] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[08:59:49] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 35 connecting: (unnamed), cp1048_v6, cp1049_v6, cp1050_v6, cp1062_v6, cp1063_v6, cp1064_v6, cp1071_v6, cp1072_v6, cp1073_v6, cp1074_v6, cp1099_v6 not-conn: cp3034_v6, cp3035_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3039_v6, cp3044_v6, cp3045_v6, cp3046_v6, cp3047_v6, cp3048_v6, cp3049_v6, cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6,kafka1012_v6,kafka10
[08:59:49] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:50] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:59:50] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[08:59:50] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[08:59:58] grrr those fucking checks
[09:00:00] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2012_v6
[09:00:04] :-)
[09:00:09] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[09:00:09] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 119 connecting: cp2007_v6, cp2008_v6, cp2010_v6, cp2011_v6, cp2012_v6
[09:00:09] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[09:00:09] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:09] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:09] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[09:00:09] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: (unnamed), cp1052_v6, cp1053_v6, cp1054_v6, cp1055_v6, cp1065_v6, cp1066_v6, cp1067_v6, cp1068_v6 not-conn: cp3030_v6, cp3031_v6, cp3032_v6, cp3033_v6, cp3040_v6, cp3041_v6, cp3042_v6, cp3043_v6, cp4008_v6, cp4009_v6, cp4010_v6, cp4016_v6, cp4017_v6, cp4018_v6,kafka1012_v6,kafka1013_v6,kafka1014_v6,kafka1018_v6,kafka1020_v6,kafka1022_v6
[09:00:18] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:18] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:19] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[09:00:19] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:19] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[09:00:19] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[09:00:20] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 35 connecting: (unnamed), cp1048_v6, cp1049_v6, cp1050_v6, cp1062_v6, cp1063_v6, cp1064_v6, cp1071_v6, cp1072_v6, cp1073_v6, cp1074_v6, cp1099_v6 not-conn: cp3034_v6, cp3035_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3039_v6, cp3044_v6, cp3045_v6, cp3046_v6, cp3047_v6, cp3048_v6, cp3049_v6, cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6,kafka1012_v6,kafka10
[09:00:29] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:29] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp2007_v6, cp2010_v6
[09:00:29] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:39] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2007_v6, cp2010_v6
[09:00:39] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6
[09:00:40] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2008_v6, cp2011_v6
[09:00:44] monitoring a distributed system is hard
[09:00:49] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 119 connecting: cp2007_v6, cp2008_v6, cp2010_v6, cp2011_v6, cp2012_v6
[09:01:16] <_joe_> jynus: in particular when your monitoring system is inadequate and your abstractions for it are lame :/
[09:01:27] yeah, but still
[09:02:13] so, multiple servers, right?
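All of the IPsec alerts above come from one NRPE check whose one-line summary packs peer states into `ok:`, `connecting:` and `not-conn:` fields. A hypothetical parser for that summary, written against the lines shown in this log rather than the actual check's source:

```python
import re

def parse_strongswan(line: str) -> dict:
    """Parse a summary like
    'Strongswan CRITICAL - ok: 54 not-conn: cp2008_v6, cp2011_v6'
    into an ok-count plus lists of peers in each bad state."""
    result = {"ok": 0, "not-conn": [], "connecting": []}
    m = re.search(r"ok: (\d+)", line)
    if m:
        result["ok"] = int(m.group(1))
    for field in ("not-conn", "connecting"):
        # Capture lazily up to the next field label or end of line.
        m = re.search(field + r": (.+?)(?= (?:not-conn|connecting):|$)", line)
        if m:
            result[field] = [p.strip() for p in m.group(1).split(",")]
    return result
```

Such a parser makes it easy to aggregate a flood like this one ("which codfw peers appear in every alert?") instead of reading hundreds of near-identical lines.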
[09:02:31] 7 / 8 / 11
[09:03:25] no default route
[09:03:26] is that only row B by any chance? codfw has alerts indeed only for cp2007/8/10/11
[09:03:26] wtf
[09:03:36] and 2012
[09:03:38] and yes, not all rows
[09:03:48] so it must be my igmp snooping change
[09:03:51] yes
[09:04:01] router advertisements are broken
[09:04:04] from cp2007
[09:04:37] ok, I 'll rollback and let's see what happens
[09:04:58] k
[09:05:22] yup
[09:05:24] works now
[09:05:28] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK
[09:05:28] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 124 ESP OK
[09:05:38] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 124 ESP OK
[09:05:38] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 124 ESP OK
[09:05:43] <_joe_> traffic is draining now, heh
[09:05:43] RECOVERY - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 954 bytes in 0.167 second response time
[09:05:44] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 124 ESP OK
[09:05:44] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK
[09:05:44] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK
[09:05:48] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK
[09:05:48] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK
[09:05:57] ok then
[09:05:58] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK
[09:05:59] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK
[09:05:59] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK
[09:05:59] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK
[09:06:02] hmm
[09:06:04] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 438 bytes in 0.183 second response time
[09:06:04] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK
[09:06:08] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK
[09:06:08] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK
[09:06:09] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK
[09:06:09] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK
[09:06:19] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK
[09:06:19] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK
[09:06:19] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK
[09:06:19] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK
[09:06:19] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK
[09:06:20] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK
[09:06:22] how it only affected a subset of the servers?
[09:06:28] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK
[09:06:28] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK
[09:06:28] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK
[09:06:28] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK
[09:06:28] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK
[09:06:29] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK
[09:06:29] so I could not get igmp-snooping to work without declaring explicitly private1-b-codfw on the switch
[09:06:29] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK
[09:06:35] ah!
[09:06:38] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [09:06:39] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 124 ESP OK [09:06:39] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [09:06:39] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [09:06:48] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [09:06:48] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [09:06:48] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [09:06:48] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [09:06:49] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [09:06:49] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [09:06:50] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [09:06:50] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [09:06:50] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [09:06:50] but that somehow broke NDP which is multicast [09:06:51] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [09:06:58] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [09:06:59] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [09:06:59] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [09:06:59] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [09:07:08] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [09:07:08] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [09:07:09] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [09:07:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [09:07:09] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [09:07:09] that is good news, I think [09:07:09] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [09:07:10] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [09:07:10] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [09:07:17] good I did not have time to push the change 
to all the switches [09:07:19] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [09:07:19] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 124 ESP OK [09:07:19] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [09:07:28] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [09:07:29] :-) [09:07:29] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [09:07:29] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [09:07:30] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [09:07:30] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [09:07:38] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [09:07:38] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [09:07:39] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [09:07:39] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [09:07:39] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [09:07:39] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [09:07:39] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [09:07:40] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [09:07:40] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [09:08:20] also something is weird with lldp over there [09:08:39] it reports only a few of the cp neighbors [09:08:52] which led me off track initially [09:09:06] as for example it was not reporting cp2010 (which it does now) [09:09:19] but now it does not report cp2008 for example [09:10:00] when someone was showing me around the network switches (paravo.id or chase.mp not sure) in eqiad it had the same thing (lldp didn't discover a good chunk of machines) [09:10:19] this was quite a while ago tho [09:10:36] YuviPanda: the funny thing is on the boxes side it seems to work quite reliably [09:11:11] ah, so I see at least 5 entries for lldp neighbors that are without a "System Name" [09:11:42] in eqiad it's because of ancient junos [09:11:43]
<_joe_> akosiaris: do you want to take the time to try that change correctly, or should I re-insert codfw in the pool? [09:12:02] _joe_: don't pool it yet .. let's figure out what's going on first [09:12:14] <_joe_> akosiaris: ack [09:12:51] brb, laptop's fucked :P [09:13:05] show lldp neighbors |match cp returns inconsistent results indeed. every time something different [09:13:22] sometimes 2 boxes, sometimes 5, sometimes 3 [09:13:26] nice juniper... [09:13:55] (it's 6 in total btw) [09:16:32] (03PS2) 10Giuseppe Lavagetto: apache-fast-test: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/284697 [09:17:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] apache-fast-test: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/284697 (owner: 10Giuseppe Lavagetto) [09:17:27] (03PS7) 10Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) [09:17:29] (03PS1) 10Filippo Giunchedi: statsite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/284871 [09:19:49] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:03] ---^ this is Cassandra timing out :( [09:20:57] it auto-recovers, we are still waiting for new hardware/settings [09:23:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:24:57] (03CR) 10KartikMistry: "https://gerrit.wikimedia.org/r/#/c/284872/ is followup patch as fix suggested." [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry) [09:24:59] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:38] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
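[Editor's note] The inconsistent `show lldp neighbors | match cp` output described above can be checked mechanically by diffing successive snapshots of the neighbor table. A minimal sketch; the interface names, sample output, and host names below are invented for illustration, not taken from the actual switch:

```python
# Compare successive `show lldp neighbors`-style snapshots to spot hosts
# that appear and disappear between polls. Sample data is invented; real
# output would be captured from the switch CLI.

def neighbor_hosts(snapshot, prefix="cp"):
    """Extract neighbor system names matching a prefix from CLI output."""
    hosts = set()
    for line in snapshot.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith(prefix):
            hosts.add(fields[-1])
    return hosts

poll_1 = """\
xe-0/0/1    ge-0  cp2008
xe-0/0/2    ge-0  cp2010
"""
poll_2 = """\
xe-0/0/2    ge-0  cp2010
xe-0/0/3    ge-0  cp2011
"""

seen_1, seen_2 = neighbor_hosts(poll_1), neighbor_hosts(poll_2)
print(sorted(seen_1 - seen_2))  # neighbors that vanished between polls
print(sorted(seen_2 - seen_1))  # neighbors that newly appeared
```

Run over a handful of polls, the symmetric differences quantify the flapping ("sometimes 2 boxes, sometimes 5, sometimes 3" out of 6).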
[09:25:40] !log installing ircbalance bugfix updates (preventing massive logspam on some systems) [09:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:25:58] irqbalance, presumably :) [09:26:08] too much irclib? ;) [09:26:42] haha [09:27:40] akosiaris: can you review, https://gerrit.wikimedia.org/r/#/c/284654/ again? (There are some changes since yesterday) [09:28:34] haha, indeed :-) [09:29:50] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [09:30:03] * YuviPanda balances moritzm's IRC [09:33:18] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [09:33:46] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [09:41:57] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:47] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:44:56] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [09:45:47] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:06] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [09:46:16] !log installing PHP security updates [09:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:37] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 1 failures [09:51:17] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:53:10] 06Operations, 10DBA, 13Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2230090 (10jcrespo) [09:53:27] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [09:54:08] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [09:54:35] 06Operations, 10DBA, 13Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1292599 (10jcrespo) p_s generally available on all masters and many slaves- it only needs some restarts on some pending servers, but new collectors can already be programmed a... [09:54:37] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:56:47] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [09:58:14] 06Operations, 10DBA, 07Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#2230099 (10jcrespo) [09:58:16] 06Operations, 10DBA, 13Patch-For-Review: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2230100 (10jcrespo) [09:58:18] 06Operations, 10DBA, 13Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2230102 (10jcrespo) [09:58:20] 06Operations, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2230101 (10jcrespo) [09:58:22] 06Operations, 10DBA, 13Patch-For-Review: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#2230096 (10jcrespo) 05Open>03Resolved I am going to go ahead and say this is done, the rest of the servers will be done at a... 
[10:00:35] 06Operations, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2230107 (10jcrespo) given that "role::coredb::*" is to be deprecated, I would close this ticket after: 1) a pending-to-be-applied-ferm db hosts is listed 2) new rules deleting iron are appl... [10:02:25] 06Operations, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2230108 (10MoritzMuehlenhoff) Sounds good to me, I'll update the ticket later with the list of hosts. [10:06:42] 06Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2230109 (10jcrespo) All masters are now in jessie or trusty; precise old masters now to be reimaged. Current trusty masters *will not be* upgraded to jessie (invalid still applies @dzahn), only the trusty slaves/decom... [10:07:05] (03PS1) 10Elukey: Add a simple class to the statistics module to allow basic apache maintenace. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [10:08:16] (03CR) 10jenkins-bot: [V: 04-1] Add a simple class to the statistics module to allow basic apache maintenace. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [10:09:32] 06Operations, 10DBA, 07Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#2230113 (10jcrespo) Should we continue doing ROW performance testing on codfw at the same time than T121207 is worked on? 
[10:12:13] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:13:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:15:36] 06Operations, 10Traffic, 13Patch-For-Review: confctl: improve/upgrade --tags/--find - https://phabricator.wikimedia.org/T128199#2230144 (10Joe) GET: ``` $ confctl --config conf.yaml select 'dc=codfw,cluster=api_appserver,name=mw20(7[6-9]|[8-9][0-9]).*' get {"mw2076.codfw.wmnet": {"pooled": "yes", "weight": 2... [10:20:22] (03PS2) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [10:21:32] (03CR) 10jenkins-bot: [V: 04-1] Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [10:23:22] (03PS3) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [10:27:57] anybody that can give me some hint if --^ makes sense or not? (basic apache config for a maintenance page - http only) [10:29:19] the idea would be to apply the role to bromine or bohrium temporarily [10:30:13] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 5 failures [10:31:04] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [10:33:02] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 2 failures [10:35:51] (03CR) 10Mobrovac: "The bug is still there. 
As I tried to hint at in my previous post, the crux of the problem lies in the the file-loading part: you have to " [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry) [10:39:57] (03PS3) 10Alexandros Kosiaris: servermon: require_package python-ldap instead [puppet] - 10https://gerrit.wikimedia.org/r/284731 (owner: 10Alex Monk) [10:40:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: require_package python-ldap instead [puppet] - 10https://gerrit.wikimedia.org/r/284731 (owner: 10Alex Monk) [10:47:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 675 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5190276 keys - replication_delay is 675 [10:55:51] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:56:21] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:57:31] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5157641 keys - replication_delay is 0 [11:09:20] PROBLEM - HHVM rendering on mw1247 is CRITICAL: Connection refused [11:11:30] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 71453 bytes in 0.075 second response time [11:12:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:12:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:12:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:13:09] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: 
/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 400 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid [11:13:19] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 5 failures [11:13:50] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 7 failures [11:14:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:19:28] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:20:28] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:31:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:32:19] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:39:09] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:39:38] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:20] 06Operations, 13Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2230173 (10MoritzMuehlenhoff) These are the remaining db systems without base::firewall: Servers using coredb, which are re-setup with mariadb::core: db1052.eqiad.wmnet db1038.eqiad.wmnet d... [11:51:40] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:54:57] (03PS4) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. 
[puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [12:03:54] (03PS5) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [12:04:58] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 400 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [12:06:28] PROBLEM - RAID on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:06:49] PROBLEM - DPKG on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:06:58] PROBLEM - Disk space on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:08] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:07:39] PROBLEM - configured eth on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:54] (03PS6) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [12:07:58] PROBLEM - dhclient process on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:58] PROBLEM - salt-minion processes on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:08:20] PROBLEM - puppet last run on hassaleh is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:33] (03PS7) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. 
[puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [12:14:08] RECOVERY - configured eth on hassaleh is OK: OK - interfaces up [12:14:19] RECOVERY - salt-minion processes on hassaleh is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:19] RECOVERY - dhclient process on hassaleh is OK: PROCS OK: 0 processes with command name dhclient [12:14:39] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 40 minutes ago with 0 failures [12:15:00] RECOVERY - RAID on hassaleh is OK: OK: no RAID installed [12:15:19] RECOVERY - DPKG on hassaleh is OK: All packages OK [12:15:30] RECOVERY - Disk space on hassaleh is OK: DISK OK [12:15:40] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 400 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid [12:26:02] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2230227 (10elukey) Summary: 1) we decided to have a maintenance page hosted somewhere to avoid connection timeouts during the reimage. I created https://gerrit.wikimedia.org/r/2... 
[12:33:06] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:38:46] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 400 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 400 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) [12:47:15] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:59:08] mobrovac: ^ curious 400s only from restbase1007, can see the 400s in logstash but not obvious to me what that could be causing it [13:00:01] wth? [13:00:12] on a friday afternoon? [13:00:13] * mobrovac sighs [13:00:36] hehe 1007 is out for a friday beer [13:00:58] lol [13:01:12] (interestingly, "the end" by "the doors" has just started playing) [13:01:31] godog: i need one too [13:01:59] hehe [13:02:40] mobrovac: I'll depool it just in case, I haven't seen alarms from others [13:02:49] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:25] !log depool restbase1007, 400s from restbase self check [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:05:48] kk godog, /me looking [13:10:55] (03PS5) 10BBlack: cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) [13:11:08] (03PS4) 10BBlack: cache_maps: remove cp104[34] test caches [puppet] - 10https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) [13:11:16] (03PS4) 10BBlack: cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) [13:11:25] (03PS1) 10Muehlenhoff: Allow display of a job by job id or relatively (e.g. 
-2 to show the second-last) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/284889 [13:11:57] (03PS5) 10BBlack: maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) [13:12:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update debian package for gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [13:16:18] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 400 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 400 (expecting: 200): /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: [13:16:57] yes icinga-wm, we know [13:17:15] 06Operations, 10DBA, 06Labs, 07Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2230289 (10jcrespo) [13:18:18] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:19:02] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1015-a [puppet] - 10https://gerrit.wikimedia.org/r/284890 [13:19:17] I've silenced it for a couple of hours [13:20:38] cool, thnx, but it seems to be flapping [13:27:09] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:27:58] !log upload 5.3.10-1ubuntu3.22+wmf1 on apt.wikimedia.org [13:28:01] moritzm: ^ [13:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:44] 4 host I think [13:29:06] akosiaris: thanks, I'll update these [13:29:18] mobrovac: cassandra looks fine to me tho, I wanted to get restbase1015 going asap so I'll merge https://gerrit.wikimedia.org/r/#/c/284890/ [13:30:08] yeah, that's the strange part godog [13:30:15] godog: kk for rb1015 [13:30:52] (03PS2) 
10Filippo Giunchedi: cassandra: add restbase1015-a [puppet] - 10https://gerrit.wikimedia.org/r/284890 [13:30:54] godog: i'll kill restbase on rb1007 and run it manually to try to figure out what's going on [13:30:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1015-a [puppet] - 10https://gerrit.wikimedia.org/r/284890 (owner: 10Filippo Giunchedi) [13:31:28] ok! [13:31:47] manually == on a different port [13:31:55] so as to have a sane output [13:33:10] 06Operations, 10DBA, 07Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#2230311 (10jcrespo) [13:34:00] !log restbase stopping restbase on rb1007 to manually inspect why is it flapping [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:41] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2230318 (10jcrespo) [13:39:57] PROBLEM - Restbase root url on restbase1007 is CRITICAL: Connection refused [13:41:30] also silenced [13:42:36] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2230332 (10jcrespo) Main theoretical reasons: * Simpler replication maintenance (failover is now just CHANGE MASTER TO MASTER_HOST='new master'; * Transact... 
[13:42:40] (03Abandoned) 10BBlack: pull down 3x HTTP/2 fixes from nginx master [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284076 (owner: 10BBlack) [13:43:03] thnx godog [13:43:04] (03PS3) 10BBlack: multicert + libssl1.0.2 patches for 1.9.15 [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284075 (https://phabricator.wikimedia.org/T96848) [13:43:06] (03PS4) 10BBlack: nginx (1.9.15-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284077 (https://phabricator.wikimedia.org/T96848) [13:43:08] (03PS1) 10BBlack: Import nginx.org 1.9.14-1.9.15 diffs [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284892 [13:43:52] !log bootstrap restbase1015-a T128107 [13:43:53] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107 [13:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:26] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2230347 (10jcrespo) [13:44:47] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:45:32] again? [13:45:34] goddammit [13:45:46] 06Operations, 10DBA, 10MediaWiki-JobQueue: Jobqueue increase activity on refreshlinks for frwiktionary - https://phabricator.wikimedia.org/T133160#2230354 (10jcrespo) 05Open>03Resolved a:03jcrespo Issue no longer active. 
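[Editor's note] The "failover is now just CHANGE MASTER TO MASTER_HOST='new master'" point in T133385 above rests on MariaDB 10 GTIDs having a comparable domain-server_id-sequence structure, so the most advanced replica can be picked by position. A rough sketch; the host names and GTID positions are made up for illustration:

```python
# MariaDB 10 GTIDs have the form domain-server_id-sequence, e.g.
# "0-171970580-10042". Within a single domain, a higher sequence number
# means a more advanced replication position, which is what makes
# re-pointing replicas at a new master straightforward.
# Host names and positions below are invented.

def parse_gtid(gtid):
    domain, server_id, seq = (int(part) for part in gtid.split("-"))
    return domain, server_id, seq

def most_advanced(replicas):
    """Pick the (host, gtid) pair with the highest sequence (single-domain case)."""
    return max(replicas, key=lambda item: parse_gtid(item[1])[2])

replicas = [
    ("db1001", "0-171970580-10042"),
    ("db1002", "0-171970580-10057"),
    ("db1003", "0-171970580-10011"),
]
print(most_advanced(replicas)[0])  # db1002
```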
[13:49:06] (03PS2) 10Andrew Bogott: Allocate LVS service IPs for labs auth and recursor dns [dns] - 10https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) [13:50:58] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.22 ms [13:51:38] (03CR) 10Filippo Giunchedi: [WIP]New user for prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [13:52:46] 06Operations, 10netops: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2230372 (10akosiaris) [13:53:34] (03CR) 10BBlack: [C: 04-1] "Needs revdns, too. But also, I'm unsure about whether these belong in the prod LVS allocations. authdns I could see an argument for, but" [dns] - 10https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) (owner: 10Andrew Bogott) [13:54:02] godog: as it usually happens, no way to reproduce it now :/ [13:54:23] heh of course! [13:54:37] godog: will start it back up, maybe was just a hiccup in the cass driver of some sorts that got cleared when i killed it [13:54:45] that's my working theory for now at least [13:55:16] (03PS1) 10Alexandros Kosiaris: Revert "Set codfw globally down" [dns] - 10https://gerrit.wikimedia.org/r/284893 [13:55:29] _joe_: ^ [13:55:51] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Set codfw globally down" [dns] - 10https://gerrit.wikimedia.org/r/284893 (owner: 10Alexandros Kosiaris) [13:56:02] mobrovac: sounds good, thanks! [13:56:47] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.011 second response time [13:57:03] ok... 
[13:57:14] !log restbase bringing the service back up on restbase1007 [13:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:05] andrewbogott: re the dns patch: really the authdns makes sense to be public and part of our normal /27 for public things that are LVS'd [13:58:26] andrewbogott: the recursor.. I'm really not even sure why our prod recursor IPs are in that range to begin with, either, but labs even more so [13:59:17] bblack: yeah, I'm not sure how to best handle the recursor. Because labs doesn't have access to the internal prod vlan... [13:59:24] andrewbogott: I think the essential conflict here is there's no block of public IPs we've allocated for "this is in the set of LVS'd public IPs, but it's not part of our official block of public service IPs we expect the outside world to hit", both of which are that /27 right now [13:59:25] making it public certainly allows labs to see it but also seems silly [13:59:54] ah, yes! That's what we need, 'public IP but not public service' [14:00:13] but really, there's zero true cases for that except for this labs recursor [14:00:25] well, prod recursor also? [14:00:53] well that's the other one today, but IMHO that shouldn't even be a public IP. even the boxes that have their primary IP on the public networks can use a private recursor IP... [14:01:13] true [14:01:28] it's just an artifact of how we configured things because it's easy. 
the prod recursors themselves need to be on the public networks to make outbound queries, but IMHO the IP they answer resolver requests on should be private [14:01:44] (which means giving them (virtual) interfaces on both public+private vlans to do it "right") [14:02:30] and then looking at our space allocations in general, both of 153 for codfw and 154 for eqiad are completely carved out, there's no excess space there to make a new block for any new class of LVS service [14:02:44] in eqiad we happen to have space in 155, but I think codfw doesn't have a secondary /24 like that? [14:03:39] we could in theory move blocks around in the existing /27's (make one of its small subnets for non-public public IPs)... [14:04:05] but moving IPs there is painful and takes a long time, and we've already given zero those whole /27's too, so they have meaning outside of just ops [14:04:36] Would having the recursor IPs in a separate block mean technical differences (routing rules, etc) or just be for documentation? [14:05:05] no, it would be a total PITA and probably require router changes, as they only expect LVS IPs to be in those LVS blocks, I think [14:05:21] but I'm not 100% sure on that, I'd have to look again to remember [14:05:33] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: Connection refused [14:06:01] I think what I'm long-windedly saying is there's no better solution than what you're doing, but I still hate it and want to bitch about it [14:06:29] bblack: you're not wrong.
I can certainly add a comment to that effect in the zone file [14:06:55] ACKNOWLEDGEMENT - Restbase root url on restbase1015 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [14:06:55] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [14:06:56] ACKNOWLEDGEMENT - restbase endpoints health on restbase1015 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.134, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrapping [14:09:24] andrewbogott: yeah so I guess I'll add DNS comments to the prod recursor IPs that are in those blocks too, something to the effect of "these LVS'd public IPs really aren't public" in the revdns [14:09:32] and that patch still needs revdns and I guess a similar comment too [14:09:41] yep, ok [14:09:45] godog: i can deploy restbase to rb1015 if you want [14:09:51] so we get rid of that alert [14:09:54] at least :) [14:10:15] mobrovac: yup, thanks! [14:11:52] (03PS1) 10BBlack: mark oddball dns-rec-lb "public" IPs [dns] - 10https://gerrit.wikimedia.org/r/284894 [14:12:22] (03PS3) 10Andrew Bogott: Allocate LVS service IPs for labs auth and recursor dns [dns] - 10https://gerrit.wikimedia.org/r/284824 [14:12:24] (03PS2) 10BBlack: mark oddball dns-rec-lb "public" IPs [dns] - 10https://gerrit.wikimedia.org/r/284894 [14:13:23] (03CR) 10Andrew Bogott: [C: 032] mark oddball dns-rec-lb "public" IPs [dns] - 10https://gerrit.wikimedia.org/r/284894 (owner: 10BBlack) [14:13:26] andrewbogott: well one other thing, aren't you going to have a labs-ns0 + labs-ns1 ? why is that one just "labs-ns" in 1x DC? 
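[Editor's note] The space-pressure argument above (LVS public service IPs live in a /27, and the 153/154 blocks are fully carved out) is easy to sanity-check with the standard library. A sketch; 192.0.2.0/27 is a documentation prefix used as a stand-in, not Wikimedia's actual allocation:

```python
import ipaddress

# A /27 holds only 32 addresses, which is why carving a sub-block out of
# it for "public but not really public" IPs is so constrained.
# 192.0.2.0/27 is an example documentation prefix, not the real LVS block.
lvs_block = ipaddress.ip_network("192.0.2.0/27")
print(lvs_block.num_addresses)                          # 32
print(ipaddress.ip_address("192.0.2.17") in lvs_block)  # True

# Splitting off smaller sub-blocks (e.g. /29s for non-public public IPs)
subblocks = list(lvs_block.subnets(new_prefix=29))
print(len(subblocks))  # 4 /29s fit in a /27
```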
[14:14:10] I guess because you have other legacy labs-ns[01] hostnames you're switching from [14:14:18] !log restbase initial deploy of 7f69f86ee9 to restbase1015 [14:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:24] godog: ^ {{done}} [14:14:49] bblack: labs-ns0 and labs-ns1 are the current servers. labs-ns will be an lvs wrapper for those two servers. [14:15:23] ok I guess I don't understand the context [14:15:24] (03CR) 10Jcrespo: "I am a bit confused about the actual goals here. An "I do not know (yet)" would be cool for me." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [14:15:27] (03PS4) 10Andrew Bogott: Allocate LVS service IPs for labs auth and recursor dns [dns] - 10https://gerrit.wikimedia.org/r/284824 [14:15:45] is upstream (public DNS for .org) going to still list labs-ns[01] as the 2x NS for wmflabs.org? [14:16:13] bblack: it's completely possible that I'm doing this all wrong :) But, the associated patch is https://gerrit.wikimedia.org/r/#/c/284829/ (and there's a bug linked in both commit messages) [14:16:57] I was expecting to change upstream to just use labs-ns as the NS for wmflabs.org [14:17:17] upstream always wants 2+ IPs, usually they expect them to have some network diversity too, but probably don't enforce that [14:17:36] so that everyone (labs VMs and upstream) just uses a single address, and all the balancing/failover is done by lvs/pybal [14:17:45] the ticket is about recdns, not authdns [14:18:20] the commit is about both [14:18:23] you're right [14:18:25] hm [14:19:06] what's standard practice for auth in this case? Am I better off leaving this as is and only doing lvs for the recursor? [14:19:18] I have no idea about long-term labs view of things, but I would think the long-term direction would be something like: [14:20:44] in each DC (codfw + eqiad), have 2x authdns machines, 4 total. 
say they're called labs-authdns-codfw-[12], labs-authdns-eqiad-[12] for example sake [14:20:55] they're all in sync (at least down to a small time window, they all give the same answers) [14:21:23] in codfw, set up an LVS IP for labs-ns0 virtual hostname. LVS balances/fails those requests to labs-authdns-codfw[12] [14:21:39] similar in eqiad for labs-ns0 on an eqiad LVS IP balanced/failed to labs-authdns-eqiad[12] [14:21:54] in upstream DNS, it's still 2x NS records, pointing at these LVS virtual IPs for labs-ns[01] [14:22:39] So that's roughly what I'm in the process of doing, right? Except for 1) slight differences in naming and 2) only one datacenter [14:22:41] sorry, two lines up that should've said "similar in eqiad for labs-n1", not 0 [14:22:58] 2) isn't likely to change any time soon, does that change things dramatically? [14:23:41] yeah lack of a second DC makes things un-resilient and that's not changing anytime soon I guess [14:24:19] but still, I don't know if upstream DNS will even take a 1x NS IP configuration from you. it's definitely not "normal", as usually you give upstream DNS redundant NS IPs that are relatively independent of each other, as much as you can. [14:24:26] we certainly could put nameservers in codfw, but if eqiad dies they would just serve names for dead VMs :( [14:24:28] DNS has built-in fault tolerance for that kind of thing [14:24:44] ok -- [14:24:51] it's still a good idea to put authdns behind LVS, but that would mean doing an LVS for each public IP, not 1x LVS backending to the only 2x authdns [14:25:16] so it sounds to me like you're making a convincing case for leaving the auth setup just as it is for now. 
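The built-in DNS fault tolerance bblack refers to (a recursor or cache holds two or more NS addresses for a zone and simply tries the next one when a server is unreachable) can be sketched roughly as follows. The server names, the address, and the `query` callback are illustrative stand-ins, not the real labs-ns setup:

```python
# Minimal sketch of client-side NS failover: given 2+ NS records, a
# resolver tries each authoritative server in turn and returns the first
# answer. All names/addresses below are hypothetical examples.

class ServerDown(Exception):
    """Stand-in for a timeout or connection refusal."""

def resolve(qname, nameservers, query):
    """Try each authoritative server in order; return the first answer.

    `query(server, qname)` stands in for an actual DNS query and may
    raise ServerDown.
    """
    last_error = None
    for ns in nameservers:
        try:
            return query(ns, qname)
        except ServerDown as exc:
            last_error = exc  # fall through to the next NS record
    raise last_error or ServerDown("no nameservers configured")

if __name__ == "__main__":
    answers = {"labs-ns1.example": "10.68.16.1"}  # hypothetical data

    def fake_query(server, qname):
        if server not in answers:
            raise ServerDown(server)
        return answers[server]

    # labs-ns0 is "down"; the resolver silently fails over to labs-ns1.
    print(resolve("tools.wmflabs.org",
                  ["labs-ns0.example", "labs-ns1.example"], fake_query))
```

This is the reason upstream registries want 2+ reasonably independent NS IPs: every cache performs this failover on its own, with no LVS needed in front of the authoritative servers.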
[14:25:39] I think so, but that's just my $0.02 from a generalist perspective [14:25:44] and using lvs for the recursor (which the upstream doesn't care about) [14:25:48] right [14:26:27] it would still be nice to have official public IPs be in floating space, though (as in, not tied to a specific rack/row), IMHO [14:26:40] so that if you move the ns0 machine to a new row, you don't have to renumber the IP [14:26:44] I think I'm convinced. I was just in a 'more load-balancing more better' mindset but… any non-stupid client can just take both nameservers and do the fail-over itself. [14:27:14] right, keep in mind actual clients don't hit authdns servers. recursors/caches do, which are more-complex and meant to handle that well in normal practice [14:27:37] bblack: is moving things to floating space literally just a question of renumbering, or would those new IPs have to be explicitly routed? [14:28:03] andrewbogott: I really don't know. I think the only non-row-specific public space we have in the general case *is* the LVS range [14:28:08] it's a problem for many services [14:28:24] ok, I will open a task for that bug and keep it vague [14:28:28] e.g. mirrors.wikimedia.org's public IP has to change if we replace carbon with a machine in a different row [14:28:49] because it CNAME's carbon's IP in public1-a-eqiad [14:29:11] that's not awful for a low TTL IP used for HTTP[S], but it kinda sucks for an authdns IP [14:30:07] that's been a pain point in past labs-ns changes I think (row moves) [14:30:23] yeah, it has [14:30:35] so, filed https://phabricator.wikimedia.org/T133389 — feel free to elaborate [14:30:45] and meanwhile I'll pull out the labs-ns bits of my other patches. Stay tuned [14:31:30] but fundamentally, it's ok to use the /27 "LVS public service IPs" space for non-LVS. that's what we do for production authdns. [14:32:06] ns[012].wikimedia.org IPs all live in that space alongside LVS IPs, but are not currently LVS-routed.
but we put special rules in the routers to map them to the right machines in specific racks. [14:33:19] (03CR) 10Filippo Giunchedi: [WIP]New user for prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [14:33:29] (03PS5) 10Andrew Bogott: Allocate LVS service IPs for labs dns recursor [dns] - 10https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) [14:35:33] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 643 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5173757 keys - replication_delay is 643 [14:36:24] andrewbogott: so another dumb question - why does the labs recursor IP[s] live in the prod networks anyways? [14:37:24] I mean see the comment: [14:37:25] ; Nevertheless it needs a public IP since labs can't [14:37:25] ; access the prod internal vlan. [14:37:42] bblack: well, the recursors are running on physical prod machines [14:37:43] but why does the labs recursor software itself need access to the prod internal vlan? [14:38:12] to access openstack things that run out there? [14:38:29] it may be that I'm misunderstanding the question… but all labs services themselves run on actual prod boxes [14:38:33] (wikitech/horizon/whatever?) [14:38:58] I don't think we have a way to run it on the labs vlan without having it actually hosted on a labs vm [14:39:39] well yeah, it's just from the naive POV, I'd think "this is just a recursor, and it can look up public names including wmflabs.org without being in prod space", but I'm guessing there's also some side-tie-in to looking up .wmflabs. non-public hostnames via the recursor, too, or something.
[14:40:30] or I guess another way to think of it: if some labs instance wanted to run its own private recursor with just a stock "look up public things" on its own 127.0.0.1 and use that in its resolv.conf, that would break something or other that it would miss out on looking up correctly, right? [14:41:21] anyways, I'm just trying to understand the whole thing better, I'm sure there's a reason, I just don't know it. [14:41:27] There's a complicated mess having to do with labs VMs accessing other labs VMs via their floating and their internal IPs [14:41:34] 06Operations, 10Monitoring: save grafana dashboards in revision control / puppet - https://phabricator.wikimedia.org/T133392#2230471 (10fgiunchedi) [14:41:36] (03CR) 10BBlack: [C: 031] Allocate LVS service IPs for labs dns recursor [dns] - 10https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) (owner: 10Andrew Bogott) [14:41:50] 06Operations, 10Monitoring, 07Tracking: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179#2230489 (10fgiunchedi) [14:41:52] 06Operations, 10Monitoring, 13Patch-For-Review, 07Tracking: consolidate graphite metrics monitoring frontends into grafana - https://phabricator.wikimedia.org/T125644#1993425 (10fgiunchedi) 05Open>03Resolved resolving, followup work is tracked in T133392 [14:41:52] It's a known (and much-investigated, and complicated) issue that labs instances can have public IPs, but other labs instances can't route to those public IPs [14:42:04] yeah I vaguely recall heh [14:42:05] so those recursors intercept requests for those public IPs and return internal IPs instead. 
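The interception just described (the labs recursors answering with a VM's internal IP when the ordinary answer would be its floating public IP, since labs-to-labs traffic can't traverse the NAT) boils down to a rewrite table consulted on top of normal recursion. The mapping, the addresses, and the `upstream_resolve` callback below are all hypothetical; this shows only the idea, not the actual recursor configuration:

```python
# Sketch of the labs recursor's floating-IP rewrite: resolve normally,
# then substitute the instance's internal address if the answer is a
# floating public IP owned by another labs VM. Addresses are made up.

FLOATING_TO_INTERNAL = {
    "208.80.155.200": "10.68.17.42",  # hypothetical floating -> internal
}

def labs_resolve(qname, upstream_resolve):
    """`upstream_resolve(qname)` stands in for ordinary recursion."""
    answer = upstream_resolve(qname)
    # NAT at the labs router can't hairpin VM-to-VM traffic via floating
    # IPs (see T96924), so rewrite those answers to internal addresses.
    return FLOATING_TO_INTERNAL.get(answer, answer)

if __name__ == "__main__":
    # A VM asking about another VM's public name gets the internal IP.
    print(labs_resolve("myvm.wmflabs.org", lambda q: "208.80.155.200"))
```

Any answer not in the table passes through untouched, which is why a plain stock recursor on 127.0.0.1 inside an instance would "work" but break exactly this VM-to-VM case.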
[14:42:18] 06Operations, 10Monitoring: save grafana dashboards in revision control / puppet - https://phabricator.wikimedia.org/T133392#2230471 (10fgiunchedi) p:05Triage>03Normal [14:42:20] that's the only interesting thing those recursors do, otherwise we wouldn't have them at all [14:42:45] what's the underlying reason labs public IPs can't route to other labs public IPs? [14:43:20] Man, I don't even remember. I can find you a bug, hang on [14:44:10] https://phabricator.wikimedia.org/T96924 [14:46:24] heh, which links back to an old RT ticket, with an email copypasta that includes: [14:46:27] Mark kindly replied: [14:46:30] So you're saying that labs instances can't reach the floating public IP addresses? I think we should fix that problem instead, for all instances... [14:47:13] and then work is done on NAT stuff, and then faidon says at some point "However, I have to go one step back: do we really need to do DNAT on labs *instances*? This feels very wrong. We should solve this on the infrastructure side of things/openstack level..." [14:48:02] " [14:48:07] This issue is nasty, it is a limitation of labs when your IP packet traverse the NAT (Network Address Translation)." 
<- another line about it in there [14:48:44] so, at some labs router machine, the public IPs are handled by NAT, and for whatever reason that setup can't let these machines hit each others' NAT IPs [14:48:59] (03PS2) 10Andrew Bogott: Add an lvs service ip (labs-ns.wikimedia.org) for the labs dns recursors [puppet] - 10https://gerrit.wikimedia.org/r/284829 (https://phabricator.wikimedia.org/T119660) [14:50:15] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:54] wow [14:52:08] we may have to return one server [14:52:11] or [14:52:20] check its connection [14:52:27] (to power) [14:53:08] damn [14:53:19] 06Operations, 10DBA, 13Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2230524 (10jcrespo) 05Resolved>03Open [14:53:23] ^ [14:53:40] too bad :( [14:53:41] not damn, volans, good that it didn't happen 25 hours ago [14:53:50] yeah I was just saying it [14:53:54] he he [14:54:18] redirect to papaul for check, otherwise, that should be in full warranty [14:54:31] go back to your lair, I have this under control [14:54:35] ok :) [14:55:33] jynus: yes warranty 2019 [14:55:48] (and actually, I was in the end right not to pool that server during the switchover) [14:56:06] I undo my own saying [14:56:22] papaul, no need to rush [14:56:29] jynus: ok [14:56:30] but I will need you to check power [14:56:42] (the server is not serving traffic right now) [14:56:52] I will update the ticket [14:56:56] and ack the alerts [14:57:08] this is the second time this happens [14:57:12] jynus: please update task [14:57:19] yes, doing it now [14:57:29] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2230527 (10elukey) Tried to install manually (dpkg -i) http://ftp.us.debian.org/debian/pool/main/m/memcached/memcached_1.4.25-2_amd64.deb on an instance in labs...
[14:57:35] (I did not warn you the first time because it seemed random at first) [15:03:44] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2230546 (10jcrespo) This server crashed again right now, potentially power related like the last time (there were no kernel logs last time, and we will probably not find them either this time). @Papaul cou... [15:04:12] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2230550 (10jcrespo) p:05High>03Normal Normal because we do not plan to use it any time soon (unlike last time). [15:04:48] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2230552 (10chasemp) a:05chasemp>03None [15:05:42] (03PS1) 10Jcrespo: Revert "Repool es2019" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284903 [15:05:48] (03PS2) 10Jcrespo: Revert "Repool es2019" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284903 [15:11:54] (03PS1) 10Rush: labs: setting up to use cgrules engine [puppet] - 10https://gerrit.wikimedia.org/r/284906 (https://phabricator.wikimedia.org/T131541) [15:11:56] (03PS3) 10Jcrespo: Depool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284903 (https://phabricator.wikimedia.org/T130702) [15:13:40] (03CR) 10Jcrespo: [C: 032] Depool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284903 (https://phabricator.wikimedia.org/T130702) (owner: 10Jcrespo) [15:16:21] (03PS1) 10Elukey: Example of possible configuration to run mc2009 with the latest memcached version. [puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) [15:17:41] (03CR) 10jenkins-bot: [V: 04-1] Example of possible configuration to run mc2009 with the latest memcached version.
[puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [15:19:50] (03CR) 10Andrew Bogott: [C: 04-1] "one typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284906 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:21:39] (03PS2) 10Elukey: Example of possible configuration to run mc2009 with the latest memcached version. [puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) [15:22:16] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2019 (duration: 00m 52s) [15:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:36] !log uploaded gerrit 2.12.2 to carbon [15:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:55] ^ ostriches [15:25:38] Thanks! [15:26:33] Oh, was that done for precise? [15:26:58] We're trying to move to jessie while upgrading. [15:27:14] 06Operations, 10ops-codfw, 13Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2230606 (10jcrespo) a:05Volans>03None [15:27:24] !log add new librenms template bound to 'port utilization over threshold' alert [15:27:27] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143705 (10jcrespo) [15:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:47] ostriches: then you shouldn't have set the target distribution to precise-wikimedia :-) [15:28:55] let me rebuild [15:29:03] Wherps! [15:29:09] 06Operations, 10ops-codfw, 06DC-Ops: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#2230610 (10jcrespo) 05Open>03stalled This server will be decommissioned soon. Only keeping this open as a reminder to not reuse that particular faulty disk. 
[15:30:17] (03PS2) 10Rush: labs: setting up to use cgrules engine [puppet] - 10https://gerrit.wikimedia.org/r/284906 (https://phabricator.wikimedia.org/T131541) [15:30:47] (03PS1) 10Chad: Set target for 2.12.2 to jessie, not precise [debs/gerrit] - 10https://gerrit.wikimedia.org/r/284908 [15:32:05] (03CR) 10Andrew Bogott: [C: 031] labs: setting up to use cgrules engine [puppet] - 10https://gerrit.wikimedia.org/r/284906 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:32:16] (03PS2) 10BBlack: ganglia web: HTTP->HTTPS redir [puppet] - 10https://gerrit.wikimedia.org/r/284803 (https://phabricator.wikimedia.org/T132521) [15:33:11] (03CR) 10Rush: [C: 032] labs: setting up to use cgrules engine [puppet] - 10https://gerrit.wikimedia.org/r/284906 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:34:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Set target for 2.12.2 to jessie, not precise [debs/gerrit] - 10https://gerrit.wikimedia.org/r/284908 (owner: 10Chad) [15:41:16] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2230646 (10jcrespo) [15:43:07] (03PS1) 10Rush: toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) [15:44:46] (03CR) 10jenkins-bot: [V: 04-1] toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:47:25] !log uploaded gerrit 2.12.2 for jessie-wikimedia to carbon [15:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5148872 keys - replication_delay is 0 [15:50:47] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2230693 (10Nuria) 
This is what I think needs to be done here to resolve this ticket as soon as possible: (cc-ing @BBlack and @Ottomata for confirmation) - host analytics.wikimedia.org... [15:51:55] (03PS2) 10Rush: toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) [15:52:58] (03CR) 10jenkins-bot: [V: 04-1] toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:54:25] !log testing https redir on ganglia.wm.o (uranium) [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:43] (03PS3) 10BBlack: ganglia web: HTTP->HTTPS redir [puppet] - 10https://gerrit.wikimedia.org/r/284803 (https://phabricator.wikimedia.org/T132521) [15:56:11] (03CR) 10BBlack: [C: 032 V: 032] "Manually tested->deployed config diff, works as expected!" [puppet] - 10https://gerrit.wikimedia.org/r/284803 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [15:57:48] (03PS3) 10Rush: toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) [16:04:48] 06Operations, 10DBA: Bug on MariaDB use_stat_tables - https://phabricator.wikimedia.org/T118079#2230716 (10jcrespo) 05Open>03Resolved .23 is already available, and in production on many servers.
[16:06:07] 06Operations, 06Analytics-Kanban, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2230731 (10Nuria) [16:12:04] (03PS3) 10BBlack: librenms: use chain cert correctly [puppet] - 10https://gerrit.wikimedia.org/r/284817 (https://phabricator.wikimedia.org/T132521) [16:12:13] (03PS1) 10Filippo Giunchedi: ganglia: don't run ganglia-monitor in labs [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) [16:12:15] (03CR) 10BBlack: [C: 032 V: 032] librenms: use chain cert correctly [puppet] - 10https://gerrit.wikimedia.org/r/284817 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [16:12:47] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2230764 (10jcrespo) Marius, or someone else, do you know if this is still ongoing after latest deployments +... [16:13:56] (03CR) 10Thcipriani: "One comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:14:03] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2230782 (10hoo) I haven't seen it in the error logs recently at least, but I'm not looking at them every day. [16:15:20] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10fgiunchedi) I think to properly fix this we'd need PRODUCTION_NETWORKS from https://gerrit.wikimedia.org/r/#/c/260926/ though https://gerrit.wikimedia.org/r/#/c/... 
[16:15:42] (03CR) 10Alex Monk: Try to separate trebuchet stuff from role::deployment::server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:18:32] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 1 failures [16:20:19] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2230789 (10jcrespo) p:05High>03Low a:05jcrespo>03None I promise I will give it a thorough check befo... [16:24:24] 06Operations, 10DBA, 13Patch-For-Review: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2230795 (10jcrespo) 05Open>03Resolved Done, not without some issues: T133309 Pending tasks tracked separately: T109179 T133385 [16:35:22] (03CR) 10Thcipriani: Try to separate trebuchet stuff from role::deployment::server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:40:09] 06Operations, 10DBA, 07Upstream: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#2230835 (10jcrespo) a:05jcrespo>03None [16:42:03] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:47:17] 06Operations, 10DBA: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079#2230848 (10jcrespo) p:05Low>03Normal a:05jcrespo>03None Master done already. Eqiad slave pending. 
[16:54:59] (03PS5) 10BBlack: nginx (1.9.15-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284077 (https://phabricator.wikimedia.org/T96848) [16:55:01] (03PS1) 10BBlack: Remove --automatic-dbgsym on dyn mod dh_strip [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284920 (https://phabricator.wikimedia.org/T96848) [16:55:53] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2230898 (10BBlack) Packaging patches updated to 1.9.15-1+wmf1 (which is still in branch wmf-1.9.14-1, as we're still based on that release from debian upstream unstable/testing, and th... [16:57:48] (03CR) 10Muehlenhoff: [C: 031] "Yeah, that can be discarded. dbgsym are fairly new, see https://lists.debian.org/debian-devel/2015/12/msg00262.html for the announcement." [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284920 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [16:59:05] 06Operations: Access to wikitech-static server? - https://phabricator.wikimedia.org/T133372#2230901 (10Dzahn) Oops, yea, it got changed during T126385. I have the new root password, when trying to update pwstore i got: the following recipients are invalid: A01005654C44D4F63C064C352E2CDEA83D58E334. that is yuvi... [17:00:51] (03CR) 10Thcipriani: Automate the generation deployment keys (keyholder-managed ssh keys) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [17:05:10] (03PS8) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [17:06:11] 06Operations, 10DBA, 07Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#2230915 (10jcrespo) a:05jcrespo>03None [17:06:16] Hm.. 
s1 lag in labs is growing to over a minute just now [17:06:22] https://tools.wmflabs.org/replag/ [17:06:48] (removed bit about read-only at 14:00) [17:08:33] Krinkle, it is due to one user running DELETE ... SELECT [17:08:45] which isn't allowed? [17:08:53] you mean in prod? [17:08:54] it is, that is my issue :-) [17:08:57] on labs [17:09:01] (03PS4) 10Rush: toollabs: bastion setup for cgred::group scripts [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) [17:09:01] :| [17:09:24] that thing, when combined with non-transactional user tables, blocks replication [17:09:31] I cannot do anything about it [17:09:46] unless you collectively allow me to ban those queries [17:10:08] it finished now [17:11:02] those are discouraged in the documentation, but not prohibited [17:11:27] I would be the first interested in disallowing those :-) [17:11:58] (03CR) 10Rush: [C: 032] "talked this over with andrew (who is currently at lunch)" [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:12:51] (03CR) 10Andrew Bogott: toollabs: bastion setup for cgred::group scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284909 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:12:58] jynus: IIRC it was allowed, because people used these DBs for their own tables, but should use toollabs DB now? [17:13:02] jynus: DELETE on a non-replication table I presume? [17:13:27] because the replication tables are obviously read-only.
[17:13:33] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 680 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5158442 keys - replication_delay is 680 [17:14:16] Krinkle, DELETE FROM ch_pl WHERE NOT EXISTS ( SELECT 1 FROM enwiki [17:14:29] replica tables are transactional [17:15:15] but users can create non-transactional tables, so when executing DML targeting their tables and reading from the replica tables, they lock them completely [17:15:30] (03PS1) 10Rush: remove duplicate package def for cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/284923 (https://phabricator.wikimedia.org/T131541) [17:15:40] (03PS2) 10Rush: remove duplicate package def for cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/284923 (https://phabricator.wikimedia.org/T131541) [17:16:07] people use replicas for intermediate results, which I think is ok [17:16:36] (03PS9) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [17:16:55] but if you asked me, if you run UPDATE|DELETE|REPLACE|INSERT ... SELECT, you should be using InnoDB to avoid blocking replication [17:17:07] InnoDB or TokuDB [17:17:39] (03CR) 10Ottomata: Allow basic apache maintenace webpages for the statistics::web role. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [17:17:44] (03CR) 10Rush: [C: 032] remove duplicate package def for cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/284923 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:18:06] replica tables are of course read-only, but they still need to be written by the replica (system) thread [17:18:14] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2230942 (10ori) >>! In T96848#2147772, @BBlack wrote: > [...]
but using @ema's new systemtap stuff (which is way better than the sniffer-based solution) [...] Is this code published a... [17:18:25] those queries perform an implicit LOCK TABLE [17:20:51] If replication lags affects you I would suggest 2 ways 1) convince people not to do that (or me to enforce it and everybody is ok with that) 2) separate toolsdb on 2 services, OLTP with a very short query timeout, and analytics- allowing scheduling of long queries [17:21:05] 06Operations: Access to wikitech-static server? - https://phabricator.wikimedia.org/T133372#2230950 (10Dzahn) a:03Dzahn ok, my local copy of the repo was outdated and messed up. i needed to import 2 other new keys. fixed the wikitech-static root password has been updated in pwstore [17:21:53] I was going to propose 2) when new hardware is here, but I need to survey actual needs/usage first [17:22:31] 06Operations: Access to wikitech-static server? - https://phabricator.wikimedia.org/T133372#2230957 (10Dzahn) 05Open>03Resolved [17:24:43] (03PS1) 10Rush: toollabs: bastion fixup perms for cgred [puppet] - 10https://gerrit.wikimedia.org/r/284924 (https://phabricator.wikimedia.org/T131541) [17:24:54] (03PS2) 10Rush: toollabs: bastion fixup perms for cgred [puppet] - 10https://gerrit.wikimedia.org/r/284924 (https://phabricator.wikimedia.org/T131541) [17:25:08] (03CR) 10jenkins-bot: [V: 04-1] toollabs: bastion fixup perms for cgred [puppet] - 10https://gerrit.wikimedia.org/r/284924 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:27:29] (03PS10) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. 
[puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) [17:27:39] (03CR) 10Rush: [C: 032] toollabs: bastion fixup perms for cgred [puppet] - 10https://gerrit.wikimedia.org/r/284924 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:34:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5149820 keys - replication_delay is 0 [17:39:25] (03PS1) 10Rush: toollabs: bastion setup for cgred::group utilities [puppet] - 10https://gerrit.wikimedia.org/r/284925 (https://phabricator.wikimedia.org/T131541) [17:41:21] 06Operations, 10DBA, 13Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#2230984 (10jcrespo) [17:41:36] (03CR) 10Rush: [C: 032] "this has been in place for awhile so I'm persisting to puppet here" [puppet] - 10https://gerrit.wikimedia.org/r/284925 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:41:52] 06Operations, 10DBA, 13Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends - https://phabricator.wikimedia.org/T114752#1705156 (10jcrespo) [17:50:32] (03PS1) 10Rush: toollabs: bastion setup for cgred::group user daemons [puppet] - 10https://gerrit.wikimedia.org/r/284926 (https://phabricator.wikimedia.org/T131541) [17:53:25] (03CR) 10Rush: [C: 032] toollabs: bastion setup for cgred::group user daemons [puppet] - 10https://gerrit.wikimedia.org/r/284926 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [17:59:09] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2231122 (10elukey) So after reviewing the change with @Ottomata we realized that it would be much easier to add a rule in VCL's cache::misc to enable/disable a backend with a fla... 
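The DELETE ... SELECT pattern from the replication-lag discussion above, in runnable form. sqlite3 is used here only to show the query shape, and the table names `enwiki_page` and `user_ch_pl` are made up for illustration; the storage-engine point does not apply to SQLite itself. On the labs replicas the advice was: if the user-side table is non-transactional (e.g. MyISAM), such a statement takes an implicit table lock while it reads from the replicated table, stalling the replication thread, so the scratch table should be created with a transactional engine (InnoDB or TokuDB):

```python
# Runnable illustration of the DELETE ... WHERE NOT EXISTS (SELECT ...)
# anti-join quoted in the log: delete scratch rows whose page no longer
# exists in the replicated table. Table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enwiki_page (page_id INTEGER);  -- stands in for a replica table
    CREATE TABLE user_ch_pl (page_id INTEGER);   -- the user's own scratch table
    INSERT INTO enwiki_page VALUES (1), (2);
    INSERT INTO user_ch_pl VALUES (1), (2), (3);
""")
conn.execute("""
    DELETE FROM user_ch_pl
    WHERE NOT EXISTS (
        SELECT 1 FROM enwiki_page
        WHERE enwiki_page.page_id = user_ch_pl.page_id
    )
""")
remaining = [r[0] for r in conn.execute(
    "SELECT page_id FROM user_ch_pl ORDER BY page_id")]
print(remaining)  # rows 1 and 2 survive; 3 had no match and was deleted
```

On MariaDB the equivalent fix would be `CREATE TABLE user_ch_pl (...) ENGINE=InnoDB`, so the statement runs under row locks instead of a full table lock on the tables it reads.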
[18:01:07] (03PS1) 10Rush: toollabs: bastion setup for cgred::group shell [puppet] - 10https://gerrit.wikimedia.org/r/284927 (https://phabricator.wikimedia.org/T131541)
[18:02:59] (03CR) 10Rush: [C: 032] toollabs: bastion setup for cgred::group shell [puppet] - 10https://gerrit.wikimedia.org/r/284927 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush)
[18:06:33] (03PS1) 10Rush: toollabs: bastion setup for cgred::group throttle [puppet] - 10https://gerrit.wikimedia.org/r/284928 (https://phabricator.wikimedia.org/T131541)
[18:07:47] (03CR) 10jenkins-bot: [V: 04-1] toollabs: bastion setup for cgred::group throttle [puppet] - 10https://gerrit.wikimedia.org/r/284928 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush)
[18:08:29] (03CR) 10Bmansurov: "No, the PS3 just links the task." [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson)
[18:08:47] (03PS2) 10Rush: toollabs: bastion setup for cgred::group throttle [puppet] - 10https://gerrit.wikimedia.org/r/284928 (https://phabricator.wikimedia.org/T131541)
[18:12:20] (03CR) 10Rush: [C: 032] toollabs: bastion setup for cgred::group throttle [puppet] - 10https://gerrit.wikimedia.org/r/284928 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush)
[18:28:39] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[18:32:59] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:34:49] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:37:09] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[18:38:29] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:38:38] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:39:18] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5156004 keys - replication_delay is 617
[18:41:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5149867 keys - replication_delay is 0
[19:09:32] !log mwscript deleteEqualMessages.php --wiki thwikibooks
[19:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:29:39] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[19:37:08] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[19:43:17] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[19:51:08] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: puppet fail
[20:07:28] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[20:13:38] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[20:19:38] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[20:21:29] (03CR) 10Dzahn: "per puppet module structure it would only be an init.pp in the module root but since this is a subdirectory of module sslcert it would be " [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn)
[20:36:31] !log reboot install2001 to PXE for OS upgrade
[20:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:38:08] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[20:44:18] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[21:09:18] (03PS1) 10Rush: cgred changes for toollabs bastions use case [puppet] - 10https://gerrit.wikimedia.org/r/284978 (https://phabricator.wikimedia.org/T131541)
[21:10:02] !log install2001 - reinstalled, re-adding to puppet etc
[21:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:10:22] (03PS2) 10Rush: cgred changes for toollabs bastions use case [puppet] - 10https://gerrit.wikimedia.org/r/284978 (https://phabricator.wikimedia.org/T131541)
[21:12:28] (03CR) 10Rush: [C: 032] cgred changes for toollabs bastions use case [puppet] - 10https://gerrit.wikimedia.org/r/284978 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush)
[21:19:58] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures
[21:21:10] !log short gaps in codfw ganglia expected due to install2001 being an aggregator
[21:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:37:25] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[21:37:26] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2231619 (10Dzahn)
[21:40:21] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2231622 (10Dzahn) I have upgraded install2001 to jessie. Since this is also the codfw ganglia aggregator we had a short interruption there but i checked the services and grap...
[21:43:34] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[21:46:35] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:48:09] 06Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2231624 (10Dzahn) @jcrespo Thank you! I should clarify, for the purposes of this ticket it was only about killing precise, not trusty. All it ever was was to track when we got rid of all precise. Basically just runnin...
[21:49:14] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: puppet fail
[21:52:03] 06Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2231639 (10Dzahn) would renaming it to "shutdown remaining precise db servers" be better? Or do you prefer it just to be closed completely?
[21:53:58] bblack: hola! are we using varnish 4 everywhere then?
[21:55:33] 06Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2231641 (10Papaul) 05Open>03Resolved This is complete.
[21:57:15] ori: yt?
[21:58:11] 06Operations, 10DBA: reimage db servers on precise - https://phabricator.wikimedia.org/T125028#2231656 (10Dzahn)
[21:59:01] 06Operations, 10DBA: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#1972522 (10Dzahn)
[22:00:45] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2231673 (10Dzahn)
[22:00:47] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2231672 (10Dzahn)
[22:01:45] ironic that WMF-NDA but on channel?
[22:03:14] 06Operations: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2231677 (10Dzahn)
[22:03:22] nuria_: hey
[22:03:42] ori: hola, fast question: https://phabricator.wikimedia.org/rEWMV017f9d845c45697c1e84b50d8a16e65f60b68fee
[22:03:59] ori: do this values still get set on X-Analytics?
[22:04:01] 06Operations: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#1818428 (10Dzahn) a:05Dzahn>03None
[22:04:38] 06Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2231680 (10Dzahn) also blocked on T38 currently
[22:05:00] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#555 (10Dzahn)
[22:05:02] 06Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2231682 (10Dzahn)
[22:05:06] nuria_: I don't know -- I don't monitor the header :) Are you looking and noticing it missing?
[22:05:10] 06Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#1936373 (10Dzahn) a:05Dzahn>03None
[22:06:06] ori: we recently had the discussion of what went into it and nobody mentioned the php layer adding page id but , as far as you know, that might still be happening?
[22:07:25] ori: it is not here: https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/analytics.inc.vcl.erb
[22:07:42] page_id and ns are getting set
[22:07:49] it's done in PHP code, in the XAnalytics extension
[22:08:02] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2231685 (10Dzahn) the install images have been copied by puppet to /srv/tftpboot, atftpd is running.. i saw no puppet errors
[22:08:05] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[22:08:25] nuria_: https://github.com/wikimedia/mediawiki-extensions-XAnalytics/blob/master/XAnalytics.class.php
[22:08:30] ori: right, it is not mention anywhere else though,
[22:08:44] ori: is that extension in use by default?
[22:08:47] yes
[22:08:53] isn't it used by the pageview api?
[22:09:12] ori: its values ?
[22:09:38] i thought so
[22:10:20] ori: k
[22:14:14] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[22:17:55] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:33:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 713 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5169785 keys - replication_delay is 713
[22:43:32] Someone here, who can tell me where I can find the code for mw unit tests?
[22:44:27] my fault, got it
[22:46:15] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5149121 keys - replication_delay is 0
[22:48:37] nuria_: nope!
[22:49:24] nuria_: (I mean nope on varnish4 everywhere - we're currently running 3 on some clusters and 4 on others - shared code has to be compatible with both for extra difficulty points)
[23:00:54] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 7 failures
[23:04:11] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2231730 (10BBlack) >>! In T96848#2230942, @ori wrote: >>>! In T96848#2147772, @BBlack wrote: >> [...] but using @ema's new systemtap stuff (which is way better than the sniffer-based s...
[23:08:10] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[23:09:21] (03PS5) 10Dzahn: create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812)
[23:11:49] (03PS6) 10Dzahn: create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812)
[23:14:10] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[23:14:16] (03PS7) 10Dzahn: create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812)
[23:25:11] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:26:05] !log reboot bast4001
[23:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:26:30] PROBLEM - HHVM rendering on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:26:51] PROBLEM - dhclient process on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:27:22] PROBLEM - nutcracker port on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:27:32] PROBLEM - nutcracker process on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:27:42] PROBLEM - Check size of conntrack table on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:27:51] PROBLEM - RAID on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:27:51] PROBLEM - DPKG on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:28:32] PROBLEM - configured eth on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:10] PROBLEM - SSH on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:29:11] PROBLEM - HHVM processes on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:32:41] PROBLEM - salt-minion processes on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:33:10] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:33:39] !log powercycled mw1141
[23:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:20] PROBLEM - Disk space on mw1141 is CRITICAL: Timeout while attempting connection
[23:36:00] RECOVERY - salt-minion processes on mw1141 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:36:01] RECOVERY - Disk space on mw1141 is OK: DISK OK
[23:36:10] RECOVERY - nutcracker port on mw1141 is OK: TCP OK - 0.000 second response time on port 11212
[23:36:31] RECOVERY - nutcracker process on mw1141 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[23:36:41] RECOVERY - RAID on mw1141 is OK: OK: no RAID installed
[23:36:51] RECOVERY - configured eth on mw1141 is OK: OK - interfaces up
[23:37:01] RECOVERY - dhclient process on mw1141 is OK: PROCS OK: 0 processes with command name dhclient
[23:37:30] RECOVERY - SSH on mw1141 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[23:37:31] RECOVERY - Check size of conntrack table on mw1141 is OK: OK: nf_conntrack is 6 % full
[23:37:31] RECOVERY - HHVM processes on mw1141 is OK: PROCS OK: 6 processes with command name hhvm
[23:37:41] RECOVERY - DPKG on mw1141 is OK: All packages OK
[23:37:51] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time
[23:38:22] RECOVERY - HHVM rendering on mw1141 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.157 second response time
[23:39:32] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[23:41:04] (03PS1) 10Dzahn: installserver: let bast4001 use install1001 [puppet] - 10https://gerrit.wikimedia.org/r/284990
[23:43:37] (03PS2) 10Dzahn: installserver: let bast4001 use install1001 [puppet] - 10https://gerrit.wikimedia.org/r/284990
[23:44:48] (03CR) 10Dzahn: [C: 032] installserver: let bast4001 use install1001 [puppet] - 10https://gerrit.wikimedia.org/r/284990 (owner: 10Dzahn)
[23:46:46] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2231788 (10ori) >>! In T96848#2231730, @BBlack wrote: >>>! In T96848#2230942, @ori wrote: >>>>! In T96848#2147772, @BBlack wrote: >>> [...] but using @ema's new systemtap stuff (which...