[00:28:27] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1313563 (10Dzahn) I ran this to obfuscate email addresses but keep them unique: ``` mysql> update profiles set login_name=concat(substring_index(login_name,'@',1), '@', substring(sh... [00:30:52] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1313566 (10Dzahn) ``` mysql> select userid,login_name,cryptpassword,realname from profiles where login_name like "dzahn%" or login_name like "aklapper%"; +--------+-------------------... [00:51:20] 6operations: change notification options in CyrusOne customer portal - https://phabricator.wikimedia.org/T100481#1313580 (10Dzahn) [00:53:55] 6operations: change notification options in CyrusOne customer portal - https://phabricator.wikimedia.org/T100481#1313583 (10Dzahn) p:5Triage>3High [00:55:31] 6operations: ftpsync@carbon - mirror sync - ERROR - https://phabricator.wikimedia.org/T100482#1313590 (10Dzahn) [00:59:15] 6operations: ftpsync@carbon - mirror sync - ERROR - https://phabricator.wikimedia.org/T100482#1313592 (10Dzahn) [carbon:/var/lib/mirror/archvsync/log] $ grep -ri error * | grep "May 25" ftpsync.log.10:May 25 07:17:47 carbon ftpsync[3050]: ERROR: Sync step 2 went wrong, got errorcode 1. Logfile: /var/lib/mirror/a... [01:09:16] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1313597 (10BBlack) cmjohnson1 did move it to ge-2/0/37 without any change in behavior, so it's not specific to the switch port. I suppose we could switch the cables for 1035/1036 and get some extra validat... [01:24:55] (03PS1) 10Dzahn: remove amssq31-62 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/213965 (https://phabricator.wikimedia.org/T95742) [01:26:59] (03PS1) 10Dzahn: remove amss31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) [01:29:16] (03PS2) 10Dzahn: remove amssq31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) [01:29:42] (03PS1) 10Dzahn: torrus: remove amssq nodes from tests/cdn.pp [puppet] - 10https://gerrit.wikimedia.org/r/213968 (https://phabricator.wikimedia.org/T95742) [01:32:14] (03PS1) 10Dzahn: rolematcher: remove amssq nodes [puppet] - 10https://gerrit.wikimedia.org/r/213969 (https://phabricator.wikimedia.org/T95742) [01:40:59] 6operations: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1313630 (10Dzahn) Hi, all mail going to lists also goes through spamassassin and gets a score. If that score is over a certain threshold it gets rejected. Example from exim log on the list server: "rejected af... [01:41:58] 6operations: ftpsync@carbon - mirror sync - ERROR - https://phabricator.wikimedia.org/T100482#1313631 (10Dzahn) p:5Triage>3Normal [01:42:42] 10Ops-Access-Requests, 6operations: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1313632 (10Dzahn) @dbrant Hi, can we have approval language from your manager on the ticket please to move this forward? 
Thank you [01:46:23] (03PS4) 10Dzahn: Make the DNS server for .wmflabs configurable [puppet] - 10https://gerrit.wikimedia.org/r/211063 (owner: 10Andrew Bogott) [01:46:25] (03PS3) 10Dzahn: remove amss31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) [01:46:27] (03PS6) 10Dzahn: dnsrecursor: ensure => 'present' rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 (owner: 10Andrew Bogott) [01:46:29] (03PS7) 10Dzahn: Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [01:46:31] (03PS2) 10Dzahn: rolematcher: remove amssq nodes [puppet] - 10https://gerrit.wikimedia.org/r/213969 (https://phabricator.wikimedia.org/T95742) [01:46:32] eh, what nooo [01:46:33] (03PS2) 10Dzahn: torrus: remove amssq nodes from tests/cdn.pp [puppet] - 10https://gerrit.wikimedia.org/r/213968 (https://phabricator.wikimedia.org/T95742) [01:46:35] (03PS1) 10Dzahn: access: add dbrant to researchers [puppet] - 10https://gerrit.wikimedia.org/r/213970 (https://phabricator.wikimedia.org/T99798) [01:46:53] damn, i did not want to create these :p [01:47:53] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1313642 (10Gage) I discussed this problem with a friend in neteng at Twitter, who says he has seen similar behavior in Juniper switches before. He recommends, and I agree: let's reboot the switch (asw-d2-eq... [01:49:06] (03CR) 10Dzahn: "@AndrewBogott I did not want to amend to this, it was by accident. Sorry if it changed anything. Please double check." [puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [01:59:58] (03CR) 10BBlack: [C: 031] remove amssq31-62 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/213965 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:00:43] (03CR) 10BBlack: [C: 031] remove amss31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:01:08] (03CR) 10BBlack: [C: 031] rolematcher: remove amssq nodes [puppet] - 10https://gerrit.wikimedia.org/r/213969 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:01:25] (03CR) 10BBlack: [C: 031] torrus: remove amssq nodes from tests/cdn.pp [puppet] - 10https://gerrit.wikimedia.org/r/213968 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:02:23] mutante: seems like probably you were operating on the recursor branch with these? but I think they can rebased and repushed independently. [02:04:42] 6operations, 10ops-esams, 10Traffic, 5Patch-For-Review: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1313649 (10BBlack) Aside from the 4 tickets above to clean out the puppet+dns repos, I think the only thing we're really missing from the Server Lifecycle doc here is disable... 
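For reference on the Bugzilla-dump sanitisation discussed at the top of this log: the UPDATE against the `profiles` table is truncated above, but the stated goal is to obfuscate e-mail addresses while keeping them unique. A minimal sketch of that general idea (keep the local part, replace the domain with a hash of the full address); the table/column names come from the quoted query, everything else here is illustrative rather than the exact statement Dzahn ran:

```python
import hashlib

def obfuscate_login(login_name):
    """Keep the local part of an address but replace the domain with a
    short hash of the full address, so sanitised values stay unique."""
    local, _, _domain = login_name.partition('@')
    digest = hashlib.sha1(login_name.encode('utf-8')).hexdigest()[:12]
    return '{}@{}.invalid'.format(local, digest)

# e.g. obfuscate_login('someone@example.org') -> 'someone@<12 hex chars>.invalid'
```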
[02:05:24] I would mess with them, but then it might screw up something you're trying to do to disentangle them in your checkout [02:06:29] bblack: yes, i was on the wrong branch and then told git review it's ok, my bad [02:06:36] usually it works :p [02:07:46] (03CR) 10Dzahn: [C: 032] remove amss31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:08:40] (03PS4) 10Dzahn: remove amss31-62 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/213966 (https://phabricator.wikimedia.org/T95742) [02:11:02] (03PS3) 10Dzahn: torrus: remove amssq nodes from tests/cdn.pp [puppet] - 10https://gerrit.wikimedia.org/r/213968 (https://phabricator.wikimedia.org/T95742) [02:11:31] (03CR) 10Dzahn: [C: 032] torrus: remove amssq nodes from tests/cdn.pp [puppet] - 10https://gerrit.wikimedia.org/r/213968 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:12:27] (03PS3) 10Dzahn: rolematcher: remove amssq nodes [puppet] - 10https://gerrit.wikimedia.org/r/213969 (https://phabricator.wikimedia.org/T95742) [02:12:35] (03CR) 10Dzahn: [C: 032] rolematcher: remove amssq nodes [puppet] - 10https://gerrit.wikimedia.org/r/213969 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [02:21:16] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1313667 (10Dzahn) So we need to find 15 sites we can delete to replace them with the 15 missing https versions of en, de, zh, ru, it, es, fr, ja, pt, tr, nl, pl, ar, ko, hi ? [02:22:54] 6operations, 10ops-esams, 10Traffic, 5Patch-For-Review: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1313668 (10BBlack) Switch work done (also disabled/unconfigured the old amslvs1-4 ports, missed back when those were decommed) [02:24:35] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 06m 45s) [02:24:45] Logged the message, Master [02:27:53] 6operations, 10Traffic, 5Patch-For-Review: reinstall/rename dysprosium as cp1099 (upload eqiad) - https://phabricator.wikimedia.org/T96873#1313669 (10BBlack) Edited racktables for server rename [02:28:20] 6operations, 10MediaWiki-extensions-SecurePoll, 3Elections, 7I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1313670 (10Dzahn) [02:29:10] 6operations, 10MediaWiki-extensions-SecurePoll, 3Elections, 7I18n, 7network: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1313672 (10Dzahn) [02:29:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [02:29:37] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-27 02:28:34+00:00 [02:29:43] Logged the message, Master [02:30:30] (03PS1) 10BBlack: rename: dysprosium->cp1099 T96873 [dns] - 10https://gerrit.wikimedia.org/r/213971 [02:30:54] (03CR) 10BBlack: [C: 032] rename: dysprosium->cp1099 T96873 [dns] - 10https://gerrit.wikimedia.org/r/213971 (owner: 10BBlack) [02:34:16] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (90091s 90000s) [02:34:50] (03PS1) 10BBlack: s/dysprosium/cp1099/ T96873 [puppet] - 10https://gerrit.wikimedia.org/r/213972 [02:38:46] (03PS1) 10BBlack: rename: dysprosium->cp1099 T96873 [dns] - 10https://gerrit.wikimedia.org/r/213974 [02:38:57] (03CR) 10BBlack: [C: 032] s/dysprosium/cp1099/ T96873 [puppet] - 10https://gerrit.wikimedia.org/r/213972 (owner: 10BBlack) [02:39:09] (03CR) 
10BBlack: [C: 032] rename: dysprosium->cp1099 T96873 [dns] - 10https://gerrit.wikimedia.org/r/213974 (owner: 10BBlack) [02:48:26] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 52s) [02:48:36] Logged the message, Master [02:53:28] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-27 02:52:25+00:00 [02:53:34] Logged the message, Master [03:18:30] (03PS2) 10BBlack: remove amssq31-62 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/213965 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [03:20:27] (03PS3) 10BBlack: remove amssq31-62 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/213965 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [03:21:31] (03CR) 10BBlack: [C: 032] remove amssq31-62 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/213965 (https://phabricator.wikimedia.org/T95742) (owner: 10Dzahn) [03:23:25] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [03:25:10] 6operations, 10ops-esams, 10Traffic: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1313718 (10BBlack) a:3mark [03:25:31] 6operations, 10ops-esams, 10Traffic: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#1199200 (10BBlack) All software-level work is done, just needs on-site decom. [03:26:13] RECOVERY - Host cp1099 is UPING OK - Packet loss = 0%, RTA = 0.57 ms [03:29:48] (03PS1) 10BBlack: add cp1099 to upload cache list [puppet] - 10https://gerrit.wikimedia.org/r/213975 [03:30:11] (03CR) 10BBlack: [C: 032 V: 032] add cp1099 to upload cache list [puppet] - 10https://gerrit.wikimedia.org/r/213975 (owner: 10BBlack) [03:30:55] 6operations, 10Traffic, 5Patch-For-Review: reinstall/rename dysprosium as cp1099 (upload eqiad) - https://phabricator.wikimedia.org/T96873#1313724 (10BBlack) 5Open>3Resolved a:3BBlack [03:35:58] 6operations, 10Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1313729 (10BBlack) Update: The target here is still 2.2.0 on jessie for all NS boxes, current status is: rubidium: 2.1.0 on precise baham: 2.1.0 on trusty eeden: 2.1.2 on jessie [03:39:15] (03PS1) 10Chmarkine: noc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/213976 (https://phabricator.wikimedia.org/T40516) [03:50:37] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1313732 (10MZMcBride) I'm not totally sure I understand why a dump approach is being pursued here when there are other options available. In a restricted environment, why not simply... 
[03:55:33] PROBLEM - puppet last run on db2003 is CRITICAL puppet fail [03:58:13] PROBLEM - puppet last run on terbium is CRITICAL Puppet has 1 failures [04:12:33] RECOVERY - puppet last run on db2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:15:13] RECOVERY - puppet last run on terbium is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:33:38] Krenair: https://lists.wikimedia.org/pipermail/newprojects/2015-May/000098.html [05:07:23] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (18242 90000s) [05:13:30] 6operations, 10wikitech.wikimedia.org: wikitech-static CRIT - wikitech and wikitech-static out of sync (94951s > 90000s) - https://phabricator.wikimedia.org/T100485#1313756 (10Dzahn) [05:14:37] 6operations, 10wikitech.wikimedia.org: wikitech-static CRIT - wikitech and wikitech-static out of sync (94951s > 90000s) - https://phabricator.wikimedia.org/T100485#1313746 (10Dzahn) 22:08 < icinga-wm> RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-sta... [05:14:46] 6operations, 10wikitech.wikimedia.org: wikitech-static CRIT - wikitech and wikitech-static out of sync (94951s > 90000s) - https://phabricator.wikimedia.org/T100485#1313759 (10Dzahn) p:5Triage>3Low [05:23:39] (03PS1) 10Dzahn: adjust wikitech-static sync monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/213978 (https://phabricator.wikimedia.org/T100485) [05:24:27] (03CR) 10Dzahn: [C: 032] adjust wikitech-static sync monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/213978 (https://phabricator.wikimedia.org/T100485) (owner: 10Dzahn) [05:25:53] 6operations, 10wikitech.wikimedia.org, 5Patch-For-Review: wikitech-static CRIT - wikitech and wikitech-static out of sync (94951s > 90000s) - https://phabricator.wikimedia.org/T100485#1313775 (10Dzahn) 5Open>3Resolved a:3Dzahn upped from 90000 to 100000 seconds [05:27:53] (03CR) 10Dzahn: [C: 031] noc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/213976 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [05:30:04] "All 283,299 conversations in "Monitor/Cron" are selected. Clear selection."
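The wikitech-static alert resolved above is a plain freshness check: the age of the last successful sync is compared against a threshold, and the threshold was raised because the sync interval had crept past the old 90000s limit. A rough sketch of that kind of check; the timestamp file path is hypothetical and the real check on silver differs in its details:

```python
import os
import sys
import time

# Hypothetical location of a timestamp touched by the sync job.
STAMP_FILE = '/var/lib/wikitech-static/last-sync'
THRESHOLD = 100000  # seconds, cf. the bump from 90000 in the change above

def check_sync_age(stamp_file=STAMP_FILE, threshold=THRESHOLD):
    """Return a Nagios-style exit code based on how stale the sync is."""
    age = int(time.time() - os.path.getmtime(stamp_file))
    if age > threshold:
        print('CRITICAL: wikitech and wikitech-static out of sync (%ds > %ds)'
              % (age, threshold))
        return 2
    print('OK: wikitech and wikitech-static in sync (%ds < %ds)' % (age, threshold))
    return 0

if __name__ == '__main__':
    sys.exit(check_sync_age())
```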
[05:30:07] :o good night [05:48:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed May 27 05:47:25 UTC 2015 (duration 47m 24s) [05:48:37] Logged the message, Master [05:49:25] PROBLEM - High load average on ms-be1001 is CRITICAL - load average: 227.48, 137.27, 64.32 [06:00:04] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:09:51] (03CR) 10Springle: [C: 031] Update of parsercache db servers to MariaDB 10 (#1) [puppet] - 10https://gerrit.wikimedia.org/r/213784 (owner: 10Jcrespo) [06:10:27] ^thanks [06:14:47] (03PS5) 10Jcrespo: Update of parsercache db servers to MariaDB 10 (#1) [puppet] - 10https://gerrit.wikimedia.org/r/213784 [06:16:27] (03CR) 10Jcrespo: [C: 032] Update of parsercache db servers to MariaDB 10 (#1) [puppet] - 10https://gerrit.wikimedia.org/r/213784 (owner: 10Jcrespo) [06:16:35] 6operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1313810 (10jcrespo) [06:30:06] PROBLEM - puppet last run on wtp2015 is CRITICAL puppet fail [06:31:37] (03PS1) 10KartikMistry: CX: Add wikis for CX deployment on 20150528 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213992 (https://phabricator.wikimedia.org/T99535) [06:34:35] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 1 failures [06:34:45] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:35:04] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:35:14] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 1 failures [06:35:15] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 2 failures [06:35:44] !log clone dbstore2001 data to dbstore2002 [06:35:49] Logged the message, Master [06:47:38] (03PS1) 10Jcrespo: Small configuration fix for pc1001 database [puppet] - 10https://gerrit.wikimedia.org/r/213997 (https://phabricator.wikimedia.org/T100301) [06:48:22] (03CR) 10jenkins-bot: [V: 04-1] Small configuration fix for pc1001 database [puppet] - 10https://gerrit.wikimedia.org/r/213997 (https://phabricator.wikimedia.org/T100301) (owner: 10Jcrespo) [06:50:11] (03PS2) 10Jcrespo: Small configuration fix for pc1001 database [puppet] - 10https://gerrit.wikimedia.org/r/213997 (https://phabricator.wikimedia.org/T100301) [06:51:18] (03CR) 10Jcrespo: [C: 032] Small configuration fix for pc1001 database [puppet] - 10https://gerrit.wikimedia.org/r/213997 (https://phabricator.wikimedia.org/T100301) (owner: 10Jcrespo) [07:26:21] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:27:01] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:22] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:31] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:41] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:29:11] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:23] (03PS2) 10Alexandros Kosiaris: Setup ganeti100X with RAID1 [puppet] - 10https://gerrit.wikimedia.org/r/213958 [07:41:54] PROBLEM - Host ganeti1002 is DOWN: PING CRITICAL - Packet loss = 100% [07:41:54] PROBLEM - Host ganeti1003 is DOWN: PING CRITICAL - Packet loss = 100% [07:48:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL 
Anomaly detected: 10 data above and 1 below the confidence bounds [07:48:52] (03CR) 10Alexandros Kosiaris: [C: 032] Setup ganeti100X with RAID1 [puppet] - 10https://gerrit.wikimedia.org/r/213958 (owner: 10Alexandros Kosiaris) [08:00:45] 6operations, 10ops-requests: Update ruby-jsduck package to v5.3.4 - https://phabricator.wikimedia.org/T83282#1313963 (10akosiaris) IIRC it was something like * apt-get source ruby-jsduck * wget * Muck around with it until it builds correctly * Package ruby-rkelly-remix (because ruby.... [08:11:24] Morning [08:11:36] Getting a slow response out of Wikisource (GB based) [08:12:56] Not a major issue but thought you might want to check response times [08:13:29] Qcoder00, any specific action (edit, read). Logged in? [08:13:38] Editing... [08:13:50] and proofread oage, so access to images [08:13:52] *page [08:14:29] Logged in, [08:18:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Seems like I built that package without those changes. The entire diff of the repo vs debian dir from apt.wikimedia.org package is:" (034 comments) [debs/ruby-jsduck] - 10https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) (owner: 10Dzahn) [08:21:14] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1313971 (10akosiaris) @Dzahn, I left some comments in https://gerrit.wikimedia.org/r/#/c/213954/ . As far as the "5.3.4-1w... [08:24:43] Qcoder00, cannot reproduce, but it is difficult to say something 100% certain. We will keep an eye on it. Report back if you see an error! [08:24:53] Will do [08:25:13] As i said this about response times, not anything broken yet [08:37:53] (03PS2) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [08:38:26] akosiaris: review ^ :) [08:44:04] RECOVERY - Host analytics1036 is UPING OK - Packet loss = 0%, RTA = 1.12 ms [08:47:46] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] CX: Log to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [08:56:34] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1314031 (10faidon) I did a VRRP master switchover to cr1 and ARP was restored. A switchover back to cr2 made ARPs fail again. This looks unlikely to be a host issue at this point and is probably a switch or... [09:05:15] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [09:07:52] akosiaris, hi, around? Please reply https://phabricator.wikimedia.org/T97638 [09:10:59] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1314056 (10akosiaris) Hello, Yes, the postgresql cluster in pgsql.eqiad.wmnet has been initialized with UT... [09:12:21] yurik: ^ done [09:13:58] (03PS1) 1020after4: bump phabricator release tag. 
[puppet] - 10https://gerrit.wikimedia.org/r/214014 [09:28:16] (03PS1) 10Faidon Liambotis: mirrors: update ftpsync version [puppet] - 10https://gerrit.wikimedia.org/r/214018 [09:28:58] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: update ftpsync version [puppet] - 10https://gerrit.wikimedia.org/r/214018 (owner: 10Faidon Liambotis) [09:30:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 609 [09:35:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 3530976 Threads: 1 Questions: 12011857 Slow queries: 23439 Opens: 56829 Flush tables: 2 Open tables: 64 Queries per second avg: 3.401 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:37:46] db1008? [09:38:52] ah, ok [09:49:38] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1314106 (10faidon) I don't think we should pin specific certificates. It sounds way too risky to me, even with backup keys, and the benefit compared to pinning roots is small:... [09:50:26] eh, wp is somewhere between crawling and dead for me [09:50:57] or was that comcast? [09:51:13] looks fine now, nvm [09:59:24] !log powercycling ms-be1001; dead, console unresponsive [09:59:33] Logged the message, Master [10:00:53] (03PS14) 10Paladox: Rename $wmincClosedWikis to $wgWmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [10:02:25] RECOVERY - Host ms-be1001 is UPING OK - Packet loss = 0%, RTA = 2.23 ms [10:02:25] RECOVERY - High load average on ms-be1001 is OK - load average: 21.32, 4.70, 1.54 [10:07:13] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1314128 (10Aklapper) Neat. I like that. [10:08:14] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [10:13:31] (03CR) 10Alexandros Kosiaris: [C: 031] "Patch seems fine but I am with Giuseppe on this. How about we only run the sshknowgen on the puppetmasters, populate a file and ship it vi" [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [10:22:13] (03CR) 10Alexandros Kosiaris: "Note that there is a problem here with eventual consistency by doing it on the puppetmasters. That is from the moment we add a new host, i" [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [10:24:13] PROBLEM - salt-minion processes on ganeti1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:25:22] PROBLEM - salt-minion processes on ganeti1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:26:03] PROBLEM - salt-minion processes on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:28:55] (03PS1) 10Jcrespo: Repool pc1001 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214025 [10:29:32] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [10:30:23] jynus: are you working on pc100x? [10:30:30] paravoid, yes [10:30:35] any issue? [10:30:44] just a puppet issue [10:31:01] possibly too late to fix :( [10:31:15] please, tell! [10:31:29] not in production yet [10:33:56] akosiaris: Hi can you comment on https://phabricator.wikimedia.org/T78056 ? [10:34:27] paravoid? 
[10:34:39] (03PS1) 10Faidon Liambotis: Add system => true to mysql_wmf::mysqluser [puppet] - 10https://gerrit.wikimedia.org/r/214027 [10:34:41] (03PS1) 10Faidon Liambotis: Remove txstatsd module and role class [puppet] - 10https://gerrit.wikimedia.org/r/214028 [10:34:43] (03PS1) 10Faidon Liambotis: ve: switch jsbench/chromium user/groups to system [puppet] - 10https://gerrit.wikimedia.org/r/214029 [10:34:45] jynus: https://gerrit.wikimedia.org/r/214027 [10:34:45] (03PS1) 10Faidon Liambotis: admin: use $LAST_SYSTEM_UID in enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/214030 [10:36:08] funny^ I am ok with that- but I didn't do a sys install [10:38:12] actually, that will die [10:38:47] so not a problem from puppet point of view- actual application is another thing [10:38:56] paravoid ^ [10:39:20] die? [10:39:22] role will change to mariadb10 [10:39:22] what will die? [10:39:27] oh [10:39:33] with no mysqluser [10:39:34] ok [10:40:27] I will comment on the commit [10:40:48] jynus: http://p.defau.lt/?_K2R6dqw0Az4lNouwpoYKQ -- it's a bit of a mess [10:40:55] both user & group should be < 1000 ideally [10:41:20] yep, I agree [10:41:45] is it on puppet already? [10:41:51] what is? [10:42:13] problem is that there are several roles now [10:42:33] old mysql::core, mariadb and mariadb10 [10:43:15] and of course it cannot be changed live except on maintenance [10:43:27] right [10:46:00] paravoid, I will make sure it gets added to the newest role -if it is not yet there- and it will be *very* slowly rolled in [10:46:19] I think it is for the user, but not for the group [10:48:08] user { 'mysql': (...) system => true <--- true [10:48:15] right [10:48:26] what's a very newly-provisioned db system? [10:48:42] backwards compatibility is the problem :-) [10:48:50] let me remember [10:49:21] db1009 was recently installed from 0 [10:49:30] and now it should be in production [10:49:30] {'db1009.eqiad.wmnet': 'uid=997(mysql) gid=1000(mysql) groups=1000(mysql)'} [10:49:33] right [10:49:34] so that's system => true [10:49:47] yep, actually, I remember you or someone else adding that [10:49:48] but the group isn't :( [10:50:00] I will take care, thank you [10:50:08] :-) [10:51:06] $ egrep 'uid=[0-9]{4}' mysql |wc -l [10:51:06] 45 [10:52:09] pc1001-1003, es1001-es1010 and 32 db10xx [10:52:29] anyway, not that important, I was just cleaning up the enforce-users-groups script a bit [10:54:08] no, it is ok [10:54:38] hrm [10:54:42] site issues [10:54:51] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-2hours&from=-2hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [10:58:32] jynus: did you sync-file after those mediawiki-maintenance changes? 
[10:58:43] paravoid, not yet [10:58:53] and I will wait if there are problems [10:59:02] ah right, this isn't merged yet [10:59:02] hrm [11:04:58] lots of "Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted.", some OOMs [11:05:12] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [11:10:02] RECOVERY - salt-minion processes on ganeti1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:10:23] RECOVERY - salt-minion processes on ganeti1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:10:24] (03PS1) 10Faidon Liambotis: udp2log: add conntrack exception firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/214034 [11:10:43] RECOVERY - salt-minion processes on ganeti1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:10:54] (03CR) 10Faidon Liambotis: [C: 032] udp2log: add conntrack exception firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/214034 (owner: 10Faidon Liambotis) [11:12:52] <_joe_> akosiaris: \o/ for ganeti100x [11:16:14] !log rebooting ganeti100{1..4} for bridge networking configuration [11:16:22] Logged the message, Master [11:18:23] PROBLEM - Host ganeti1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:23] PROBLEM - Host ganeti1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:26] (03PS1) 10Alexandros Kosiaris: Assign Hostname/IPs to the etcd cluster @ eqiad [dns] - 10https://gerrit.wikimedia.org/r/214035 [11:18:34] PROBLEM - Host ganeti1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:16] _joe_: https://gerrit.wikimedia.org/r/214034 does this look OK to you. I could not find the task about it in phab [11:20:10] <_joe_> akosiaris: uh, what is this about? I don't remember [11:21:31] _joe_: pebkac, that's what it is about. I meant this https://gerrit.wikimedia.org/r/214034 [11:21:40] lol [11:21:44] https://gerrit.wikimedia.org/r/214035 [11:21:47] finally! [11:21:56] <_joe_> ahah [11:23:53] (03PS3) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [11:26:01] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1314192 (10faidon) 3NEW a:3ArielGlenn [11:26:25] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [11:26:26] PROBLEM - Host ganeti1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:35] since when project logos are located at /static ? [11:28:53] grrrit-wm seems to be slacking [11:29:12] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1314209 (10ArielGlenn) testing was done on a test docker cluster, then again in deployment-prep, then repeated checks were done on the production cluster and on labs. investigating. 
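Related to the mysql uid/gid cleanup discussed above ("both user & group should be < 1000 ideally", and the `egrep 'uid=[0-9]{4}'` tally of hosts with non-system ids): a small sketch of how one might flag service accounts that ended up outside the system id range. The 1000 cut-off follows the Debian convention mentioned in the conversation; this is only an illustration, not the enforce-users-groups script itself:

```python
import pwd

SYSTEM_ID_MAX = 1000  # ids below this are "system" ids on Debian-style systems

def non_system_service_accounts(names=('mysql',)):
    """Yield (name, uid, gid) for accounts whose uid or primary gid
    falls outside the system range."""
    for name in names:
        try:
            entry = pwd.getpwnam(name)
        except KeyError:
            continue
        if entry.pw_uid >= SYSTEM_ID_MAX or entry.pw_gid >= SYSTEM_ID_MAX:
            yield name, entry.pw_uid, entry.pw_gid

if __name__ == '__main__':
    for name, uid, gid in non_system_service_accounts():
        print('%s: uid=%d gid=%d' % (name, uid, gid))
```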
[11:29:25] RECOVERY - Host ganeti1003 is UPING OK - Packet loss = 0%, RTA = 1.38 ms [11:29:25] RECOVERY - Host ganeti1001 is UPING OK - Packet loss = 0%, RTA = 1.05 ms [11:29:26] RECOVERY - Host ganeti1002 is UPING OK - Packet loss = 0%, RTA = 0.52 ms [11:29:36] RECOVERY - Host ganeti1004 is UPING OK - Packet loss = 0%, RTA = 2.40 ms [11:34:35] I will wait for commiting changes after lunch [11:38:40] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1314219 (10BBlack) I still think pinning a set of certs is doable and even safer, but perhaps we can re-debate that later on, when we have some better infrastructure for storin... [11:47:52] (03PS4) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [11:52:52] (03PS1) 10Alex Monk: Add option to mwgrep to hide private wiki results [puppet] - 10https://gerrit.wikimedia.org/r/214037 [11:57:26] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:26] RECOVERY - Host mw2027 is UPING WARNING - Packet loss = 61%, RTA = 43.38 ms [12:09:56] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1314257 (10zhuyifei1999) @AddisWang Would you check your meta talk page please? [12:11:45] Please can all the branches be pushed to the github mediawiki mirror? They were there until yesterday morning. https://github.com/wikimedia/mediawiki/branches [12:11:47] in #mediawiki [12:12:00] https://github.com/wikimedia/mediawiki/branches - why has gerrit seemingly only replicated 2 branches? [12:12:15] anything in the logs about it? [12:16:10] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1314263 (10MZMcBride) Personally, I disagree with the premise that a sanitized dump is needed. This task has already consumed a non-trivial amount of operations time and it's complete... [12:28:11] (03PS5) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [12:34:36] (03CR) 10MZMcBride: "Hmmm. I wonder if the option should be inverted (i.e., "--include-private")." [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [12:39:17] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1314291 (10faidon) The tricky part is finding CAs that are able to issue our unified certificate. I don't think RapidSSL can do that but we'd have to double-check. 
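On the HPKP thread above: whichever level gets pinned (leaf, intermediate or root), a pin-sha256 value is the base64-encoded SHA-256 digest of the certificate's DER-encoded SubjectPublicKeyInfo. A sketch of computing one from a PEM certificate, assuming the Python `cryptography` library is available; the file path is whatever certificate you want to pin:

```python
import base64
import hashlib
import sys

from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization

def spki_pin_sha256(pem_path):
    """Return the HPKP pin-sha256 value for the public key in a PEM cert."""
    with open(pem_path, 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())
    spki = cert.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo)
    return base64.b64encode(hashlib.sha256(spki).digest()).decode('ascii')

if __name__ == '__main__':
    print('pin-sha256="%s"' % spki_pin_sha256(sys.argv[1]))
```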
[12:46:47] (03Abandoned) 10Faidon Liambotis: Add PROXY protocol v2 send code [debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82428 (owner: 10Mark Bergsma) [12:46:52] (03Abandoned) 10Faidon Liambotis: Add --write-proxy2 configuration option [debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82427 (owner: 10Mark Bergsma) [12:51:34] (03Abandoned) 10Alexandros Kosiaris: Use hiera to have ytterbium listen only on it's IP address [puppet] - 10https://gerrit.wikimedia.org/r/185434 (owner: 10Alexandros Kosiaris) [12:54:52] (03Abandoned) 10Faidon Liambotis: Add apparmor profiles for avconv/ffmpeg2theora [puppet] - 10https://gerrit.wikimedia.org/r/38307 (https://phabricator.wikimedia.org/T42099) (owner: 10J) [12:55:31] (03PS2) 10Alexandros Kosiaris: Assign Hostname/IPs to the etcd cluster @ eqiad [dns] - 10https://gerrit.wikimedia.org/r/214035 [12:55:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Assign Hostname/IPs to the etcd cluster @ eqiad [dns] - 10https://gerrit.wikimedia.org/r/214035 (owner: 10Alexandros Kosiaris) [12:57:15] akosiaris: all on the same subnet/row? [12:57:33] that doesn't look good [12:58:22] (03CR) 10Faidon Liambotis: "All on the same subnet/row is not a very good idea for a service that a) we want to be redundant b) runs three times in a quorum for preci" [dns] - 10https://gerrit.wikimedia.org/r/214035 (owner: 10Alexandros Kosiaris) [12:58:40] paravoid: yeah, for now. Different racks though [12:58:42] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1314333 (10Ottomata) Ori has made a python module to interact with varnish shared logs via the C API: https://gerrit.wikimedia.org/r/#/c/213293/ I should probably use this rather than varnishncsa. Tho... [12:58:55] (03CR) 10Jcrespo: [C: 032] Repool pc1001 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214025 (owner: 10Jcrespo) [12:59:14] paravoid: we only got ganeti cluster on one row at eqiad for now [12:59:46] ^ is this a good moment? [13:00:41] I've seen the 50X go down [13:00:48] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1314335 (10faidon) I'd like that a lot, yes, as long as we can make it perform well enough. Ori's initial version was too time-consuming; he's reworked it since to perform better but the tradeoff was (AI... [13:03:28] (03PS3) 10Faidon Liambotis: bd808: Add second ssh public key for personal laptop [puppet] - 10https://gerrit.wikimedia.org/r/207369 (owner: 10BryanDavis) [13:03:37] (03CR) 10Faidon Liambotis: [C: 032] bd808: Add second ssh public key for personal laptop [puppet] - 10https://gerrit.wikimedia.org/r/207369 (owner: 10BryanDavis) [13:07:21] (03CR) 10Jcrespo: [C: 031] Add system => true to mysql_wmf::mysqluser [puppet] - 10https://gerrit.wikimedia.org/r/214027 (owner: 10Faidon Liambotis) [13:08:03] akosiaris: why? [13:08:29] (03CR) 10Jcrespo: "I am ok with this, although this will be a legacy class someday (mysql core/old parsercache)." [puppet] - 10https://gerrit.wikimedia.org/r/214027 (owner: 10Faidon Liambotis) [13:09:32] mark: networking limitation on the cluster level and our architecture [13:09:57] mark aka can't have a VM move to another row due to how we handle row networking at the router [13:10:05] why is that? [13:10:19] the subnets are defined per row ? [13:10:26] so? [13:10:33] perhaps that's something that should have been raised before then [13:11:11] we can make a special setup for it etc [13:11:16] what do you mean so. 
How can a VM with IP address belonging in public1-a-eqiad show up in public1-b-eqiad ? [13:11:25] why would it need to be in public1-a-eqiad? [13:11:46] eer I meant private1-a-eqiad [13:12:10] but yeah, a VM can't really go cross row without being renumbered [13:12:26] we could create a new subnet for it with different topology [13:12:28] same way a physical host can't [13:12:49] I 'd like that [13:13:10] or creating an overlay switching environment [13:13:14] using openvswitch [13:13:19] or vxlan [13:13:34] uh oh, here come the buzzwords ;) [13:13:39] :P [13:13:45] !log jynus Synchronized wmf-config/db-eqiad.php: repool pc1001 (duration: 00m 13s) [13:13:53] Logged the message, Master [13:14:02] mark: sans bullshit sans font for ya [13:15:14] pretending it's not a VM but a hardware box with teleportation capabilities and assuming free U space , how would you propose we solved it ? [13:16:34] we could try MC-LAG with it [13:16:46] uh oh.. [13:16:58] I 've actually been bitten badly by it [13:17:09] well, granted it was on a nexus 7000+2000 environment [13:17:31] it might work better on Junipers [13:17:39] everything works better on junipers [13:18:03] I thought though you did not like LACP [13:18:15] me? why? [13:18:31] we're about to rely on etcd for configuring pybal to control our whole varnish/appserver fleet, I think we can spare three machines [13:18:50] can one of them be in the other dc? [13:19:26] that'd be even better but I remember hearing _joe_ saying bad things about cross-DC etcd, although I may be misremembering [13:19:37] i see [13:19:45] I think so too [13:19:56] btw, we did not do that for zookeeper either [13:20:10] zk runs on the same box as the kafka brokers, IIRC [13:20:33] connections on pc1001 seem to be going down, which is a good thing [13:20:36] precisely. We did not spare 3 boxes per DC for zookeeper [13:21:12] btw, it is actually possible that we do that on multiple rows and still leveraging ganeti without changing anything in our networking [13:21:25] just another cluster on another row [13:21:38] currently zk does not run on same boxes as kafka [13:21:40] that'd be a good start yeah [13:21:46] ottomata: oh really? [13:21:47] but that has been a suggestion for cache DC kafka clusters [13:21:54] does it run on dedicated boxes? [13:21:56] yes [13:22:02] hmm [13:22:05] so, we could run zk + etcd on the same three set of hosts [13:22:07] an1023-1025 [13:22:18] yes [13:25:00] ottomata: I remember we were running the CDH version of ZK, is this still the case? [13:25:15] no, we never were [13:25:22] we weren't? [13:25:23] we started to but then you wanted us to use deb version so we did [13:25:26] oh [13:25:30] good! [13:25:38] the cdh jar is a dep for some cdh packages [13:25:45] which means, we can't install the deb package on hadoop nodes [13:25:46] paravoid: you don't even remember your influence :P [13:25:49] yeah :) [13:26:01] but, that isnt' a big deal [13:26:08] so we use the deb zookeeper server [13:26:16] but some zk clients use the cdh version [13:26:17] he was just mumbling that in his sleep [13:26:30] ottomata: does the plan of sharing zk + etcd on three distinct boxes make sense to you? 
[13:26:31] "need to use debian package everywhere" [13:26:33] kinda risky with possible version mismatches, but the versions are close enough [13:26:40] paravoid: i thikn so, don't see why not [13:26:50] unless someone was doing something really intense with zk, should be fine [13:27:06] !log restarting Jenkins for java upgrade [13:27:11] moritzm: ^^^ [13:27:11] paravoid: do you suggest we use the existing zk boxes? or you are saying for a non-analytics zk? [13:27:12] Logged the message, Master [13:27:46] I am a bit skeptical due to the ACL in analytics, but it would work [13:27:59] i think i would prefer a separate zk for non analytics things [13:28:05] why? [13:28:07] actually, hm, not sure [13:28:10] it's just that we have the same problem [13:28:11] what are we gonna do with zk? [13:28:19] nothing, just analytics [13:28:20] aren't those 3 boxes on the same rack [13:28:22] ? [13:28:26] they are not [13:28:29] row ? [13:28:33] oh, you just want to colocate etcd on those nodes? [13:28:34] nope [13:28:37] not use zk for something else opsy? [13:28:39] ottomata: yes [13:28:43] that's fine with me [13:28:54] but ja, acl. [13:29:02] hm, but acl just prevents analytics vlan from reaching out [13:29:05] anytihng can reach in [13:29:20] should move them out of analytics vlan I guess [13:29:22] does etcd iniitatate connections? [13:29:23] we'd want to move them to the non-analytics vlan [13:29:35] and probably rename them & format them to jessie in the process as well [13:29:41] HMmmM [13:30:02] so, guess so. remember though that kafka brokers 100% depend on zk [13:30:05] so busted zk means busted kafka [13:30:17] !log All Jenkins slaves are disconnected due to some ssh error. CI is down. [13:30:23] Logged the message, Master [13:30:26] not sure I'd want the brokers in a seprate vlan as the zks [13:30:31] well, firewalled off [13:30:32] busted etcd will also mean busted pybal/varnish ;) [13:30:37] heheh [13:30:39] should be fine with whitelists, but still [13:30:41] well maybe not [13:30:45] but still ;) [13:30:54] it shouldn't but we all know it will happen [13:31:11] ottomata: if routing between public & analytics vlans is broken, kafka won't have anything to process/transport anyway :) [13:31:19] why will it happen? [13:31:20] if all routing is broken [13:31:35] but it is already cumbersome enough to have to ask when we need to get a hole in the firewall [13:31:51] will be worse if we have some emergency thing where we need to set up a new broker or something [13:32:32] it's not _that_ big of a deal, come on :) [13:33:02] we have other dependencies between analytics & public too, e.g. dns servers [13:34:18] moritzm: the Jenkins ssh to slaves fails with: "Could not load host key: /etc/ssh/ssh_host_ed25519_key [13:34:23] moritzm: and .... fatal: no matching mac found: client hmac-sha1-96,hmac-sha1,hmac-md5-96,hmac-md5 server hmac-sha2-512,hmac-sha2-256 [13:35:08] that's related to the sshd config changes which mutante merged [13:35:21] both gallium and the target slave are Precise so maybe they are missing some of the algorithms ? [13:35:48] precise covered, let me check that on the systems [13:36:01] (precise is supported I meant) [13:36:14] our labs puppetmaster is dieing with OOM so the patch might not have been applied [13:38:06] paravoid: ja should be fine, maybe some annoyances but is mostly fine [13:38:46] ori: yt around? suppose it is early... 
[13:39:21] ottomata: he was in Europe and traveling yesterday so unlikely to show up I guess [13:42:22] (03PS1) 10Ottomata: Add diamond collector_module class [puppet] - 10https://gerrit.wikimedia.org/r/214053 [13:42:24] aye [13:42:24] k [13:44:33] (03PS21) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [13:45:16] 6operations, 10Continuous-Integration-Infrastructure: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314411 (10hashar) 3NEW [13:45:45] (03PS2) 10Ottomata: Add diamond collector_module class [puppet] - 10https://gerrit.wikimedia.org/r/214053 (https://phabricator.wikimedia.org/T83580) [13:49:54] 6operations, 10Continuous-Integration-Infrastructure: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314426 (10hashar) p:5Triage>3Unbreak! Being investigated with @MoritzMuehlenhoff [13:53:55] 6operations, 10Continuous-Integration-Infrastructure: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314445 (10hashar) QChris sent a change for Gerrit which is related. https://gerrit.wikimedia.org/r/#/c/213216/ Turn off sshd MAC and KEX... [13:57:24] can I get a +2 on this: https://gerrit.wikimedia.org/r/#/c/214014/ I'm about to do the phab update which I missed last week [13:59:00] 6operations, 10Continuous-Integration-Infrastructure: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314450 (10hashar) Applied on hiera page https://wikitech.wikimedia.org/wiki/Hiera:Integration "ssh::server::disable_nist_kex": false... [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150527T1400). Please do the needful. [14:00:58] paravoid: i want to build a kafkacat for trusty [14:01:00] 6operations, 10Continuous-Integration-Infrastructure: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314456 (10hashar) Gerrit had the same issue with T99990 [14:01:10] i'm pulling from edenhill and pushing to our operations/debs fork [14:02:09] looks like magnus has added a lot of cool features recently, and hasn't tagged in a while. i'm asking him if he's got plans for a new tag sometime soon, but, do we need to build from his tags? or can I build a wmf package from his master? [14:03:09] (03PS1) 10Hashar: Turn off sshd MAC and KEX hardening for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) [14:05:30] anyone able to merge my patch so that the current deployment window can proceed? [14:06:17] (03CR) 10Hashar: "Same for Jenkins slave https://gerrit.wikimedia.org/r/#/c/214055/ / T100509" [puppet] - 10https://gerrit.wikimedia.org/r/213216 (https://phabricator.wikimedia.org/T99990) (owner: 10QChris) [14:06:33] akosiaris: may be around since it'll be early for mutante, twentyafterfour [14:06:47] twentyafterfour, I can do that [14:07:00] or him :) [14:07:00] jynus: thanks! [14:07:17] (03PS2) 10Jcrespo: bump phabricator release tag. [puppet] - 10https://gerrit.wikimedia.org/r/214014 (owner: 1020after4) [14:07:17] https://gerrit.wikimedia.org/r/#/c/214014/ [14:07:27] yay! 
thank you [14:07:32] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1314484 (10yuvipanda) Also related: T99213 [14:07:53] (03CR) 10Hashar: "I first tested this on the wikitech Hiera:integration page then cherry picked this puppet patch and reverted the wiki edit. It is all goo" [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:08:00] not sure if jenkins will work right now... [14:08:09] (03CR) 10Rush: [C: 032] bump phabricator release tag. [puppet] - 10https://gerrit.wikimedia.org/r/214014 (owner: 1020after4) [14:08:33] (03CR) 10Rush: [V: 032] bump phabricator release tag. [puppet] - 10https://gerrit.wikimedia.org/r/214014 (owner: 1020after4) [14:08:52] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314486 (10hashar) p:5Unbreak!>3Normal Issue is fixed by cherry picked the puppet patch https://gerrit.wikimedia.o... [14:08:55] that does it.. [14:08:59] jynus: oops didn't see you around thanks man [14:09:00] Rush beat me to it [14:09:07] :-) [14:09:13] well back to getting coffee, I haven't tested this at all twentyafterfour so good speed :) [14:09:26] thanks [14:09:51] I tested the new stuff on phab-01.wmflabs last night [14:11:13] for the first time ever I think we'll be running a phabricator install that's more current than secure.phabricator.com (they recently stopped running HEAD, I've noticed) [14:11:56] no CI is scary (for me) even if I run checks locally [14:13:54] (03PS1) 10ArielGlenn: add salt minion recon and auth timing params for performance [puppet] - 10https://gerrit.wikimedia.org/r/214056 [14:14:09] jynus: you haven't seen labsdb yet :D [14:17:36] (03PS2) 10Hashar: Turn off sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) [14:17:49] YuviPanda, I have been told legends :-) [14:18:32] actually, I am lying, most of the db stuff has no tests [14:18:48] but at least we have 20 replicas [14:18:56] akosiaris: i'm sorta confused on a git-pbuilder thing [14:19:06] jynus: :) [14:19:32] jynus: also for more horrors - there's tools-db which was unpuppetized until recently and still completely untuned :) [14:19:43] (03CR) 10Hashar: "+deployment-prep which has instances attached as Jenkins slaves" [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:19:50] i shouldn't need to install build deps manually, should I/ [14:19:50] ? [14:19:53] on copper [14:20:26] ottomata: no you shouldn't [14:20:30] well, actually cleaning up the openstack db was one of my first assignments [14:20:32] ottomata: no, you shouldn't be doing anything as real root on copper, like installing deps or build deps [14:20:55] 90GB -> 2.5GB, no rows deleted [14:21:12] ottomata: let me make a guess. the clean step is failing, right ? [14:21:45] jynus: :D nice! [14:21:52] 6operations, 10ops-eqiad: Move cp1069 and cp1070 within the rack A5 - https://phabricator.wikimedia.org/T100516#1314533 (10Cmjohnson) 3NEW [14:22:05] (03CR) 10Muehlenhoff: [C: 031] "The comment is slightly misleading, this is not related to a deficiency in OpenJDK, but in jsch, the Java SSH implementation." 
[puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:22:44] moritzm: mind editing the commit summary message directly ? [14:23:11] ottomata: pdebuild will generate a pbuilder-satisfydepends fake package which installs the dependencies in the build chroot [14:23:29] moritzm: if jsch is a java lib, I guess it is embedded in the Jenkins build or we need to upgrade whatever .deb package ships it [14:23:41] morebots: tht's what I thought, but i'm getting [14:23:41] I am a logbot running on tools-exec-1203. [14:23:41] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:23:41] To log a message, type !log . [14:23:55] dpkg-checkbuilddeps: Unmet build dependencies: librdkafka-dev (>= 0.8.4) zlib1g-dev [14:24:00] dpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting [14:24:08] although, maybe that isn't killing it [14:24:12] also dpkg-source: error: can't build with source format '3.0 (quilt)': no upstream tarball found at ../kafkacat_1.1.0.orig.tar.{bz2,gz,lzma,xz} [14:24:14] ottomata: git-buildpackage ? or pdebuild ? [14:24:19] but i don't think I shoudl need that either, right? cause of cow? [14:24:21] hashar: ah, so it's using some upstream jar instead of what's packaged in Debian? that we cannot update it independently easily [14:24:21] akosiaris: [14:24:30] GIT_PBUILDER_AUTOCONF=no DIST=trusty WIKIMEDIA=yes git-buildpackage -us -uc --git-builder=git-pbuilder [14:24:45] cleaner=fakeroot debian/rules clean in debian/gbp.conf [14:24:54] under [DEFAULT] [14:25:02] moritzm: I have no idea to be honest. But I am pretty sure a java "good practice" is to embed all lib dependencies inside the jar. We are using Jenkins upstream jar which list no dependencies, so most probably the case. [14:25:07] ottomata: ^ [14:25:10] that should do it [14:26:31] hashar: indeed, just checked: the package consists mostly of a 66MB war file [14:28:17] akosiaris: that seems to be doing weird things to the local checkout [14:28:20] when i run builder [14:28:28] dh clean ... [14:28:33] make[1]: Leaving directory '/home/otto/kafkacat' [14:28:33] dh_clean [14:28:36] gbp:error: You have uncommitted changes in your source tree: [14:28:36] gbp:error: On branch debian [14:28:41] deleted: debian/files [14:28:41] deleted: debian/kafkacat.debhelper.log [14:28:41] deleted: debian/kafkacat.substvars [14:28:44] (03PS3) 10Muehlenhoff: Turn off sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:28:58] ottomata: yes, you need a clean debian branch [14:29:02] it was clean [14:29:17] ottomata: I think you commited things you shouldn't have committed [14:29:19] clearly not! :) [14:29:19] until i set cleaner in gbp.conf [14:29:21] lemme see [14:29:31] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1314546 (10ArielGlenn) All jobs do eventually return, as I found out by checking the job cache. So a first take will be about getting the response time down to something more reasonable. More than 1000 minions is apparently... [14:29:33] hm, i only committed the change to gbp.conf to debian branch [14:29:40] it may also be a matter of having the correct upstream/debian branch settings for gbp [14:29:41] it was clean until I ran the builder [14:29:45] hm [14:29:54] debian-branch = debian [14:29:57] upstream-tag = %(version)s [14:29:59] looking ok [14:30:09] this wrong? 
[14:30:09] [DEFAULT] [14:30:10] cleaner=fakeroot debian/rules clean [14:30:31] akosiaris: i'm working on copper in ~otto/kafkacat [14:30:56] e.g. --git-debian-branch=debian --git-upstream-branch=master --git-upstream-tree=branch [14:31:01] those sorts of settings, as applicable [14:31:26] it's pretty highly variable what the right values for those things are, depends on the situation [14:31:37] naw, want upstream-tree = tag [14:31:45] but the rest ya [14:31:57] does your version match upstream version? [14:32:06] yeah it does [14:32:10] I am looking right now [14:32:46] akosiaris: feel free to reset --hard [14:32:49] (03PS2) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [14:33:00] ottomata: I cloned into my homedir [14:35:02] ottomata: sigh, my guess was wrong. I led you in the wrong direction [14:35:12] -no-create-orig = True [14:35:12] +no-create-orig = False [14:35:34] this fixed the problem ottomata but it is now failing at another place [14:35:37] ha oook [14:36:29] ottomata: found it [14:36:35] doing this stuff via the cow/gbp stuff really forces you to clean things up with all the build settings. it's a nice side-bonus in the long run, but can be a pain for something that was just kinda-working before :) [14:36:37] Version: 0.8.3-1~precise1 [14:36:52] and you request 0.8.4 in debian/control [14:37:04] ottomata: so it can not fullfil the requirements in precise-wikimedia [14:37:26] 7Puppet, 6Phabricator, 5Patch-For-Review: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1314564 (10mmodell) [14:37:54] (03PS3) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [14:37:56] (03PS12) 10Yuvipanda: ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 [14:37:59] 6operations, 10Continuous-Integration-Infrastructure: Jenkins jar should ship with a more recent jsch java lib version to support hardened algorithm - https://phabricator.wikimedia.org/T100517#1314566 (10hashar) 3NEW [14:38:21] (03PS4) 10Hashar: Turn off sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) [14:38:31] (03PS5) 10Hashar: Turn off sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) [14:38:37] akosiaris: I'm trying to build for trusty [14:38:59] ok then, let's see [14:39:29] (03CR) 10Hashar: [C: 031 V: 032] "PS4: rewrapped commit message summary and pointed to T100517 which ask for Jenkins embedded jsch lib to be updated." [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:40:16] ottomata: ok, it is way better now [14:40:19] kafkacat.1: No such file or directory at /usr/bin/dh_installman line 130. [14:40:38] hmm, I think there's a fact for CPU core count but I can't find it [14:40:40] * YuviPanda keeps looking [14:40:51] YuviPanda: processorcount [14:40:55] ah! [14:40:59] I was grepping for 'cpu' and 'core' [14:41:42] thanks akosiaris [14:41:43] YuviPanda: there are even a couple of nices ones by bblack [14:41:45] hmm, that is a debhelper dep" [14:41:45] ? 
[14:41:46] physicalcorecount [14:41:55] and physicalprocessorcount [14:42:02] yeah the physicalcorecount one ignores HT duplication, regardless of the current BIOS setting [14:42:15] yeah, so I tried physicalcorecount [14:42:17] but it's not present in labs [14:42:22] physicalprocessorcount is [14:42:27] ottomata: yeah, debian/manpages [14:42:34] it tries to install a manpage that does not exist [14:42:43] but I'm trying to set uwsgi process count so I guess processorcount is what I should use [14:42:45] it's defined in our modules/base [14:42:52] ottomata: either set that file correctly or remove it [14:42:52] root@ores-sigh:/var/lib/git/operations/puppet# facter | grep physical [14:42:55] physicalprocessorcount => 2 [14:43:09] should I file a bug, bblack ? [14:43:22] (03PS4) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [14:43:24] (03PS13) 10Yuvipanda: ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 [14:43:27] YuviPanda: physicalcorecount => 4 [14:43:27] physicalprocessorcount => 1 [14:43:30] labs as well [14:43:35] I have no idea why it's not present in labs. whatever makes it not so is not something I regularly deal with :) [14:43:36] haha [14:43:37] wat [14:43:50] akosiaris: which instance? [14:43:54] also distro? [14:43:55] oh [14:43:56] * YuviPanda is on trusty [14:43:58] use "facter -p" ? [14:44:07] facter -p|grep phys [14:44:11] oh [14:44:13] * YuviPanda tries [14:44:17] YuviPanda: what bblack said [14:44:17] ah [14:44:19] I see it now [14:44:34] what does -p do? the man page isn't very useful [14:44:36] the -p brings in puppet-level facts, including our custom-defined stuff [14:44:38] oh [14:44:39] right [14:44:42] raw facter is just what's built into facter, not puppet [14:44:44] fair enough. thanks :) [14:45:14] another wonderful gotcha: you have to do "facter -p" as root, or it quietly doesn't work right :) [14:45:18] hmm, akosiaris, it looks like that file is generated by the makefile [14:45:24] :D [14:45:27] hmm, maybe [14:45:48] naw maybe not [14:46:07] pshh, ok no man [14:47:04] ah ya it is in master [14:47:04] hm [14:47:13] !log powering down cp1070 to relocate within the same rack [14:47:20] Logged the message, Master [14:49:08] 6operations, 10Continuous-Integration-Infrastructure: Jenkins jar should ship with a more recent jsch java lib version to support hardened algorithm - https://phabricator.wikimedia.org/T100517#1314604 (10hashar) [14:50:20] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314605 (10hashar) 5Open>3Resolved a:3hashar I have filled follow up tasks (T100517 and T100518). Puppet patch... [14:54:01] (03CR) 10Muehlenhoff: [C: 032] "Good to go." [puppet] - 10https://gerrit.wikimedia.org/r/214055 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:54:10] \O/ [14:54:19] moritzm: so the Jenkins failure was unrelated to the java upgrade [14:54:31] just that it hadn't been rebooted after the ssh algorithms change [14:54:45] and it is all good now :-} Thanks for the support! 
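To spell out the facter gotchas above: custom facts shipped with the puppet tree (physicalcorecount, physicalprocessorcount) only resolve with facter -p, and only reliably as root, while processorcount is a stock fact. A quick check, plus an illustrative fragment for the uwsgi worker sizing YuviPanda mentions; the template fragment and option are placeholders, not the actual ores module code.

```
# stock vs puppet-provided facts; -p (and root) needed for the custom ones
sudo facter -p processorcount physicalcorecount physicalprocessorcount
```

```
# illustrative uwsgi.ini.erb fragment, sizing the worker pool from the fact
[uwsgi]
processes = <%= @processorcount %>
```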
[14:54:57] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:26] RECOVERY - Host cp1070 is UPING WARNING - Packet loss = 61%, RTA = 0.72 ms [14:58:17] 6operations, 6Phabricator: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314620 (10mmodell) 3NEW [14:58:28] akosiaris: i think this is package related, not builder related, right? [14:58:29] Makefile:1: Makefile.config: No such file or directory [14:58:29] make[1]: *** No rule to make target 'Makefile.config'. Stop. [14:58:29] make[1]: Leaving directory '/home/otto/build-area/kafkacat-1.1.0' [14:58:29] dh_auto_clean: make -j1 distclean returned exit code 2 [14:58:29] debian/rules:4: recipe for target 'clean' failed [14:58:39] not sure why it would need to make that target during clean though [14:59:07] 6operations, 6Phabricator: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314627 (10mmodell) @greg: I guess we need you to approve $ for the certificate? [14:59:22] 6operations, 6Phabricator: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314629 (10chasemp) p:5Triage>3Normal [14:59:36] looks like that file is generated by configure...should the builder run configure? [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150527T1500). [15:00:28] ottomata: yes, building the package requires running configure [15:00:40] well, if it exists, which it does in this case [15:00:43] hm, yeah looks likea makefile problem, the Makefile is trying to include that [15:00:47] looks like an empty SWAT today [15:00:50] 6operations, 6Phabricator: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314633 (10mark) Can you elaborate a little on why that is necessary? [15:00:52] Makefile.config [15:00:55] but that won't exist til after configure [15:01:02] and the cleaner is using the Makefile to clean [15:01:15] should I tell it not to clean? [15:01:41] that probably means it has nothing to do with the package but rather the software [15:01:51] yes, well, the Makefiles yes [15:01:52] if it's unable to clean anyway [15:01:59] but, it seems like it can't clean because Makefile.config doesn't exist [15:01:59] it's a software problem [15:02:07] but Makefile.config wont' exist until configure is run [15:02:54] !log powering down cp1069 to relocate within the same rack [15:03:02] Logged the message, Master [15:03:11] ottomata: it should be failing though [15:03:25] I see include mklove/Makefile.base [15:03:31] in Makefile [15:03:47] and the -include $(TOPDIR)/Makefile.config [15:03:59] in that mklove/Makefile.base [15:05:23] rm -f kafkacat.o kafkacat.d [15:05:25] rm -f kafkacat [15:05:26] it works correctly [15:05:26] ? [15:05:48] what are the steps to reproduce that error you are seeing ? [15:05:56] hm, oh, I merged master into debian because I was getting mismatched source [15:06:05] i'm trying to build from master now, with a new tag [15:06:11] hmm, lemme start over [15:08:51] 6operations, 6Phabricator: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314653 (10mmodell) @mark: 1. The primary reason: we need to expose ssh for git repository hosting. If phabricator is going to replace gerrit then it'll definitely need ssh for feature parity. Maybe there... 
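The clean failure above ("Makefile:1: Makefile.config: No such file or directory") is the classic symptom of a Makefile that needs a file only generated by ./configure. A generic way to keep the clean target usable on a pristine tree is sketched below; note this is only the general workaround, and in this particular case the real culprit turned out to be the mismatched source tree from merging master into the debian branch, as the conversation resolves shortly after.

```
# debian/rules fragment - generic workaround sketch, not the actual kafkacat packaging
# (recipe lines must be tab-indented)
override_dh_auto_clean:
	# Makefile.config only exists after ./configure has run, so distclean can
	# fail on a fresh checkout; the leading "-" tells make to ignore that error.
	-$(MAKE) distclean
```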
[15:08:58] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.994 second response time [15:10:04] meh, i think it works akosiaris, i'll stick with the somewhat old 1.1.0 tag, and try to get magnus to mkae a new release [15:10:06] thanks for you rhelp [15:10:26] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:10:37] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:11:01] ottomata: you 're welcome [15:12:09] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1314662 (10Dzahn) >>! In T85141#1314263, @MZMcBride wrote: > I think static-bugzilla and old-bugzilla are sufficient. The plan is to remove old-bugzilla which was blocked by providin... [15:12:16] moritzm: should I merge https://gerrit.wikimedia.org/r/#/c/214055/ on palladium ? [15:12:17] 6operations, 6Phabricator, 10Traffic: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314663 (10mark) [15:12:36] akosiaris: yeah I got it deployed on the relevant labs puppetmaster [15:12:47] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:12:55] seems it has zero prod impact since it only touch labs hiera [15:14:06] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:14:08] hashar moritzm, ok merged on palladium as well [15:15:09] 6operations, 6Phabricator, 10Traffic: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314667 (10mark) I'm not too worried about 2). 1) is a bit trickier, and I believe we've already run into that before for... gerrit I think? Let's first see if we can make the existing setup... [15:15:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:17:28] akosiaris: merci! [15:21:08] 6operations, 6Phabricator, 10Traffic: Move phab out from behind misc-web - https://phabricator.wikimedia.org/T100519#1314675 (10mmodell) @mark: I have no objection to solving it in a different way. I'll update the title and we can hash it out. Yes gerrit is the prior-art in this domain for sure. [15:22:10] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314677 (10mmodell) [15:23:32] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314678 (10chasemp) >>! In T100519#1314667, @mark wrote: > I'm not too worried about 2). 1) is a bit trickier, and I believe we've already run into that before... [15:24:29] (03PS1) 10Jcrespo: Depool pc1002 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214066 [15:28:19] (03PS1) 10Ottomata: Install kafkacat on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/214068 [15:28:47] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314696 (10mark) It turns out that we didn't move gerrit to behind misc-web, because 1) turned out to be difficult. And we'll have the same problem now as well... 
[15:28:47] (03PS2) 10Ottomata: Install kafkacat on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/214068 (https://phabricator.wikimedia.org/T97771) [15:29:19] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314702 (10mmodell) [15:29:35] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314620 (10mmodell) [15:30:00] (03CR) 10Ottomata: [C: 032] Install kafkacat on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/214068 (https://phabricator.wikimedia.org/T97771) (owner: 10Ottomata) [15:31:13] (03CR) 10Jcrespo: [C: 032] Depool pc1002 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214066 (owner: 10Jcrespo) [15:32:24] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314717 (10mmodell) We will need to scale phabricator eventually, so yes this is an issue. One thing to consider: Phabricator can host the web interface on a... [15:33:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [15:33:11] !log jynus Synchronized wmf-config/db-eqiad.php: depool pc1002 (duration: 00m 13s) [15:33:16] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314721 (10chasemp) >>! In T100519#1314696, @mark wrote: > It turns out that we didn't move gerrit to behind misc-web, because 1) turned out to be difficult. A... [15:33:17] Logged the message, Master [15:33:52] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314722 (10chasemp) >>! In T100519#1314717, @mmodell wrote: >One thing to consider: Phabricator can host the web interface on a separate server from the git ho... [15:34:47] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet has 1 failures [15:35:07] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314723 (10mmodell) @chasemp: as far as I know it's fully supported by upstream, and almost certainly facebook does it that way. There might be some extra setu... [15:36:00] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314724 (10mark) >>! In T100519#1314721, @chasemp wrote: >>>! In T100519#1314696, @mark wrote: >> It turns out that we didn't move gerrit to behind misc-web, b... [15:36:36] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:39:37] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314729 (10mmodell) If hardware goes down, phabricator can be brought back up by puppet rather quickly. Downtime is bad but it wouldn't take long to spin up a... 
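For context, the "Install kafkacat on analytics clients" change being merged above boils down to a package resource; a minimal sketch follows, with an illustrative class name since the actual patch 214068 may structure it differently.

```
# minimal sketch, not necessarily how change 214068 is written
class analytics::kafkacat {
    package { 'kafkacat':
        ensure => present,
    }
}
```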
[15:39:51] (03PS1) 10KartikMistry: CX: Add languages for deployment on 20150528 [puppet] - 10https://gerrit.wikimedia.org/r/214071 (https://phabricator.wikimedia.org/T99535) [15:43:41] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314760 (10chasemp) > ...until it goes down for a hardware failure, which might quickly change that. :) Do we have an existing ticket describing the difficult... [15:45:09] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314762 (10chasemp) >>! In T100519#1314729, @mmodell wrote: > If hardware goes down, phabricator can be brought back up by puppet rather quickly. Downtime is b... [15:45:33] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314763 (10mmodell) some commentary from epriestley [[ http://www.quora.com/What-were-some-of-Evan-Priestleys-biggest-challenges-in-creating-Phabricator-from-a... [15:45:56] 6operations, 10Analytics-Cluster, 10procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1314768 (10kevinator) [15:47:33] (03PS5) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [15:47:35] (03PS14) 10Yuvipanda: ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 [15:48:03] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314780 (10mmodell) >>! In T100519#1314762, @chasemp wrote: > > Agreed with two qualifiers. We haven't actualy done it, and diffusion does not survive this p... [15:48:40] (03PS1) 10Jcrespo: Updating pc1002 to MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/214073 [15:48:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:50:15] (03CR) 10Jcrespo: [C: 032] Updating pc1002 to MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/214073 (owner: 10Jcrespo) [15:51:11] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314783 (10chasemp) agreed, achievable just not worked out, my plan was always to sort it as we transition the repos from mirrored to hosted (since that's when... [15:52:57] RECOVERY - Host analytics1036 is UPING OK - Packet loss = 0%, RTA = 1.17 ms [15:53:42] 6operations, 10ops-eqiad: Move cp1069 and cp1070 within the rack A5 - https://phabricator.wikimedia.org/T100516#1314795 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Task Complete [15:54:50] 6operations, 10ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1314806 (10Cmjohnson) The disk array has been racked. Ariel and I will add to dataset1001 on 5/28 [15:56:18] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1314808 (10Cmjohnson) I removed the interface for ge-2/0/37 and attached an1037 to asw-3/0/47. 
I verified in the correct vlan ge-3/0/47 up up analytics1036-temp [15:56:18] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [15:57:07] PROBLEM - Hadoop NodeManager on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:27] PROBLEM - configured eth on analytics1036 is CRITICAL: eth1 reporting no carrier. [15:58:45] 6operations, 10ops-eqiad, 5Patch-For-Review: db1054 MCE errors logged for CPU temperature - https://phabricator.wikimedia.org/T89801#1314816 (10Cmjohnson) 5Open>3Resolved It has been nearly 3 weeks since applying more thermal paste. The errors have not returned. Resolving this ticket [15:59:16] (03CR) 10Andrew Bogott: [C: 032] Feed the puppet host IP directly to dnsmasq. [puppet] - 10https://gerrit.wikimedia.org/r/213629 (owner: 10Andrew Bogott) [16:00:33] bios booting: 5minutes. OS boot with SSD/RAID: 2 seconds. [16:01:37] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:57] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [16:09:08] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:33] 6operations, 7database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1314833 (10Krenair) 3NEW [16:13:54] (03PS1) 10Andrew Bogott: Revert "Feed the puppet host IP directly to dnsmasq." [puppet] - 10https://gerrit.wikimedia.org/r/214077 [16:13:56] RECOVERY - Host analytics1036 is UPING OK - Packet loss = 0%, RTA = 1.17 ms [16:14:17] RECOVERY - DPKG on rhodium is OK: All packages OK [16:15:21] (03PS1) 10Andrew Bogott: Alias labs-puppetmaster.wikimedia.org to virt1000 [dns] - 10https://gerrit.wikimedia.org/r/214079 [16:15:30] (03CR) 10jenkins-bot: [V: 04-1] Alias labs-puppetmaster.wikimedia.org to virt1000 [dns] - 10https://gerrit.wikimedia.org/r/214079 (owner: 10Andrew Bogott) [16:16:33] (03PS2) 10Andrew Bogott: Alias labs-puppetmaster.wikimedia.org to virt1000 [dns] - 10https://gerrit.wikimedia.org/r/214079 [16:22:30] PROBLEM - Host analytics1036 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:35] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1314869 (10Ottomata) [16:22:38] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Backport? and install kafkacat (on stat1002?) - https://phabricator.wikimedia.org/T97771#1314868 (10Ottomata) 5Open>3Resolved [16:30:33] (03PS1) 10Alexandros Kosiaris: qualify hadoop erb variables [puppet] - 10https://gerrit.wikimedia.org/r/214081 [16:32:40] PROBLEM - ganeti-noded running on ganeti1003 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 0 (root), command name ganeti-noded [16:33:57] (03PS3) 10Andrew Bogott: Alias labs-puppetmaster-eqiad.wikimedia.org to virt1000 [dns] - 10https://gerrit.wikimedia.org/r/214079 [16:34:25] (03CR) 10Ottomata: [C: 031] qualify hadoop erb variables [puppet] - 10https://gerrit.wikimedia.org/r/214081 (owner: 10Alexandros Kosiaris) [16:35:27] (03PS5) 10Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/213543 [16:37:19] I'm going start branching for 1.26wmf8 shortly, any last minute merges? [16:42:02] no takers? 
ok branching [16:45:40] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [16:54:54] (03PS6) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [16:54:56] (03PS15) 10Yuvipanda: ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 [16:57:00] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [17:02:39] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:07:24] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1314970 (10faidon) So, this worked. Moving it back to its original port made it fail again. I also tried to remove xe-7/0/31 from ae2 (one of the four members of the uplink to cr2-eqiad). This made ARP wor... [17:11:32] (03PS1) 10Andrew Bogott: Tidy up firewall rules for puppetmaster and salt [puppet] - 10https://gerrit.wikimedia.org/r/214085 [17:13:50] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:20:00] RECOVERY - ganeti-noded running on ganeti1003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [17:24:44] (03CR) 10Andrew Bogott: "yuvi, adding you in case there are other monitoring things that I don't know about that hit puppet on virt1000" [puppet] - 10https://gerrit.wikimedia.org/r/214085 (owner: 10Andrew Bogott) [17:24:47] (03PS7) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [17:24:55] (03CR) 10Rush: [C: 031] "makes sense to pull UID from the system def to me" [puppet] - 10https://gerrit.wikimedia.org/r/214030 (owner: 10Faidon Liambotis) [17:25:19] (03CR) 10Andrew Bogott: [C: 032] Alias labs-puppetmaster-eqiad.wikimedia.org to virt1000 [dns] - 10https://gerrit.wikimedia.org/r/214079 (owner: 10Andrew Bogott) [17:27:10] (03PS1) 10Gage: puppetmaster::autosigner: fix doc generation for class [puppet] - 10https://gerrit.wikimedia.org/r/214087 [17:27:37] (03CR) 10Gage: [C: 032 V: 032] puppetmaster::autosigner: fix doc generation for class [puppet] - 10https://gerrit.wikimedia.org/r/214087 (owner: 10Gage) [17:28:37] (03CR) 10Rush: [C: 032] admin: use $LAST_SYSTEM_UID in enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/214030 (owner: 10Faidon Liambotis) [17:29:03] akosiaris: thanks for merging, was afk [17:29:18] (03CR) 10Rush: [V: 032] admin: use $LAST_SYSTEM_UID in enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/214030 (owner: 10Faidon Liambotis) [17:41:06] !log initiating controlled shutdown of kafka broker analytics1018 in anticipation of switch reboot [17:41:14] Logged the message, Master [17:42:06] !log rebooting asw-d2-eqiad [17:42:11] Logged the message, Master [17:45:01] PROBLEM - Host analytics1035 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:40] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:41] PROBLEM - Host analytics1025 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:41] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:41] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:41] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:50] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:10] RECOVERY - Host analytics1018 is UPING WARNING - Packet loss = 93%, RTA = 1.75 ms [17:48:21] RECOVERY - 
Host analytics1037 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [17:48:21] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [17:48:21] RECOVERY - Host analytics1035 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [17:48:21] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:48:29] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 1.90 ms [17:48:29] RECOVERY - Host analytics1036 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [17:48:36] success [17:48:39] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 18.0 [17:48:40] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:48:49] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [17:48:49] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:49:10] PROBLEM - Disk space on analytics1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:49:33] uhhhhh [17:49:41] ottomata: I'm getting paged about an1001 [17:49:42] 1001? [17:49:45] uhh [17:49:50] ya me too [17:49:51] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:49:57] hi [17:50:00] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 34.0 [17:50:17] (03PS1) 10Jcrespo: Repool pc1002 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214093 [17:50:19] ottomata: it wasn't there and we didn't lose connectivity to it, it's something higher up the stack [17:50:29] i think journalnode quorum. [17:50:30] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 29.0 [17:50:40] 1028 was one out already [17:50:43] it's back now though? [17:50:47] checking [17:51:07] should be [17:51:18] ottomata: ah i overlooked journalnodes in that ticket because they're not mentioned in https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware :\ [17:51:28] 1019 was there [17:51:31] and 1028 is out [17:51:36] but 1011 should be still there [17:51:43] i guess it didn't like operating with only 1 journalnode? [17:52:01] PROBLEM - Kafka Broker Server on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [17:52:09] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:52:12] (03CR) 10Jcrespo: [C: 032] Repool pc1002 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214093 (owner: 10Jcrespo) [17:52:25] think puppet just restarted it [17:52:32] unless you did jgage?
[17:52:33] i was about to [17:52:35] 6operations, 5Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1315059 (10akosiaris) [17:52:35] nope [17:52:35] looking ok [17:52:47] might have caused jobs to fail ooof [17:52:51] will be ok though [17:52:52] 1018 now paged [17:52:55] 1001 is OK [17:53:01] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 8 failures [17:53:01] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [17:53:21] 1018 coming back, i think it died because it couldn't talk to anything [17:53:41] this is good [17:53:45] this was a short downtime [17:53:50] the unplanned one with a much longer one :) [17:53:59] RECOVERY - Kafka Broker Server on analytics1018 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [17:54:00] so, time to figure out everything that went wrong ;) [17:54:21] ottomata, why was 1028 out? [17:54:27] that's the one that exploded [17:54:29] robh: ping? [17:54:33] doh [17:54:36] still waiting on hw [17:54:37] ori: he's out this morning [17:54:41] oh yeah [17:54:43] ah [17:54:49] can no one else invite to this channel, really? [17:54:56] which channel? [17:54:58] wrong channel, ori [17:55:08] so, 1018 is fine, we should have scheduled kafka downtime so it wouldn't page, and just stopped the broker service [17:55:08] oh, nevermind [17:55:12] i just made it not a leader for anything [17:55:13] !log jynus Synchronized wmf-config/db-eqiad.php: repool pc1002 (duration: 00m 13s) [17:55:17] so, it will be fine [17:55:24] Logged the message, Master [17:55:27] that channel [17:55:35] the journalnode stuff...i should have checked to make sure which journalnodes would have been unreachable. [17:55:41] if 1028 was online, this would not have happened [17:55:43] but still, 1011 was. [17:55:52] and i didn't think that namenode would shut down without 1011 [17:55:54] sorry [17:55:57] with only 1 journalnode [17:56:07] exception is that it is shutting down because journalnodes don't have a quorum [17:56:20] java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. [17:56:50] RECOVERY - Disk space on stat1002 is OK: DISK OK [17:57:31] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1315084 (10faidon) 5Open>3Resolved a:3faidon I rebooted asw-d2 (and only 2) after coordinating with @ottomata — it seemed to have worked. Root cause is still unknown, this could be a hardware fault s... [17:57:39] RECOVERY - Disk space on analytics1027 is OK: DISK OK [17:57:50] RECOVERY - Disk space on analytics1026 is OK: DISK OK [17:57:56] jobs are looking fine too! [17:57:59] PROBLEM - puppet last run on analytics1026 is CRITICAL Puppet has 1 failures [17:58:05] good ol' HA ResourceManager! [17:58:06] lookin' [17:58:29] ottomata: awesome [17:59:28] COOL [17:59:31] 1036 back in the cluster! [17:59:35] paravoid: thanks! [17:59:39] :) [17:59:51] this was a nice one [18:00:04] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150527T1800). [18:00:19] ja! only the unexpected namenode shutdown [18:00:21] RECOVERY - Hadoop NodeManager on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:00:23] and some extra pages [18:00:42] the issue I meant :) [18:00:53] the switch thing?
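Background on why losing two of the three journalnodes stopped the NameNode: with the quorum journal manager, the active NameNode has to get every edit acknowledged by a majority of the journalnodes listed in dfs.namenode.shared.edits.dir, and it aborts when it cannot, which is exactly the 20000ms quorum timeout quoted above. Sketch with placeholder hostnames and cluster id, not the actual analytics config:

```
<!-- hdfs-site.xml sketch, placeholder hostnames and journal id -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <!-- three journalnodes: at least two must be reachable, or the
       active NameNode shuts itself down rather than risk diverging edits -->
  <value>qjournal://jn1.example:8485;jn2.example:8485;jn3.example:8485/mycluster</value>
</property>
```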
[18:00:59] yeah [18:01:01] no idea of the actual problem, right? [18:01:09] PROBLEM - puppet last run on stat1002 is CRITICAL Puppet has 1 failures [18:01:22] not really, no [18:01:30] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet has 1 failures [18:01:33] weird [18:01:41] dunno what is up with those two, checking [18:01:43] shoudl be unrelated [18:01:50] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:02:02] (03CR) 10Ori.livneh: [C: 031] ve: switch jsbench/chromium user/groups to system [puppet] - 10https://gerrit.wikimedia.org/r/214029 (owner: 10Faidon Liambotis) [18:02:13] PROBLEM - puppet last run on analytics1001 is CRITICAL Puppet has 1 failures [18:02:45] (03PS2) 10Faidon Liambotis: Add system => true to mysql_wmf::mysqluser [puppet] - 10https://gerrit.wikimedia.org/r/214027 [18:02:55] (03CR) 10Faidon Liambotis: [C: 032] Add system => true to mysql_wmf::mysqluser [puppet] - 10https://gerrit.wikimedia.org/r/214027 (owner: 10Faidon Liambotis) [18:03:10] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:03:12] ottomata: i've updated this to show journalnodes: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware#Roles [18:03:36] (03PS2) 10Faidon Liambotis: Remove txstatsd module and role class [puppet] - 10https://gerrit.wikimedia.org/r/214028 [18:04:01] (03CR) 10Faidon Liambotis: [C: 032] Remove txstatsd module and role class [puppet] - 10https://gerrit.wikimedia.org/r/214028 (owner: 10Faidon Liambotis) [18:04:37] (03PS2) 10Faidon Liambotis: ve: switch jsbench/chromium user/groups to system [puppet] - 10https://gerrit.wikimedia.org/r/214029 [18:04:39] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:47] (03CR) 10Faidon Liambotis: [C: 032] ve: switch jsbench/chromium user/groups to system [puppet] - 10https://gerrit.wikimedia.org/r/214029 (owner: 10Faidon Liambotis) [18:05:20] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:05:37] (03PS2) 10Faidon Liambotis: admin: use $LAST_SYSTEM_UID in enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/214030 [18:08:29] RECOVERY - puppet last run on analytics1026 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:10:46] ottomata: so, you were saying about kafkacat? 
[18:11:00] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [18:11:30] PROBLEM - Kafka Broker Messages In on analytics1018 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [18:11:38] oh paravoid, i already built a 1.1.0 version from the tag for trusty [18:11:45] so, i think that's fine [18:12:01] but, i was asking about building from master [18:12:06] since it has some newer fancy things [18:12:17] like, output formatting [18:12:24] but, dunno, i don't really need them :) [18:12:29] RECOVERY - puppet last run on analytics1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:12:30] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [18:12:45] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1315119 (10BBlack) Issues with phab backend server scaling and reliability aside, I'm still a fan of moving all our service traffic through our standard outer... [18:12:48] !log Uploaded gridengine_6.2u5-4+wmf2 for precise-wikimedia to apt.wikimedia.org [18:12:54] Logged the message, Master [18:12:59] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [18:15:10] RECOVERY - Kafka Broker Messages In on analytics1018 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7099.42375539 [18:16:29] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1315133 (10Dbrant) @dr0ptp4kt approve plz? (as the product owner of the Android app, I need to use eventlogging data to chart performance and user engagement) [18:16:39] PROBLEM - puppet last run on hooft is CRITICAL Puppet has 1 failures [18:18:10] (03PS6) 10Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/213543 [18:18:12] (03PS2) 10Andrew Bogott: Remove redundant or defunct ldap servers from the ldap list. [puppet] - 10https://gerrit.wikimedia.org/r/213542 [18:18:58] 6operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1315145 (10jcrespo) pc100 1 & 2 upgraded. 3 is left. I am geting slightly lower QPS in 1. Need to investigate and check more thoroughly the performance. 
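For anyone following along, typical kafkacat usage looks like the following (broker and topic are placeholders); the output formatting mentioned above is the -f format string, which only exists in builds newer than the 1.1.0 tag that was packaged here:

```
# consume the last 10 messages from a topic and exit
kafkacat -C -b BROKER:9092 -t some_topic -o -10 -e

# list brokers, topics and partitions known to the cluster
kafkacat -L -b BROKER:9092

# newer (post-1.1.0) builds add -f output formatting, e.g.
kafkacat -C -b BROKER:9092 -t some_topic -f 'partition %p offset %o: %s\n'
```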
[18:19:49] (03PS2) 10Dzahn: Add my new key now that I'm back in office [puppet] - 10https://gerrit.wikimedia.org/r/210716 (owner: 10MaxSem) [18:20:54] (03CR) 10Dzahn: [C: 032] "confirmed in person at office" [puppet] - 10https://gerrit.wikimedia.org/r/210716 (owner: 10MaxSem) [18:22:50] PROBLEM - NTP on analytics1036 is CRITICAL: NTP CRITICAL: Offset unknown [18:26:08] (03PS16) 10Yuvipanda: ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 [18:26:22] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Initial module, with web class / role [puppet] - 10https://gerrit.wikimedia.org/r/213354 (owner: 10Yuvipanda) [18:26:30] (03PS8) 10Yuvipanda: ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 [18:26:38] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Add simple nginx load balancer [puppet] - 10https://gerrit.wikimedia.org/r/214060 (owner: 10Yuvipanda) [18:29:09] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1315177 (10Magioladitis) Is this fixed? [18:30:56] 6operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#1315180 (10BBlack) [18:31:04] 6operations, 10Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1315182 (10BBlack) [18:31:06] 6operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#890482 (10BBlack) [18:33:50] PROBLEM - puppet last run on analytics1026 is CRITICAL Puppet has 1 failures [18:34:45] (03PS1) 10BBlack: remove v6 token puppetization from cp1008 for testing [puppet] - 10https://gerrit.wikimedia.org/r/214100 [18:35:15] (03CR) 10BBlack: [C: 032 V: 032] remove v6 token puppetization from cp1008 for testing [puppet] - 10https://gerrit.wikimedia.org/r/214100 (owner: 10BBlack) [18:39:42] (03PS1) 10Andrew Bogott: Labcontrol1001 will be the new virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/214102 [18:40:48] good night! [18:43:41] hm, I just ran a puppet compiler job on a patch that I /know/ has changes, and it says ‘no changes’ [18:44:44] how do you know? :) [18:44:55] (03PS1) 10Alexandros Kosiaris: ganeti: ferm::service for DRBD ports [puppet] - 10https://gerrit.wikimedia.org/r/214104 [18:45:12] andrewbogott: job 772? [18:45:19] yes [18:45:35] it didn't output no changes, it failed [18:45:37] am I misreading the output? I don’t use this tool much. [18:45:46] [ 05/27/2015 18:42:27 ] INFO: Nodes: 0 OK 0 DIFF 1 FAIL [18:46:10] Hm, big green dot means ‘fail’? [18:46:25] I guess, wherever that is [18:46:38] but the point is, you click into the final link in the console output to see whatever really happened [18:46:50] it's failing to even run the prod branch from before your change on that node, because: [18:46:53] Error: Must pass enable to Class[Base::Remote_syslog] at /opt/wmf/software/compare-puppet-catalogs/external/puppet/modules/base/manifests/init.pp:63 on node virt1000.wikimedia.org [18:46:58] hm, the ‘before’ fails too [18:46:59] yes, as you said [18:47:11] which has something to do with hiera+catalog-compiler being broken wrt to each other, it's been that way for a while now and blocking all kinds of testing [18:47:18] crap [18:47:26] is there a ticket? [18:47:49] https://phabricator.wikimedia.org/T96802 [18:48:00] andrewbogott: which change ? 
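"Must pass enable to Class[Base::Remote_syslog]" is the standard Puppet 3 error for a class parameter that has no default and received no value from hiera data binding, which is why it points at the compiler's broken hiera setup rather than at the patch being tested. Illustrative shape only, not the real base::remote_syslog code:

```
# illustrative only - not the actual base::remote_syslog class
class base::remote_syslog (
    $enable,    # no default: must be bound via hiera or an explicit declaration
) {
    if $enable {
        # ... configure rsyslog forwarding ...
    }
}
```

When the catalog compiler cannot resolve hiera the way the production puppetmasters do, that binding fails and compilation aborts with this message for whichever such class it hits first.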
[18:48:03] my last update there a few weeks ago talks about the same failure [18:48:20] https://gerrit.wikimedia.org/r/214102 this one I suppose ? [18:49:00] 6operations: puppet-compiler has strange problems with some facts - https://phabricator.wikimedia.org/T96802#1315264 (10BBlack) We're also now seeing: ``` Error: Must pass enable to Class[Base::Remote_syslog] at /opt/wmf/software/compare-puppet-catalogs/external/puppet/modules/base/manifests/init.pp:63 on node... [18:49:25] akosiaris: no, https://gerrit.wikimedia.org/r/#/c/213543/ [18:49:25] 6operations: puppet-compiler has strange problems with some facts and/or hiera - https://phabricator.wikimedia.org/T96802#1315270 (10BBlack) [18:50:33] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: ferm::service for DRBD ports [puppet] - 10https://gerrit.wikimedia.org/r/214104 (owner: 10Alexandros Kosiaris) [18:50:48] (03PS7) 10Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/213543 [18:50:50] (03PS1) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [18:52:01] (03PS2) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [18:52:10] akosiaris: looks like the puppet compiler is just broken, as per bblack’s phab task. [18:52:50] (03CR) 10Andrew Bogott: [C: 032] Remove redundant or defunct ldap servers from the ldap list. [puppet] - 10https://gerrit.wikimedia.org/r/213542 (owner: 10Andrew Bogott) [18:53:21] Unknown function ip_resolve [18:53:27] andrewbogott: yeah, I concur [18:53:45] Hey, where is geoip.inc.vcl? http://git.wikimedia.org/tree/operations%2Fpuppet%2Fvarnish.git/e6b6bae27f5576ebe9a8f11b3afbb425635f8552/templates [18:54:11] awight: even asking that question means you're looking at doing something scary heh [18:54:15] It's rendered here, http://git.wikimedia.org/blob/operations%2Fpuppet%2Fvarnish.git/e6b6bae27f5576ebe9a8f11b3afbb425635f8552/manifests%2Fcommon%2Fvcl.pp#L4 [18:54:18] hehe seriously [18:54:19] but it's in templates/varnish/ [18:54:20] horrorshow [18:54:35] bblack: different repo? [18:54:38] nope [18:54:53] yes [18:54:54] maybe the .erb at the end is tripping you up? [18:55:01] bblack: that the puppet/varnish repo [18:55:04] oh! [18:55:08] ooh I see, thx [18:55:10] we stopped using that repo a long time ago [18:55:17] lol ok [18:55:18] don't look at it at all, it's disabled for commits [18:55:24] we should delete that one [18:55:36] and I am guessing it's on github as well [18:55:48] yeah probably. I went with just disabling commits at the time when I un-submoduled it, because I was acting unilateraly with some malice. [18:55:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [18:55:55] (03PS1) 10Ori.livneh: Fixes for ori's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/214106 [18:56:03] https://github.com/wikimedia/operations-puppet-varnish [18:56:16] ok, removing it from both github and gitblit [18:56:19] OK I'm on the right planet now: http://git.wikimedia.org/tree/operations%2Fpuppet.git [18:56:22] but is it still on gerrit ? [18:56:32] yeah it's still in gerrit too, with commits disabled [18:56:53] should I delete it ? [18:57:05] yes! 
[18:57:05] it says "Deprecated, probably needs to be deleted" [18:57:07] ok [18:57:11] it's a deal [18:57:15] thanks :) [18:58:19] Awesome, found what I was looking for [18:58:39] at some point in near-term refactoring, I'm going to un-submodule the nginx one as well most likely. [18:59:16] (I'd really like to just unsubmodule them all, but some people are attached to their submodules :/) [19:00:22] (03PS3) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [19:00:24] (03PS8) 10Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/213543 [19:02:22] akosiaris: where did you see Unknown function ip_resolve? [19:02:32] (I fixed it, just wondering what test that was in) [19:04:03] andrewbogott: the older and never released puppet catalog compiler I 've created 2 years ago [19:04:34] akosiaris, hi, db works great, thanks! On kartotherian1, we have a user "kartotherian" to run all the services. Do you need to do anything to add this user to pgsql as readonly? [19:04:35] um… ok :) Want to run it on my new patch, if it works? [19:05:11] andrewbogott: btw, am not looking at your patch atm - gridengine just blew up, we're looking at that instead [19:05:21] I'll look after [19:05:23] andrewbogott: sure. changeId ? [19:05:28] yeah, that’s clearly the right choice :) [19:05:38] yurik: unfortunately yes I have. [19:05:44] akosiaris: 213543 [19:05:51] but I 'd like it to be users that exist in LDAP [19:05:56] akosiaris, is it quick, or should i create a ticket? [19:06:19] yurik: if it is an existing LDAP user, it's quick [19:06:27] if it's not, it's a ticket [19:06:58] and I am guessing it is not [19:07:05] akosiaris, i don't think we have a user like that - its a local kartotherian1.wmflabs service user [19:07:35] i am not sure what is the best way to run a wmflabs service wrt permissions [19:08:25] bblack: What's the reason to unsubmodule stuff? I was actually hoping to go in the direction of more submodules, in order to get fundraising development (vagrant/labs) and production puppet in sync. [19:08:27] akosiaris, another quick question - it seems we can get SSD budget fairly quickly, how difficult would it be to upgrade? [19:09:14] YuviPanda: mind helping with yurik's question. For running a wmflabs service querying a database specifically. How does it work now with MySQL databases. I 'd rather not deviate from that de facto standard for postgres [19:09:16] awight: if anything, submodules drive things further out of sync [19:09:30] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:09:30] yurik: last time, it took around 3 days from start to finish [19:09:39] akosiaris: I can probably tomorrow, right now fighting with a gridengine outage [19:09:42] akosiaris, the upgrade or the db import? [19:09:46] bblack: I'm fine with another solution, what's the suggestion? [19:09:53] yurik: but you are going to spend budget money on a labs machine ? [19:09:54] awight: we've had this debate many times here, and there are definitely two sides to the debate. I think the un-submodule camp is winning hearts and minds the past several months though :) [19:10:00] yurik: all of it together [19:10:05] awight: suggestion for what problem? 
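On yurik's question about a read-only postgres account for the kartotherian service: the usual recipe looks like the following. The database name and password are placeholders, and the eventual setup may well differ, for instance if it gets tied to an LDAP user as akosiaris prefers.

```
-- sketch of a read-only role; placeholder names, not the actual grant set
CREATE ROLE kartotherian LOGIN PASSWORD 'changeme';
GRANT CONNECT ON DATABASE gis TO kartotherian;
GRANT USAGE ON SCHEMA public TO kartotherian;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO kartotherian;
-- cover tables created later as well
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO kartotherian;
```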
[19:10:08] akosiaris, gotcha, thx, will think about it [19:10:43] bblack: when we find ways to improve the mw-vagrant puppet modules for fundraising components, we should be porting by hand into the production puppet? [19:10:57] well [19:11:06] sorry, I'm definitely not interested in picking at the scab, I just want to know how we should proceed [19:11:11] FR is kind of a special case, you're always going to have counter-arguments about limit access/security. [19:11:26] I don't hink there's a clear path to proceed, yet. [19:11:33] yah security would be a second, private module or something [19:12:07] (03CR) 10Dzahn: [C: 032] add salt minion recon and auth timing params for performance [puppet] - 10https://gerrit.wikimedia.org/r/214056 (owner: 10ArielGlenn) [19:12:07] every time someone brings up mw-vagrant I'm lost, because I've never used that workflow/tool at all. [19:12:20] which says something about the necessity/utility of it I guess, from some perspective [19:12:34] bblack: it's a WMF specific ruby reimplemantion of vagrant [19:12:36] well it's not a big deal at this point, we're very far from being able to do something like this. I just get a little sad writing puppet modules for mw-vagrant, when Jeff_Green has already written everything for production [19:12:54] bblack: yeah IMO it's nothing but a wrapper around puppet and virtualbox [19:13:00] a nice one... [19:13:21] (03CR) 10Dzahn: "http://docs.saltstack.com/en/latest/ref/configuration/minion.html" [puppet] - 10https://gerrit.wikimedia.org/r/214056 (owner: 10ArielGlenn) [19:13:27] mediawiki-vagrant is a collection of Puppet config and a Vagrant plugin to make setting up developer VMs easier [19:13:35] we face similar issues with beta of course, which we're trying to solve with hieradata defining a split set of realm-specific data and getting the conditionals out of the puppet code. [19:13:38] (03PS1) 1020after4: Remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214108 [19:13:40] (03PS1) 1020after4: Add 1.26wmf8 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214109 [19:13:42] (03PS1) 1020after4: Wikipedias to 1.26wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214110 [19:13:44] (03PS1) 1020after4: Group0 to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214111 [19:13:57] I started poking at a Vagrantfile for ops/puppet -- https://gerrit.wikimedia.org/r/#/c/212294/ [19:14:03] it needs a lot of work [19:14:36] bblack: yeah, scratch my earlier comment about reimplementation of vagrant. That's Labs Vagrant [19:14:58] and bd808 is actually hoping I can help him ditch it and replace it with LXC ;-) [19:15:20] which I am hoping I 'll find time to do next week [19:15:30] cool :) [19:15:37] well I'm all for easy testing of unmerged ops/puppet stuff on a VM of some kind, but.... 
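For readers outside ops, the submodule pain being debated here is largely bookkeeping: a change to a module such as puppet/cdh lands in its own repo and then has to be pinned in ops/puppet with a pointer bump, which is exactly what the cdh update that appears a bit later in this log does. A generic sketch of that workflow:

```
# inside an ops/puppet checkout; generic git, nothing Wikimedia-specific
cd modules/cdh
git fetch origin && git checkout <reviewed-sha>
cd ../..
git add modules/cdh
git commit -m "Update cdh module to <reviewed-sha>"
# reviewers of ops/puppet now see only a pointer change, not the diff itself,
# and repo-wide grep/refactoring has to span both repositories
```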
[19:15:57] (03CR) 1020after4: [C: 032] Remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214108 (owner: 1020after4) [19:16:03] (03Merged) 10jenkins-bot: Remove stale symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214108 (owner: 1020after4) [19:16:08] (03CR) 1020after4: [C: 032] Add 1.26wmf8 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214109 (owner: 1020after4) [19:16:11] (03Merged) 10jenkins-bot: Add 1.26wmf8 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214109 (owner: 1020after4) [19:16:16] yeah, that specific issue has come up over and over again [19:16:23] I don't get why submodules are a critical part of this, except that they're trying to sidestep the differences in realm data, which the hieradata split should be doing anyways. [19:16:28] services folks had it with the cassandra submodule as well [19:16:51] submodules turned out to be somewhat counterproductive in that case as well [19:17:38] 6operations, 5Patch-For-Review: investigate txstatsd error logs - https://phabricator.wikimedia.org/T91464#1315336 (10chasemp) 5Open>3Invalid txstatsd is dead, statsite has risen I think [19:17:39] part of why I don't use it myself is I know it's not a final testing solution. Sure, I could submodule some generic code for a new service and test that independently of ops/puppet to get the code design right-ish.... but I'm still going to have to merge that onto a beta box or prod box in the real ops/puppet environment to find out how it *really* plays in prod [19:17:57] so I tend to just skip the first step and move on to the second, as it's a testing superset anyways [19:18:22] (03PS1) 10Ottomata: Workaround for ResourceManager WebApp proxy to ApplicationMaster bug [puppet/cdh] - 10https://gerrit.wikimedia.org/r/214114 [19:18:38] !log twentyafterfour Started scap: testwiki to php-1.26wmf8 and rebuild l10n cache [19:18:44] Logged the message, Master [19:18:49] (03CR) 10Ottomata: [C: 032 V: 032] Workaround for ResourceManager WebApp proxy to ApplicationMaster bug [puppet/cdh] - 10https://gerrit.wikimedia.org/r/214114 (owner: 10Ottomata) [19:19:15] (03PS1) 10Ottomata: Update cdh module with fix for RM webapp address bug [puppet] - 10https://gerrit.wikimedia.org/r/214115 [19:19:29] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with fix for RM webapp address bug [puppet] - 10https://gerrit.wikimedia.org/r/214115 (owner: 10Ottomata) [19:19:47] Interesting... we're fine writing separate puppet for both environments, and I understand that git submodules suck. [19:20:10] akosiaris: did that compile job finish? [19:20:21] eh, maybe we could be using built-in puppetlabs module stuff rather than submodules? [19:20:22] awight: I don't want anyone writing code twice [19:20:40] awight: I think that's a very FR-specific problem you're talking about with code-sharing between FR and the rest of prod [19:21:04] ah, I thought that's why other people were using submodules [19:21:05] the other projects, the code ends up being used from ops/puppet in practice, and submodules is supposedly just about the dev/testing pipeline [19:21:05] k [19:21:13] aha [19:21:16] (git submodules are great! har har har) [19:21:57] ottomata: I hear Linus hates them and didn't want to implement, which would explain the fail... but attempting to read the mailing list, I'm not sure it's the case. [19:22:00] and I suspect at the heart of why FR doesn't base itself on ops/puppet is security/access diffs from the rest of prod, and differing requirements (e.g. 
PCI) [19:22:11] hrm [19:22:16] !log twentyafterfour scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_1863397713" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 03m 38s) [19:22:18] I would defer to Jeff_Green there. [19:22:20] oh the implementation is super hacky [19:22:22] Logged the message, Master [19:22:25] yeah [19:22:30] andrewbogott: you'd wish... I am running it against the entire fleet so it should take about 1,5 hour or so [19:22:32] but in practice they are not so bad, but I am alone in that opinion here [19:22:36] at least on ops [19:22:46] akosiaris: ok then, I’ll be patient [19:22:48] I use all the time, but it stings [19:22:52] Jeff has his reasons, and I'm sure they're valid. I'm not really sure what the right answer there is to both (a) avoid writing code twice or copypasta and (b) avoiding the hell of submodules. [19:22:55] andrewbogott: unless you got a specific set of hosts you want to test against [19:23:25] akosiaris: yes! virt1000.wikimedia.org labnet1001.eqiad.wmnet labvirt1001.eqiad.wmnet [19:23:29] dry > submodule pain [19:23:46] ottomata: actually you are not. It's just they are not very well suited for ops/puppet repo [19:23:53] ok we removed mantle but apparently it's still configured? [19:23:56] but they are useful in other situations [19:24:05] submodule pain is pretty huge, though, from my perspective. It makes doing anything big/refactory in puppet difficult because I can't even see/grep all the code at once. [19:24:06] andrewbogott: ok, restarting it then. Gimme 3 mins [19:24:10] thanks [19:24:17] among the many other points we've debated here endlessly. [19:24:18] bblack: exactly [19:24:23] and history is even worse [19:24:30] ottomata: for FR though, our production and dev are so different that it's not really repetition. IMO both modules need to be merged towards each other. [19:24:48] we very often correlate things base on the ops/puppet or mediawiki-config history logs [19:24:51] so one could argue that refactoring impedance from submodules leads to shittier code in the long run than the DRY loss [19:24:51] heheh, that just means the submodule isn't modular enough, if you have to search/replace in both codebases to refactor, something is funky [19:24:56] * awight tosses some foam on the rising flames [19:25:04] ottomata: but it's not "both", it's like 10 coedbases [19:25:16] you are refactoring 10 codebases at once? [19:25:29] awight: look what you've done! now we are discussing submodules again! [19:25:31] twentyafterfour: I just pinged in -mobile [19:25:35] jesus. lemme get some peanuts. [19:25:36] I'm refactoring ops/puppet all at once, and it has 10 submodules [19:25:37] i think bblack and I must like doing this [19:25:48] twentyafterfour: re mantle [19:26:02] hey! what about using puppetlabs modules? [19:26:06] ottomata: actually many of our modules are not modular enough yet [19:26:12] awight: they suck! [19:26:18] right, a puppet module that deservces its own submodule should be [19:26:19] :( hehe k [19:26:24] I don't know how the managed that, but they did [19:26:29] s/the/they/ [19:26:40] the ops/puppet repo should use teh submodule module as an API, and treat it like that [19:26:50] if we did it really right, we'd version the submodule modules [19:27:06] sounds so nice in theory, but :P [19:27:18] greg-g: it's https://gerrit.wikimedia.org/r/#/c/210816/ [19:27:29] versioning them would be a huge pain, indeed. 
[19:27:33] well, maybe not huge [19:27:35] just a little more [19:27:36] sorry I thought that already got merged [19:27:44] there ends up being 100 ways to slice abstractions and interfaces, so no matter how well you structure and version that, there will always be a need to do some kind of operation or search that cross-cuts them all [19:27:45] gotcha [19:27:45] (03CR) 1020after4: [C: 032] Remove Mantle from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210816 (https://phabricator.wikimedia.org/T85890) (owner: 10Florianschmidtwelzow) [19:27:45] maybe versioning + submodule pain added up would be huge :p [19:27:51] (03Merged) 10jenkins-bot: Remove Mantle from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210816 (https://phabricator.wikimedia.org/T85890) (owner: 10Florianschmidtwelzow) [19:27:53] probably ... [19:27:57] anyway, submodule modules as an API works for me! :D [19:28:06] * awight remembers not to toss bananas into the powderkeg loading factory [19:28:16] awight: i think Jeff_Green is kinda already thinking about this a little bit [19:28:19] or, at least it is on his radar [19:28:30] !log twentyafterfour Started scap: testwiki to php-1.26wmf8 and rebuild l10n cache [19:28:33] as I am kindly nudging him to use the kafkatee puppet submodule [19:28:55] oh this isn't even close to a powderkeg debate. When you see us DDoSing each others' IRC bouncers to suppress the other sides' viewpoint, you know you've really hit a nerve in ops :) [19:28:59] (btw, i have no context for what you and bblack are talking about, i just chimed in here to distact you all, BYYYEEE) [19:29:11] :) [19:29:15] (not really bye) [19:30:50] RECOVERY - Host analytics1028 is UPING OK - Packet loss = 0%, RTA = 1.80 ms [19:31:26] anyways, my meta-view is that there's probably situations in which you could convince me something should be a submodule, sorta. Like say a truly-generic cassandra module that perfectly abstracts Cassandra in the most generic and complete way possible. [19:31:58] In theory there's no cross-cutting into that submodule's code, and it would rarely need updates (maybe for major version bumps of Cassandra itself for feature-parity) [19:32:21] andrewbogott: https://phabricator.wikimedia.org/P693 [19:32:28] labvirt1001 did not compile [19:32:34] but at that point I think you can argue that that module should really live Upstream somewhere, and be pulled in via something like an upstream module repo rather than be part of our codebase [19:32:35] 6operations, 10ops-eqiad: analytics1028, Replace system board, raid card - Disks OK - https://phabricator.wikimedia.org/T99947#1315366 (10Cmjohnson) The board and Raid controller have been replaced but a DIMM is bad. I requested a new one to be sent. memory training failure detected DIMM A1. I moved the DIMM... [19:32:38] the other 2 have some worrying diffs [19:32:47] I'll bring it up again once FR stuff is perfectly modular :) [19:32:50] and if it doesn't exist upstream, we should be creating and maintaining it upstream [19:32:58] akosiaris: thanks. I guess I’ll have another rev shortly [19:33:11] but puppet's infrastructure for maintaining versioned upstream modular abstractions sucks [19:33:19] andrewbogott: btw, some diffs are actually false. 
it's puppet, ruby 1.8 and non stable hashes [19:33:25] they need a CPAN equivalent [19:33:37] I figured I could ignore the hashes [19:34:00] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1315367 (10ArielGlenn) I have rewritten things to work around the issue and have done a full test run. It looks good. I'll try to get some of the tables run over the next... [19:34:44] the submodule boundaries we have in practice today are somewhat arbitrary and imperfect. ongoing operational concerns often force changes in them; they're volatile. [19:35:01] baaah too bad. Yeah, I downloaded a few forge modules for my own projects, but did not try updating. and like you all are saying, you do need to be able to lock to a version. [19:35:11] and that's just not a happy situation. it just feels like a repo with arbitrary boundaries in it that make operating on it suck more. [19:36:52] yeah, module base config vs domain config is annoying [19:37:10] but supporting that with -.d config directories is really nice when it can be done [19:40:50] oh! cmjohnson1, 1028 is back?!! [19:40:57] !log removed operations/puppet/varnish from gerrit, git.wikimedia.org and github. The repo was used as a git submodule but the workflow turned out to be cumbersome approximately a year ago and was no longer updated. Up to a few minutes ago, it only served as a source of confusion. It no longer does. [19:40:59] bblack: ^ [19:41:02] ;-) [19:41:02] ottomata: sort of [19:41:11] there is a bad DIIMM now [19:41:19] akosiaris: thanks :) [19:41:20] it is back in the cluster! [19:41:21] heh [19:41:41] it is..i let it go through post [19:41:58] you have to hit F1 to continue [19:42:11] heh, s'ok it will be fine [19:42:26] yeah..once the new DIMM arrives I will need to power off for a few mins [19:43:10] Is anyone who is A) a deployer and B) conscious able to do an extension backport for rillke? I'm too tired to be of any use but there's a Flickr bug that should get fixed [19:43:37] It's this one https://gerrit.wikimedia.org/r/214112 [19:43:45] so its missing some RAM for now, that's all cmjohnson1? that's fine [19:43:55] that's it [19:43:58] you can power it off and replace whenever you get it [19:44:01] no need to sync with me [19:44:09] okay cool. Hopefully tomorrow [19:44:12] gracefull shutdowns appreciated :) [19:44:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1315391 (10dr0ptp4kt) Approved. [19:44:35] ottomata: always! [19:44:52] :) [19:45:39] PROBLEM - puppet last run on analytics1034 is CRITICAL Puppet has 1 failures [19:50:40] RECOVERY - puppet last run on analytics1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:18] akosiaris: can you help me understand about ip 208.80.154.19 vs. 208.80.154.18? It looks to me like both are associated with eth0 on virt1000. Do you agree? [19:58:43] .19 is labs-ns0 [19:58:50] (03PS1) 10Alexandros Kosiaris: Introduce etherpad1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/214130 [19:58:59] like the service IP for labs DNS server [19:59:16] andrewbogott: what mutante says. The .19/32 is a service IP [19:59:20] note the /32 subnet mask [19:59:34] and it is indeed assigned on eth0 of virt1000 [19:59:35] mutante: yes, but that’s also virt1000. 1) I’m not clear why it has a separate ip and 2) if it needs it, how do I get one for labcontrol1001 as well? 
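For illustration of the /32 service-address pattern described just above (19:58-19:59): the service IP sits on eth0 next to the host's normal address, and moving the service between hosts is, at the lowest level, just removing and re-adding that /32. These raw `ip` commands are a sketch of the mechanism, not the puppetized way it is actually managed:

```
# on virt1000: the host address plus the labs-ns0 service address on eth0
ip -4 addr show dev eth0
#   inet 208.80.154.18/xx ...   <- normal host address
#   inet 208.80.154.19/32 ...   <- labs-ns0 service address, note the /32

# hand-moving the service to another host would amount to:
ip addr del 208.80.154.19/32 dev eth0    # on the old host
ip addr add 208.80.154.19/32 dev eth0    # on the new host
```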
[19:59:57] whoah, sorry about the underlining. Not sure why it did that [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150527T2000). [20:00:12] haha [20:00:17] andrewbogott: to differentiate between the DNS server and the host that may also have other services [20:00:37] andrewbogott: it makes the service movable. [20:00:56] we can start the service in another host, and assign the IP on that host [20:01:06] removing it of course from virt1000 beforehand [20:01:23] heartbeat for example used to use that model [20:01:49] you would get one for labcontrol1001 by finding an unused one in that public range and adding it to the zone templates [20:02:08] the is a VIP that is considered a shared resource and moved around hosts according to what heartbeat would detect [20:02:46] sigh, my typing is really bad at this hour. I think I am gonna go to sleep [20:02:53] mutante: https://gerrit.wikimedia.org/r/214130 [20:02:54] sorry, just one more minute :) [20:02:58] then you can use the puppet "interface::add_ip6_mapped" to also get an IPv6 address [20:03:06] I am going to try to move all the virt1000 services to labnet1001 soon. [20:03:08] mutante: as soon as this done, planet is next [20:03:19] When I do that, should I get a new service ip for labnet1001 and change the dns entry? [20:03:29] Or should I direct the existing IP to point to labcontrol1001? [20:03:38] akosiaris: ah :)) [20:03:39] (Sorry, when I typed ‘labnet1001’ earlier I meant ‘labcontrol1001’) [20:03:51] the former. The latter needs some downtime [20:03:59] you could do it if you are really fast [20:04:29] but it's best that we avoided killing DNS for all of labs [20:06:26] ok. So, how do I do the former? Since apparently that doesn’t live in any repo that I’ve ever looked at. [20:07:38] andrewbogott: if it doesn't, that bad. cause we got the code to do that [20:08:04] andrewbogott: DNS repo is at git clone https://gerrit.wikimedia.org/r/p/operations/dns.git [20:08:37] yes, ok, I know about the dns repo, but... [20:08:40] * andrewbogott reads again [20:08:56] andrewbogott: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/site.pp;dcb73a70b253ec91a9047065d6615e209450e56a$347 [20:09:05] that should do it at the host level [20:09:17] there is only one line in dns “labs-ns0 1H IN A 208.80.154.19” [20:09:40] andrewbogott: look for url-downloader as well in there [20:09:42] It associates the name and the IP but: how is that attached to a host? [20:09:54] Oh, ok, so there’s some puppet magic that refers to the hostname [20:09:56] I will try. [20:09:56] andrewbogott: ^ look at the diffusion link [20:10:12] no hostname btw [20:10:36] only ip. The url-dowloader string over there could be mangiafazoula for all you care [20:10:56] not that that would help anyone when reading it [20:11:13] (03CR) 10Mjbmr: [C: 031] Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) (owner: 10Shanmugamp7) [20:12:10] (03CR) 10Hashar: "This patch mix too many changes making it impossible to review. 
If you want to fix style, please reindent first then propose several chan" [dumps] - 10https://gerrit.wikimedia.org/r/207504 (owner: 10Dereckson) [20:12:30] the matching of host to DNS name, when not a service IP, is done in the DHCP config, it has the MAC address of an interface and the "fixed-address" is the DNS name [20:16:00] mutante: why did we ever organize DHCP configurations according to Com 0 Baudrate ? [20:16:55] mutante: disregard, it took me a while to figure it out [20:17:12] (03CR) 10Hashar: [C: 04-1] "You might want to fix other docstrings by running the docstrings flake8 plugin: https://pypi.python.org/pypi/flake8-docstrings" [dumps] - 10https://gerrit.wikimedia.org/r/207712 (owner: 10Dereckson) [20:17:33] akosiaris: different hardware has different serial port settings for the console redirect on mgmt [20:18:52] mutante: yeah I know. I just would not see the option pxelinux.configfile "pxelinux.cfg/ttyS0-115200"; line in dhcpd.conf [20:19:03] it would not register in my mind [20:19:41] ah right [20:19:41] mutante: I am unsure where to put the VM tbh... [20:19:54] akosiaris: the etherpad vm? [20:19:56] yes [20:20:05] which file would you choose ? [20:20:20] oh,hmm [20:20:34] (03PS1) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 [20:20:53] mutante: oh, serial_console, serial_speed on ganeti [20:20:55] ^ bblack akosiaris ottomata paravoid :P [20:21:03] that might work [20:21:20] akosiaris: S1-115200 because almost everything is in there. (5854 lines vs. 120 and 34) or an entirely new one [20:22:03] ori: I see a rewrite of the lib took place ;-) [20:22:21] ori no diamond? [20:23:22] mutante: serial_console: True [20:23:22] serial_speed: 38400 [20:23:28] yeah, let's change that... [20:24:14] akosiaris: yea, up to 115200 , right [20:25:33] ori: thought you were going to wrap the varnishapi stuff and we'd do this in diamond? [20:26:29] kinda nicer in diamond, because then you don't need to have a special service for it. but, if we do want to do it this way...why not use statsd python module instead of socket? [20:28:37] so deployment-jobrunner01 at 100% storage use on / [20:29:06] 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1315549 (10mmodell) @BBlack: websockets can't use a separate hostname, the browser's cross-domain policy explicitly forbids that. It does include a special hea... [20:29:12] ori, can you deploy https://gerrit.wikimedia.org/r/#/c/213010/ ? [20:34:01] greg-g: Do i understand the new deployment schedule right, that a new branch (say e.g. wmf9 from RL 1.26) get's deployed to all wmf wikis in one week (Tue, Wed, Thu, all wmf9)? [20:36:01] (03PS1) 10Mjbmr: Enable SandboxLink for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214247 (https://phabricator.wikimedia.org/T100513) [20:36:23] !log twentyafterfour Finished scap: testwiki to php-1.26wmf8 and rebuild l10n cache (duration: 67m 53s) [20:36:30] Logged the message, Master [20:37:57] FlorianSW: a new branch goes to group0 on Wed, group1 following Tues, everywhere following Wed [20:38:21] 67m for a scap is :(( [20:38:32] bd808: check your mail ;) [20:38:41] bd808: https://lists.wikimedia.org/pipermail/wikitech-l/2015-May/081863.html :D [20:38:46] greg-g: great, thanks : [20:38:48] * :) [20:38:53] 67 minute!? 
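Background for the varnishstatsd / "statsd python module instead of socket" exchange at 20:20-20:26: either way, what ends up on the wire is statsd's plaintext protocol, one `name:value|type` datagram per metric over UDP (8125 is the conventional statsd port). A throwaway way to emit such datagrams from a shell, assuming a listener on localhost and with made-up metric names:

```
echo 'varnish.frontend.requests:1|c'     | nc -u -w1 localhost 8125   # counter
echo 'varnish.backend.n_objecthead:42|g' | nc -u -w1 localhost 8125   # gauge
echo 'varnish.backend.fetch_ms:38|ms'    | nc -u -w1 localhost 8125   # timer
```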
[20:39:03] (03PS1) 10Alexandros Kosiaris: Introduce etherpad1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/214250 [20:39:05] well, all but 68 [20:39:16] (03PS1) 10Yuvipanda: ores: Make uwsgi listen on http [puppet] - 10https://gerrit.wikimedia.org/r/214251 [20:39:21] it's syncing a lot of servers now [20:39:32] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Make uwsgi listen on http [puppet] - 10https://gerrit.wikimedia.org/r/214251 (owner: 10Yuvipanda) [20:39:34] oh, cross dc? [20:39:36] makes sense [20:39:38] 478 [20:39:51] but 80 in parallel [20:39:51] 6operations: Encrypted password storage - https://phabricator.wikimedia.org/T96130#1315575 (10Dzahn) Tested pwstore (pwstore_0.0+git20150521-1_i386.deb ) built by Moritz and the process how to edit a test file. Works so far :) [20:40:14] and with mirrors in each row I think [20:40:21] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214110 (owner: 1020after4) [20:40:27] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214110 (owner: 1020after4) [20:40:39] (03PS4) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [20:40:41] (03PS9) 10Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/213543 [20:40:42] greg-g: neat! I'm not really here or I would have seen that already ;) [20:40:45] still only 12 proxy hosts [20:41:32] akosiaris: if you are still up, can you re-run the same compiler job with the latest patchset? change id 213543, hosts virt1000, labnet1001, labvirt1001 [20:41:38] 1 is disabled it seems [20:41:55] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf7 [20:41:59] I guess it's 1 per row in codfw too? [20:42:00] Logged the message, Master [20:42:10] uh, rack not row [20:42:27] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214111 (owner: 1020after4) [20:42:33] (03Merged) 10jenkins-bot: Group0 to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214111 (owner: 1020after4) [20:45:41] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf8 [20:45:50] Logged the message, Master [20:46:56] !log twentyafterfour Purged l10n cache for 1.26wmf6 [20:47:03] Logged the message, Master [20:47:36] apparently each host is taking somewhere between 3 minutes and 8 minutes to sync [20:48:12] each branch is pretty big [20:48:38] (03CR) 10Mjbmr: [C: 031] CX: Add wikis for CX deployment on 20150528 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213992 (https://phabricator.wikimedia.org/T99535) (owner: 10KartikMistry) [20:48:58] twentyafterfour@tin:/srv/mediawiki-staging/php-1.26wmf8$ du -hs [20:49:01] 4.0G . [20:49:06] * AaronSchulz deletes 4g of logs but still sees 100% on / [20:49:17] * AaronSchulz scratches head [20:49:29] buffered? [20:49:29] twentyafterfour: that should mostly be the l10n cache [20:49:32] AaronSchulz: something still has the log files open? 
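One way to confirm the "something still has the log files open" theory from 20:49: deleting a file does not free its blocks until the last process holding it open exits, and lsof can list exactly those files. A sketch:

```
# open-but-unlinked files (link count 0), still counted by df on /
sudo lsof +L1
# the fix is to restart or HUP whatever process is holding the old log open
```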
[20:49:39] apt-get clean and shit might save some [20:50:33] bd808: 2.5g of cache [20:50:34] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Elena Tonkovidova labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100560#1315641 (10hashar) [20:50:42] twentyafterfour, yep, restarting the runner made it go to 76 [20:51:10] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk, 3Fundraising Sprint L: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1315643 (10atgo) [20:51:14] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1315644 (10AndyRussG) [20:51:28] 2.5G is "cache", yea [20:51:55] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk, 3Fundraising Sprint L: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1249877 (10AndyRussG) [20:52:13] mutante: s/cache/pre-built cdb files for really really fast hash lookup/ [20:52:16] uhm: 319 Invalid argument: function: not a valid callback array in /srv/mediawiki/php-1.26wmf8/includes/registration/ExtensionRegistry.php on line 172 [20:52:45] know what extension it is? [20:52:48] (03CR) 10Mjbmr: [C: 031] Enable NewUserMessage on sa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) (owner: 10Dereckson) [20:53:04] no [20:53:12] not sure how to tell [20:53:19] live hack probably :( [20:53:45] we need to patch hhvm to give stack traces on fatals :/ [20:53:46] twentyafterfour: possibly confirmedit/fancycaptcha.. let me find the bug [20:54:09] twentyafterfour: argh, no bug, just a change: https://gerrit.wikimedia.org/r/#/c/214086/ [20:55:22] (03CR) 10Mjbmr: [C: 031] Modify AbuseFilter block configuration on eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669) (owner: 10Glaisher) [20:56:05] FlorianSW: cherry-picked to wmf/1.26wmf8 [20:56:20] twentyafterfour: see it :) [20:58:34] twentyafterfour: sorry for this interruption :( [21:04:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 33.33% of data above the critical threshold [500.0] [21:09:12] FlorianSW: it's ok :) [21:09:24] :) [21:09:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 1 below the confidence bounds [21:09:59] !log twentyafterfour Synchronized php-1.26wmf8: Fix ConfirmEdit fatal Change-Id: I22353669a85391c3d9760a5253cac1263e895cf9 (duration: 01m 08s) [21:10:05] Logged the message, Master [21:10:05] mw2187 is getting an unfair share of rsync requests in scap -- https://phabricator.wikimedia.org/P93#3264 [21:11:50] what's going on? [21:12:30] ok I see backscroll [21:12:44] bblack: Nothing horrible; just scap taking longer than I would like to see [21:13:06] there was a decent reqerr spike [21:13:16] but not horrible in the grand scheme of things [21:13:44] error rate? that would be the fatal which should now be fixed [21:13:54] ok [21:14:02] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315712 (10yuvipanda) 3NEW [21:14:18] FlorianSW, would that be another breakage in the 1.25 ConfirmEdit release? 
[21:14:28] I'm still seeing some junk in fatalmonitor but that's a separate issue which I reported in phabricator [21:14:48] https://phabricator.wikimedia.org/T100558 [21:14:48] PROBLEM - Translation cache space on mw1017 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:14:58] Krenair: just if you use FancyCaptcha, and it's not a fatal, it's a warning. If there are fatal errors, they doesn't come from ConfirmEdit [21:15:04] twentyafterfour: ^ [21:15:56] andrewbogott: https://phabricator.wikimedia.org/P694 [21:16:05] Krenair: and if you use ReCaptcha you get https://gerrit.wikimedia.org/r/#/c/214046/ / https://phabricator.wikimedia.org/T100505 [21:16:10] seems like labvirt1001 is a noop [21:16:11] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Elena Tonkovidova labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100560#1315727 (10Krenair) Why is L3 signature required to be in the wmf group? [21:16:45] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Rummana Yasmeen labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100559#1315729 (10Dzahn) I think just adding a user to the WMF LDAP group is not considered "server access", so unless we create an actual shell account, signing... [21:17:29] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315736 (10yuvipanda) Related to T100554 [21:18:13] bd808, https://gerrit.wikimedia.org/r/#/c/213010/ [21:18:58] Krenair: ah, and second one for recaptcha: https://gerrit.wikimedia.org/r/#/c/214045/ / https://phabricator.wikimedia.org/T100504 [21:19:14] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Rummana Yasmeen labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100559#1315739 (10Dzahn) done. since i could confirm that user ryasmee already existed and uses an @wikimedia.org email address that is fine. [terbium:~] $ ldap... [21:19:29] (03CR) 10BryanDavis: [C: 031] Fixed totally broken runner JSON response code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213010 (owner: 10Aaron Schulz) [21:19:41] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1315740 (10hashar) 3NEW [21:19:59] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Rummana Yasmeen labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100559#1315750 (10Dzahn) 5Open>3Resolved [21:20:20] (03CR) 10BryanDavis: Remove FormatJson from mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: 10Ori.livneh) [21:22:09] (03PS1) 10Hashar: Add all Release-Engineering team as Gerrit admins [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) [21:22:33] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Elena Tonkovidova labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100560#1315774 (10Dzahn) done. the user is already registered with an @wikimedia.org which is sufficient to show user is an empoylee and should be added to th... 
[21:22:59] 10Ops-Access-Requests, 6operations, 10Browser-Tests: Please add Elena Tonkovidova labs account to LDAP group wmf - https://phabricator.wikimedia.org/T100560#1315776 (10Dzahn) 5Open>3Resolved [21:26:23] Request: GET http://cs.wikiversity.org/wiki/Wikiverzita:Diskuse_o_smaz%C3%A1n%C3%AD, from 10.20.0.166 via cp3040 cp3040 ([10.20.0.175]:3128), Varnish XID 956098580 [21:26:26] Forwarded for: 88.101.17.127, 10.20.0.166, 10.20.0.166 [21:26:29] Error: 503, Service Unavailable at Wed, 27 May 2015 21:26:03 GMT [21:29:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [21:30:19] PROBLEM - Translation cache space on mw1250 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:30:59] PROBLEM - Translation cache space on mw1244 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:31:10] PROBLEM - Translation cache space on mw1254 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:31:19] PROBLEM - Translation cache space on mw1238 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:31:37] https://gdash.wikimedia.org/dashboards/reqerror/ [21:32:05] i was just looking at that, and wishing again that i could change the time scale [21:32:32] possibly it's the TC stuff [21:32:39] someone reported an error in -tech. http 504 [21:32:50] 504 is unusual [21:32:54] 503* sorry [21:33:01] yeah we have one here too [21:33:02] typo [21:33:39] PROBLEM - Translation cache space on mw1243 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:33:41] I'm gonna kick off a rolling restart of hhvm then [21:33:49] PROBLEM - Translation cache space on mw1256 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:33:58] bblack: ok [21:34:11] does anyone know from previous stuff what a good time interval is for that to not make matters worse? [21:34:21] I think I did 90s before and nothing got hurt, but it would be nice to go faster [21:35:16] same 503 for for de.wp's watchlist page, also mentioning cp3040 [21:35:41] cp3040 is just a very-likely source, I don't think the problem is at the cache layer; there's been no changes there [21:36:03] whereas we have a deployment just above, and if nothing wrong with the deploy itself, perhaps TC issues from hhvm not liking deploys in general. [21:36:20] ^this seems so much more likely to me considering [21:36:29] we know hhvm does not recover on its own [21:36:33] (03CR) 10Andrew Bogott: [C: 032] Labcontrol1001 will be the new virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/214102 (owner: 10Andrew Bogott) [21:37:49] salt sucking doesn't help either [21:38:57] is there a way to see post deploy the deploy health state of a host? [21:39:02] like they said deploy was taking long [21:39:19] is there something I can run to see on a particular mw* host that everything is good? 
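There is no single command in the scroll answering "is there something I can run to see on a particular mw* host that everything is good?", but the Icinga checks that page later in this log are plain HTTP probes and can be approximated by hand. A rough sketch; the hostname is an example and the HHVM admin port/endpoint is an assumption, not confirmed in this log:

```
host=mw1017.eqiad.wmnet

# roughly what the "Apache HTTP" / "HHVM rendering" checks do:
# fetch a page through the host's webserver and expect 200/301 rather than 500
curl -s -o /dev/null -w '%{http_code}\n' \
     -H 'Host: en.wikipedia.org' "http://$host/wiki/Main_Page"

# if the HHVM admin server is reachable (port 9002 assumed here), it can also
# report translation-cache usage directly
curl -s "http://$host:9002/check-health"
```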
[21:39:58] RECOVERY - Translation cache space on mw1238 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:40:13] so that's me doing a manual restart ^ and hhvm coming back [21:42:06] !log restarting hhvm everywhere on 30s intervals between hosts [21:42:11] Logged the message, Master [21:42:24] ^ which will take a few hours, I think we have to just manually restart the alerting ones in the interim [21:42:30] I really don't know if it's faster to restart them all quicker [21:42:36] err, s/faster/safe/ [21:43:20] the last time we saw this, the pattern started out like this with a handful, and then spread to many within an hour or so, many of which SIGABRT -> fix themselves, some of which didn't [21:43:36] (03CR) 10QChris: [C: 031] Add all Release-Engineering team as Gerrit admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [21:43:43] I see quite a few HHVM_TC_SPACE WARNING code.main: 90% in icinga [21:43:47] going ot hit a batch here directly [21:43:51] same here [21:43:54] mw1017 now [21:44:49] so the loop will get them all eventually, but, yeah, I guess focus on CRITs from icinga? [21:45:17] we should kill the init script too? [21:45:26] the loop is going linearly from mw1000 -> mw1258, it's on like #7 now [21:45:29] RECOVERY - Translation cache space on mw1017 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:45:29] since we are not supposed to use it and it causes exceptions [21:45:31] such as [21:45:33] Uncaught exception: HHVM no longer supports the built-in webserver as of 3.0.0. Please use your own webserver (nginx or apache) talking to HHVM over fastcgi. https://github.com/facebook/hhvm/wiki/FastCGI\n [21:45:38] mutante: just "service hhvm restart" [21:45:59] bblack: on the reqerror graphs when i select the option to show code deploys, sync-file and sync-dir are the same color (afaict). which one is the dotted line and which is solid? 
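A sketch of what the "restarting hhvm everywhere on 30s intervals between hosts" loop from 21:42 amounts to; the host range follows the mw1000 -> mw1258 remark above, and the guard for hosts that do not run HHVM is illustrative rather than how it was actually done:

```
for i in $(seq 1000 1258); do
  h="mw${i}.eqiad.wmnet"
  ssh "$h" 'status hhvm >/dev/null 2>&1' || continue   # skip missing/non-HHVM hosts
  ssh "$h" 'sudo service hhvm restart'
  sleep 30
done
```

(Something like `salt --batch-size 1 ... cmd.run 'service hhvm restart'` would batch the same thing, but per the remarks above salt was not being cooperative.)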
[21:46:02] bblack: yea, i did that and it made it recover, just saying we should probably removed that anyways [21:46:13] jgage: no idea [21:46:48] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:47:19] PROBLEM - Translation cache space on mw1246 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:47:19] RECOVERY - Translation cache space on mw1243 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:47:21] !log restarted hhvm on mw1017,mw1243,mw1244 [21:47:27] Logged the message, Master [21:47:43] the looped restarts should theoretically finish at about 23:00 UTC [21:47:59] RECOVERY - Translation cache space on mw1244 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:48:18] RECOVERY - Translation cache space on mw1254 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:48:26] not all mw* even have hhvm, but I didn't think of a clever way to loop that into the delays either :/ [21:48:31] or to know which are really affected [21:49:09] RECOVERY - Translation cache space on mw1250 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:49:18] RECOVERY - Translation cache space on mw1256 is OK: HHVM_TC_SPACE OK TC sizes are OK [21:49:22] !log restarted hhvm on mw1250,mw1254,mw1256 [21:49:27] Logged the message, Master [21:49:31] i think that's all that was on icinga right now [21:50:05] lots more in warning [21:50:06] there's about 180 cases of: HHVM_TC_SPACE WARNING code.main: 90% [21:50:10] ah, mw1200 has a different one: "HHVM rendering" connection refused [21:50:10] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [21:50:16] but hopefully/maybe those do ok until the loop reaches them? [21:50:35] we need to turn off that alert [21:50:54] I wish we could just hit them faster, but I just don't know what's sane there. If I did them serially with ~1-2s from ssh delays it might cause all kinds of problems, I think [21:51:06] ori: why? [21:51:13] ori: the rendering one? [21:51:20] the translation cache space one [21:51:27] they're not actionable [21:51:29] PROBLEM - Translation cache space on mw1252 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [21:51:34] ori: why not? [21:51:43] because when hhvm runs out of space, it restarts [21:51:57] logging in to each machine and restarting hhvm doesn't improve matters [21:52:28] you don't think high TC consumption but not-yet-restarted contributes to errors? [21:53:05] last time this happened (was it last week?) many of them did self-restart (via SIGABRT -> upstart restarting) [21:53:06] no [21:53:47] but in the face of that pending across a large amount of them, I still think we're better off with a slow rolling restart to reset the clocks than letting them potentially SIGABRT in large/close batches [21:54:09] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 3 failures [21:54:37] i don't agree, i think it's a lot of manual work and chaos on the channel with alerts for no palpable gain [21:55:22] if they hit their limits and restarted at unrelated times, maybe [21:55:50] I don't agree it's operationally acceptable to just let them all die in clumps when an scap hits [21:55:50] well, i had a thing to restart them at random times throughout the day [21:55:59] i never said it was operationally acceptable [21:56:04] there's a bug and i'm working on it [21:56:20] i just don't think the alerts help [21:56:42] so is that responsable for a 5xx spike? 
because it hasn't completely gone away yet still I think [21:56:55] well in this case, the alerts are at least telling us it's pending, which has caused me to start a slow rolling restart, which might prevent clumped restarts at some point in the near future that might otherwise happen. [21:57:01] chasemp: I don't know [21:57:17] that's !log restarting hhvm everywhere on 30s intervals between hosts i suspect [21:59:00] if it is, it's still better than a clumped restart pending soon? [21:59:06] but really, I don't think that's the source of it [21:59:18] I can stop it for a bit and test that theory if you'd like [22:00:14] sure, yeah [22:00:22] paused it (it just finished mw1031) [22:00:32] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315872 (10scfc) It was introduced with de059228933681b6b0f97a818a561cde22901e1e ("Initial commit of public puppet repo."). In the past IIRC we have increased the caching TTLs several times to reduce network l... [22:09:00] anyways, I think 5xx is still ongoing regardless of TC spike or TC-related restart [22:09:12] I can see backend health issues at the varnish level too, but I'm not sure if they're indirect... [22:10:02] let's check fatals [22:10:40] cp1052, cp1052, cp1065 - all of these are showing varnish->varnish backend health failures ... [22:10:50] whooooooweeee [22:10:59] even cp1065->cp1065 [22:11:01] lots of junk in fatal log [22:11:01] who deployed? [22:11:03] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1315908 (10JAufrecht) Joel, Kevin, Chris, and Chase met to resolve discussion: **Identify people who need to be involved in this decision** Do we have everyone necessary to approve... [22:11:09] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1315912 (10JAufrecht) a:3JAufrecht [22:11:09] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [22:11:49] ACKNOWLEDGEMENT - puppet last run on labcontrol1001 is CRITICAL Puppet has 2 failures andrew bogott precise vs. trusty differences. Im working on it. [22:12:14] ori: train deploy [22:12:18] icinga error = Duplicate definition found for host 'labs-ns0.wikimedia.org' [22:12:22] andrewbogott: [22:12:26] * Reedy goes to have a look [22:12:38] mutante: yep that’s me. [22:12:44] I’ll look [22:13:20] the health issues are because varnishd backends are legitimately taking too long to answer simple /check health requests even over localhost [22:13:22] andrewbogott: did it get new puppet keys? [22:13:28] but the question is what kind of overload is causing that... [22:13:42] mutante: I don’t understand your quetion. Did what get new puppet keys? [22:13:47] since it's specific to a few text backends, it's probably specific to certain request volumes that hash to those... [22:13:47] andrewbogott: labs-ns0 [22:14:00] that’s just a service ip on virt1000 [22:14:09] andrewbogott: icinga thinks it's a host though [22:14:14] which is probably getting applied to labcontrol1001 as well [22:14:18] I don’t know why icinga thinks taht [22:15:07] bblack: which backends? 
[22:15:37] cp1052, cp1053, cp1065, that I've seen [22:16:02] as in, eqiad frontends or ulsfo|esams backends hitting those, are showing a heavy rate of healthcheck failures just to those 3x varnish-be's in eqiad [22:16:17] cp1065 fe->be over localhost shows the same, it's not network [22:16:21] seems to be subsiding? https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1040&height=520&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [22:16:30] andrewbogott: yes, odd, puppetstoredconfigclean says "Can't find host labs-ns0.wikimedia.org. [22:16:35] and since it's only 3/N, it's likely related to certain URLs hashing to those 3.... [22:16:39] andrewbogott: but at the same time: puppet_hosts.cfg: host_name labs-ns0.wikimedia.org [22:16:42] "The MariaDB server is running with the --read-only option so it cannot execute this statement (10.64.16.156)" [22:17:02] ori: parser cache [22:17:14] andrewbogott: let's see if puppet adds it back when i remove it? [22:17:16] ori: any hints of something that would cache-bust varnish? [22:17:28] I'm starting to see some stats like this could just be massive cache-invalidation/bust [22:17:29] mutante: sure, worth a try [22:18:21] just after 21:00 UTC (train time?) is when stats go crazy on those varnish backends [22:19:03] free space dies, allocator failures surge, seems like backends start closing connections on us earlier than they should, etc [22:19:43] the 5xx spike more or less coincides with you starting the rolling restart and then pausing it [22:19:57] it's tapering off now [22:20:16] (03CR) 10Mobrovac: [C: 04-1] CX: Log to logstash (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [22:20:27] (03PS1) 10Andrew Bogott: Don't include include role::dns::ldap on labcontrol1001 yet. [puppet] - 10https://gerrit.wikimedia.org/r/214260 [22:20:28] ori: you mean the one visible in https://gdash.wikimedia.org/dashboards/reqerror/ ? [22:20:32] yeah [22:20:33] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1315926 (10MZMcBride) >>! In T85141#1314662, @Dzahn wrote: >>>! In T85141#1314263, @MZMcBride wrote: >> I think static-bugzilla and old-bugzilla are sufficient. > > The plan is to re... [22:20:38] you might argue the peaks do, but not the background level which continues [22:20:46] and I still have varnish health problems that are not tapering off at all [22:20:54] andrewbogott: yes, it adds it back [22:21:03] (03CR) 10Andrew Bogott: [C: 032] Don't include include role::dns::ldap on labcontrol1001 yet. [puppet] - 10https://gerrit.wikimedia.org/r/214260 (owner: 10Andrew Bogott) [22:21:08] oh yeah, hmm [22:21:20] mutante: I don’t know why icinga is doing that, but ^^ should fix it for the moment. [22:21:30] andrewbogott: ok! [22:21:39] (also the first few peaks were before the rolling restart) [22:22:15] you're right, my bad [22:22:24] I'm starting to think this is all about one special URL or URL-pattern [22:22:28] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1315927 (10Dzahn) Well the sanitized dump is basically done now, so we can release it soon anyways. 
[22:22:48] and the reason 3 backends are involved is that various varnishes are failing that request off down the chash as they mark the first option unhealthy for a while, etc [22:23:19] and that request is causing the applayer backends to abort on us in some awful way that kills varnish perf... [22:23:27] (closing persistent connections) [22:23:52] mutante: try now? [22:23:59] http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Text+caches+eqiad&h=cp1065.eqiad.wmnet&jr=&js=&v=71983&m=varnish.backend_toolate&vl=N%2Fs&ti=Backend+conn.+was+closed [22:24:13] andrewbogott: ok, running puppet [22:24:24] ^ cp1065 backend connection-close-rate spiking post-21:00, in spite of a dropping request rate because frontends aren't choosing it as often [22:24:35] mutante: I’m not sure if removing from puppet is sufficient or if I need to manually clean up something [22:25:18] andrewbogott: the manually cleaning from icinga would usually be this: [22:25:28] [palladium:~] $ sudo puppetstoredconfigclean.rb labs-ns0.wikimedia.org [22:25:32] but it can't find it [22:25:43] try cleaning labcontrol1001 [22:26:07] Killing labcontrol1001.wikimedia.org...done. [22:26:38] "uri_query": "?action=query&format=json&meta=filerepoinfo&smaxage=86400&maxage=86400", [22:26:38] waits for puppet run on neon [22:26:46] ori: what's that ^ seems common in 503s [22:27:38] seems like real browsers, e.g.: [22:27:40] "http_method": "GET", [22:27:40] "uri_host": "en.wikipedia.org", [22:27:40] "uri_path": "/w/api.php", [22:27:40] "uri_query": "?action=query&format=json&meta=filerepoinfo&smaxage=86400&maxage=86400", [22:27:43] "content_type": "text/html; charset=utf-8", [22:27:46] "referer": "http://en.wikipedia.org/wiki/New_Jersey", [22:28:00] mediaviewer, iirc [22:28:12] memcached traffic spiked: http://ganglia.wikimedia.org/latest/graph.php?c=Memcached%20eqiad&m=cpu_report&r=4hr&s=by%20name&hc=4&mc=2&st=1432765649&g=network_report&z=medium [22:28:54] that URL-pattern accounts for the vast majority of recent 503 [22:28:59] filerepoinfo thing [22:29:34] why are we the only ones looking at this? [22:29:45] twentyafterfour: hey, backup? [22:30:01] (03PS1) 10Ottomata: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 [22:32:06] (03PS2) 10Ottomata: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 [22:32:16] funny it's cached for me when I hit that directly on en.wp [22:32:42] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [22:33:02] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: Connection refused [22:33:35] (03CR) 10Ori.livneh: "Nice!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [22:34:06] bblack: i suggest we roll back wmf8 for now [22:34:11] greg-g: hey [22:34:23] ori: +1, I'm still chasing all kinds of random leads with no end in sight :/ [22:34:30] 6operations, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1315995 (10greg) [22:34:57] ori: wasn't sure if that abstracted callback signature for python use includes enough fro you [22:35:06] i only need tag and value, but you need more for your thing [22:35:06] the whole content of that filerepoinfo request is: [22:35:08] {"query":{"repos":[{"name":"shared","displayname":"Wikimedia Commons","rootUrl":"//upload.wikimedia.org/wikipedia/commons","url":"//upload.wikimedia.org/wikipedia/commons","thumbUrl":"//upload.wikimedia.org/wikipedia/commons/thumb","initialCapital":"","descBaseUrl":"//commons.wikimedia.org/wiki/File:","scriptDirUrl":"//commons.wikimedia.org/w","fetchDescription":"","favicon":"/static/favicon/c [22:35:09] ori: hi [22:35:11] so i included what you seemed to use in yours [22:35:14] ommons.ico"},{"name":"local","displayname":"Wikipedia","rootUrl":"//upload.wikimedia.org/wikipedia/en","local":"","url":"//upload.wikimedia.org/wikipedia/en","thumbUrl":"//upload.wikimedia.org/wikipedia/en/thumb","initialCapital":"","scriptDirUrl":"/w","favicon":"http://en.wikipedia.org/static/favicon/wikipedia.ico"}]}} [22:35:21] but not everythign from the C api, e.g. no bitmap, ptr, spec, length, etc. [22:35:28] which seems pretty static, yet it's hit-for-pass back through to mediawiki, and I've seen some requests take several seconds to return a response back to me [22:35:45] greg-g: the deploy of wmf8 coincides with a number of issues: a spike in 5xx errors; a spike in memcached traffic; and some strange errors in the logs. [22:36:04] greg-g: i am not able to triage this right now, so unless someone else wants to look i suggest rolling back wmf8. [22:36:15] but it could be that filerepoinfo is a just a victim of poor caching due to something else destroying caches, too [22:37:07] I just had a manual curl test of filerepoinfo through esams text-lb take 38s but still return the correct result, and it was a cache-miss through to mediawiki [22:37:32] again, I'm just not sure if that's primary to what's going on or not [22:37:43] twentyafterfour: you still around? [22:38:34] (03CR) 10Dereckson: [C: 031] Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) (owner: 10Shanmugamp7) [22:39:42] ori: is it wmf8 (on testwikis) or 7 (on wikipedias)? [22:39:50] the latter [22:39:56] eek [22:40:04] sorry --- i should have been clearer -- i mean roll back the train deploy [22:40:14] (03CR) 10Dereckson: "Add perhaps a // T100513" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214247 (https://phabricator.wikimedia.org/T100513) (owner: 10Mjbmr) [22:40:24] i pinged twentyafterfour above too, i don't think he's around [22:40:29] gah [22:40:52] i'd drop everything and look after this but i'm somehow trying to do https://phabricator.wikimedia.org/T100454 in the background as well [22:41:14] well I'm still looking too, but honestly this is just not looking like anything simple and fast to figure out [22:41:15] bblack is on it but i think that to be effective he's going to need someone poking at this from the MediaWiki end. [22:41:21] ori, roll back only mmv? 
[22:41:31] was there a notable change to mmv? [22:41:35] I can look, btw [22:41:38] MMV makes the call, it doesn't respond [22:41:50] right [22:42:12] MaxSem: if you're looking, can you also assume responsibility for performing a rollback if you can't isolate the issue? [22:42:27] (and thanks for offering btw) [22:42:31] I can try, just received my access back [22:42:44] but I've never messed with wikiversions before [22:42:59] * MaxSem pokes marktraceur [22:43:16] greg-g: perhaps call twentyafterfour if it's not too late where he is? [22:43:42] mutante: fixed? [22:43:49] hmm, https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/MultimediaViewer,n,z look empty [22:43:52] I'm fairly certain now that the bulk of the 503s are due to unhealthy varnish backends, which are getting unhealthy because they're stacking up hung/closing connections to mediawiki for filerepoinfo... [22:44:01] ori: yeah, doing that now [22:44:22] but it's still possible that's because it was always that way, but usually cached better, and something else is injuring cacheability in general on the same hash subset of the URL space [22:45:22] perf spiked as well - https://performance.wikimedia.org/#!/day [22:45:30] see backend response [22:45:53] mukunda is coming [22:46:20] twss [22:46:24] okay, I see it makes this request when you click on an image [22:46:50] it shouldn't take long long times for MW to answer that request though, right? [22:47:00] it seems like it's pretty basic info inside the response data [22:47:21] (it's not always long, it's intermittent) [22:47:21] also, the url sugeests it should be cached for anons, at least [22:47:28] andrewbogott: yes, fixed. thanks [22:47:33] MaxSem: it's not :/ [22:47:39] at least, not now [22:48:47] expires:Thu, 28 May 2015 09:48:19 GMT [22:49:03] cache-control:s-maxage=86400, max-age=86400, public [22:49:16] that's if you're anon [22:50:00] x-cache:cp1065 hit (2), cp4010 hit (3), cp4009 frontend hit (310) [22:50:19] (03PS1) 1020after4: roll back everything to 1.26wmf6 except testwiki to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214264 [22:50:41] (03CR) 10Ottomata: "Abstracted out the C API here:" [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [22:50:53] sometimes, but not always [22:51:00] (03CR) 1020after4: [C: 032] roll back everything to 1.26wmf6 except testwiki to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214264 (owner: 1020after4) [22:51:04] to me it sounds like the problem is that filerepoinfo is getting slow? 
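The slow/503 filerepoinfo responses and header checks discussed around 22:37-22:50 can be reproduced with a plain curl against the public entry point, watching the caching headers and the total time together. A sketch of that kind of test, not the exact command used:

```
curl -s -D - -o /dev/null -w 'time_total: %{time_total}s\n' \
  'https://en.wikipedia.org/w/api.php?action=query&format=json&meta=filerepoinfo&smaxage=86400&maxage=86400' \
  | grep -iE 'x-cache|cache-control|time_total'
```

A healthy cached response comes back in milliseconds with a frontend hit in X-Cache; the problem responses were misses/passes that went all the way through to MediaWiki.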
[22:51:06] (03Merged) 10jenkins-bot: roll back everything to 1.26wmf6 except testwiki to 1.26wmf8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214264 (owner: 1020after4) [22:51:18] the 38s response time I had earlier was: [22:51:20] < X-Cache: cp1052 miss (0), cp3014 hit (209), cp3014 frontend miss (0) [22:51:37] (I'm assuming the hit in the middle is actually hit-for-pass or else nothing else makes sense) [22:52:11] again, it could be that this is always potentially slow, usually cached, but something else is blowing it out of cache frequently because of an unrelated cacheability problem [22:52:27] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: roll back everything but testwiki to 1.26wmf6 [22:52:43] PROBLEM - Translation cache space on mw1216 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [22:52:59] bblack, there should be also a number of varying queries like https://en.wikipedia.org/w/api.php?action=query&format=json&prop=imageinfo&titles=* [22:53:01] Request: GET http://cs.wikiversity.org/w/index.php?diff=62008&oldid=61966, from 10.20.0.105 via cp1067 cp1067 ([10.64.0.104]:3128), Varnish XID 2662401229 [22:53:04] Forwarded for: 88.101.17.127, 10.20.0.175, 10.20.0.105 [22:53:05] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1316044 (10QChris) >>! In T99990#1312092, @QChris wrote: > Starting forced replication nonetheless That back-fired due to github overloading repo na... [22:53:07] Error: 503, Service Unavailable at Wed, 27 May 2015 22:52:44 GMT [22:53:14] e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&prop=imageinfo&titles=File%3AChiefOshkosh%2Ejpg&iiprop=timestamp%7Cuser%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata&iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestric [22:53:15] tions&iiextmetadatalanguage=en [22:53:33] 6operations, 7database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1316047 (10Krenair) [22:53:49] enwiki is down [22:53:53] PROBLEM - Translation cache space on mw1177 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [22:53:56] um [22:53:59] who broke the site [22:54:00] everything is [22:54:01] * tfinc looks about [22:54:02] PROBLEM - Translation cache space on mw1184 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91% [22:54:19] no localisation cache [22:54:20] 2015-05-27 22:53:59 mw1004 euwiki exception INFO: [3548d54c] /rpc/RunJobs.php?wiki=euwiki&type=htmlCacheUpdate&maxtime=30&maxmem=300M MWException from line 466 of /srv/mediawiki/php-1.26wmf6/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [22:54:22] [75274fee] 2015-05-27 22:54:15: Fatal exception of type MWException [22:54:25] I get : [4731d3e2] 2015-05-27 22:53:37: Fatal exception of type MWException [22:54:47] Did it get nuked and then something rollbacked? 
[22:54:50] I guess so [22:54:53] (03PS1) 10Ori.livneh: Revert "roll back everything to 1.26wmf6 except testwiki to 1.26wmf8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214265 [22:54:54] twentyafterfour: ^ [22:55:00] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "roll back everything to 1.26wmf6 except testwiki to 1.26wmf8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214265 (owner: 10Ori.livneh) [22:55:09] thanks ori [22:55:13] uhm [22:55:18] yikes [22:55:22] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [22:55:33] what went wrong? [22:55:39] what is on 10.20.0.175? [22:55:43] PROBLEM - Apache HTTP on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.037 second response time [22:55:43] PROBLEM - Apache HTTP on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.030 second response time [22:55:43] PROBLEM - HHVM rendering on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.052 second response time [22:55:43] PROBLEM - HHVM rendering on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.041 second response time [22:55:44] PROBLEM - HHVM rendering on mw1215 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.055 second response time [22:55:44] PROBLEM - HHVM rendering on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.045 second response time [22:55:44] PROBLEM - Apache HTTP on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.035 second response time [22:55:44] PROBLEM - HHVM rendering on mw2043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.125 second response time [22:55:45] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.043 second response time [22:55:45] PROBLEM - Apache HTTP on mw2146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.116 second response time [22:55:46] PROBLEM - HHVM rendering on mw2026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.125 second response time [22:55:49] oh lord [22:55:52] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50558 bytes in 0.042 second response time [22:55:56] Danny_B: that's just a cache machine [22:55:58] PROBLEM - HHVM rendering on mw2090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.190 second response time [22:55:58] PROBLEM - HHVM rendering on mw2033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.129 second response time [22:55:58] PROBLEM - HHVM rendering on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.139 second response time [22:55:58] PROBLEM - HHVM rendering on mw2184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.131 second response time [22:55:58] PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.077 second response time [22:55:59] PROBLEM - HHVM rendering on mw2131 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.128 second response time [22:55:59] PROBLEM - HHVM rendering on mw2192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 0.120 second response time [22:56:08] meltdown 
[22:56:15] bblack: keep hittin it and always with bsod [22:56:21] [14df5f79] 2015-05-27 22:55:41: Fatal exception of type MWException [22:56:22] Danny_B, you can run "host " in labs to find out what a given private network ip is [22:56:29] twentyafterfour: Did you delete the l10n cache for wmf6? [22:56:30] www.mediawiki.org : [9b67bb92] 2015-05-27 22:56:13: Fatal exception of type MWException [22:56:37] I getting those too ^ [22:56:40] yep [22:56:41] that one was cp3040 [22:56:45] it's being fixed [22:56:45] ohhh [22:56:46] Everyone: we know :) [22:56:48] Reedy: yes [22:56:53] ori, we don't have localisation for wmf6? [22:56:55] shhhit [22:57:01] shall we send out a tweet? https://webcache.googleusercontent.com/search?q=cache:slhxlSnogzMJ:https://wikitech.wikimedia.org/wiki/Incident_response ---> "Communicating with the public" [22:57:06] it's part of the procedure :-/ [22:57:08] forgot that [22:57:11] heh [22:57:24] I tended to wait till later, incase we had to rollback [22:57:33] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [22:57:41] Logged the message, Master [22:57:44] * domas looked with a bad eye [22:57:49] greg-g: Maybe a todo task... Check we have a l10n cache when running just sync-wikiversion [22:57:51] we're starting to recover now [22:57:53] Scap it's not an issue [22:57:55] back up [22:57:56] * Reedy files [22:57:57] woo i got a page load [22:58:01] we're loading [22:58:07] Reedy: thanks [22:58:08] back [22:58:10] ori: thanks [22:58:13] RECOVERY - Translation cache space on mw1184 is OK: HHVM_TC_SPACE OK TC sizes are OK [22:58:13] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [22:58:14] RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [22:58:14] RECOVERY - HHVM rendering on mw1063 is OK: HTTP OK: HTTP/1.1 200 OK - 66055 bytes in 0.222 second response time [22:58:15] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 66063 bytes in 0.229 second response time [22:58:15] RECOVERY - HHVM rendering on mw1215 is OK: HTTP OK: HTTP/1.1 200 OK - 66055 bytes in 0.339 second response time [22:58:15] StevenW: are you back? 
[22:58:16] RECOVERY - HHVM rendering on mw1104 is OK: HTTP OK: HTTP/1.1 200 OK - 66063 bytes in 0.375 second response time [22:58:16] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [22:58:17] RECOVERY - HHVM rendering on mw2043 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.480 second response time [22:58:17] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 66075 bytes in 0.181 second response time [22:58:18] No [22:58:22] RECOVERY - HHVM rendering on mw2026 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.501 second response time [22:58:22] RECOVERY - Apache HTTP on mw2146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.738 second response time [22:58:22] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16638 bytes in 0.198 second response time [22:58:26] StevenW: :( [22:58:28] RECOVERY - HHVM rendering on mw2033 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.442 second response time [22:58:28] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.621 second response time [22:58:28] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 66055 bytes in 0.358 second response time [22:58:28] RECOVERY - HHVM rendering on mw2025 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.397 second response time [22:58:28] RECOVERY - HHVM rendering on mw2192 is OK: HTTP OK: HTTP/1.1 200 OK - 66053 bytes in 0.621 second response time [22:58:29] RECOVERY - Apache HTTP on mw2201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.628 second response time [22:58:29] RECOVERY - HHVM rendering on mw2090 is OK: HTTP OK: HTTP/1.1 200 OK - 66054 bytes in 1.486 second response time [22:58:29] yeah I want to run a lot of sanity checks before letting any of the scap commands succeed [22:58:30] The MariaDB server is running with the --read-only option so it cannot execute this statement (10.64.16.157) DELETE FROM `pc030` WHERE keyname = 'enwiki:pcache:idoptions:21515774' AND exptime = '2015-05-27 08:53:19' [22:58:31] "Wikimedia sites are experiencing technical difficulties at the moment. Our engineers are working on it." [22:58:40] HaeB: too late [22:58:43] MaxSem: That was there earlier [22:58:45] lol you guys need an automated alerts channel [22:58:56] But then people would ignore it :/ [22:59:01] fwiw the error my bot got from the api is: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php [22:59:01] exactly [22:59:08] yes, we know [22:59:09] We hide the error from normal page views, but show it in the api? [22:59:23] canonical statement: [22:59:25] no localisation cache [22:59:30] HaeB: I always enjoy how "our engineers are working on it" is usually a lie! [22:59:31] the l10n thing + wider very short outage is from a bad revert, it's not the original problem [22:59:32] twentyafterfour's rollback broke things [22:59:43] i reverted but didn't merge before running sync-wikiversions, which did nothing [22:59:51] then i merged and ran sync-wikiversions, which brought the site back up [22:59:52] lol [22:59:54] that's it [23:00:04] [23:00:04] RoanKattouw, ^d, Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150527T2300). Please do the needful. 
[23:00:10] jouncebot_: not now [23:00:19] greg-g ;) "Wikimedia sites experienced a brief outage, but should be back to normal now" [23:00:23] HaeB: :) [23:00:27] well [23:00:36] bblack: not counting the other things going on [23:00:40] have any of those tshirts left, bd808? [23:00:47] we still have an issue from before that brief outage, which may still warrant a rollback [23:00:54] yes, a proper one [23:00:58] is the rollback procedure documented somewhere? [23:00:58] sure, but with a scap first ;) [23:01:01] * domas looks at ganglia and giggles [23:01:04] twentyafterfour: so, let's rebuild a wmf6 l10n cache, sync that out, then do this again [23:01:11] rollback is just sync-wikiversions but we need the cache [23:01:13] domas, we hate you too! :P [23:01:16] https://phabricator.wikimedia.org/T100573 [23:01:37] will a regular scap run rebuild the cache? what's the command to do it for a specific branch instead of all [23:01:51] yup, if you stage the wikiverisons change, then run scap, that'll do it [23:02:06] Reedy: won't that take forever rebuilding all the branches? [23:02:06] It'll mostly be a noop for branches that haven't changed [23:02:10] ah [23:02:17] cheapskate solution: copy wmf7's cache to wmf6? [23:02:18] it'll obviously do something, but it's minimal work [23:02:26] * Reedy eyes MaxSem [23:02:33] I like maxsem's plan ;) [23:02:47] but I'll do it the 'correct' way [23:03:09] I think there was talk of scap for just one version.. I don't know if a task was ever logged for that... bd808? [23:03:36] We have talked about it for sure [23:03:40] can someone check dewiki's logs for exception eab5c38a ? [23:04:01] jackmcbarn: Is it something new? [23:04:07] jackmcbarn, what do you think all these people here doing? [23:04:18] parsercache or l10ncache I presume [23:04:20] MaxSem: is that the same exception? [23:04:27] every exception is unique [23:04:35] yes, but i mean is it obviously the same trace [23:04:55] (03PS1) 1020after4: lets try 1.26wmf6 again, this time with l10ncache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214266 [23:05:03] haha oh dear, my inbox after that [23:05:18] (03CR) 1020after4: [C: 032] lets try 1.26wmf6 again, this time with l10ncache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214266 (owner: 1020after4) [23:05:24] (03Merged) 10jenkins-bot: lets try 1.26wmf6 again, this time with l10ncache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214266 (owner: 1020after4) [23:05:36] twentyafterfour: do you need to update the /srv/mediawiki-staging/php symlink? Other symlinks? [23:05:54] No [23:05:56] thcipriani: I don't think so [23:05:59] kk [23:06:08] the php one is sort of a hack :) [23:06:09] how about we sync-common this on mw1017 and check first this time? :P [23:06:18] as long as the symlink for each version is present it should be ok [23:06:26] Krenair: ok [23:06:49] Krenair: Why? We know exactly what went wrong :P [23:07:01] sync-common builds localization cache? [23:07:25] no [23:07:26] We just had 5-10 minutes of sweeping failure...is that deploy stuff? [23:07:27] ugh, good point, probably not [23:08:01] chasemp: it's over [23:08:13] OK reading back :) [23:09:07] so should I go ahead with scap? [23:09:23] Yup [23:09:32] !log twentyafterfour Started scap: scap, now with 10% less fail [23:09:33] if you're sure it won't break stuff again, I don't see why not... 
[23:09:34] If you've pulled your revert revert back on [23:09:37] Logged the message, Master [23:09:57] PROBLEM - Varnish HTTP text-backend on cp1065 is CRITICAL - Socket timeout after 10 seconds [23:10:03] ? [23:10:13] should I be worried about that ^ [23:10:17] 1065 is one of the backends I was refering to earlier [23:10:30] it's been in trouble throughout, it's part of the reason we're trying to rollback [23:10:55] and here goes the scary stuff: http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1432768192.869&target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.calls.rate [23:11:06] so, yes in general, but no don't stop your rollback [23:11:48] did someone API caching? [23:12:17] scap is building the localisation cache, this takes a long time unfortunately [23:12:29] when I query the repoinfo URL directly, I always seem to get cacheable headers from mediawiki in testing [23:12:51] I've still never really reproduced the hanging version of the request directly, only see it via-varnish [23:13:05] http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1432768368.63&target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.cpu.mean [23:13:06] PROBLEM - Varnish HTTP text-backend on cp1053 is CRITICAL - Socket timeout after 10 seconds [23:13:21] we see a slowdown, but is it called by the rate of requests? [23:13:31] cp105[23], and to a lesser extend cp105[14] are also affected [23:13:54] the problem URLs hash to one of the servers, kill it, it gets unhealthy, so the frontends start moving the requests to a different one and killing that one, etc... [23:14:30] 1065 is the most common victim though, and I think the original hash destination of the problem URL [23:15:45] ori/godog, what's the scale of calls rate in xhprof? [23:16:37] PROBLEM - Varnish HTTP text-backend on cp1052 is CRITICAL - Socket timeout after 10 seconds [23:17:39] ok scap is starting to sync [23:18:06] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL - Socket timeout after 10 seconds [23:18:17] RECOVERY - Varnish HTTP text-backend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 4.112 second response time [23:21:04] any chance the train deploy contained changes related to cookies? [23:21:15] I haven't really even started to dig into what all is in there [23:21:35] bblack, at least mmv haven't changed a slightest bit [23:21:44] MaxSem: I don't remember any major API changes in the past week related to caching... [23:22:11] they don't have to be /directly/ related to caching, kekeke [23:22:17] yeah... [23:22:32] this is in wmf7 right? [23:22:57] it could still be that the repoinfo thing is just a victim, and the cause is huge cache inefficiency from something else like the addition of some cache-busting query-arg to some other common URL [23:23:09] but repoinfo is the best hint I have to go on so far [23:23:51] (and its long execution time makes it suspicious, too. 
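The xhprof series being passed around above can also be pulled as raw data rather than a PNG; format=json and from= are standard Graphite render-API parameters, and the two-hour window here is arbitrary.
```
curl -s 'http://graphite.wikimedia.org/render/?target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.calls.rate&from=-2h&format=json'
```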
that alone could be causing problems potentially) [23:24:17] https://gerrit.wikimedia.org/r/#/c/209304/ would affect all query modules, but seems unlikely [23:24:18] 50% sync'd [23:24:54] honestly, I don't see any scary API changes either [23:26:05] we also deployed the ApiFeatureUsage extension, but iirc that was already on wmf7 wikis [23:26:25] legoktm, it's wmf7 that was broken [23:26:36] er, group2* [23:26:36] that's why we're rolling back to wmf6 [23:26:42] right [23:27:19] oh, it's only on group0 [23:28:16] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.024 second response time [23:28:24] wmf8 branch was today. wmf7 has been on group1 since yesterday [23:28:27] PROBLEM - Varnish HTTP text-backend on cp3040 is CRITICAL - Socket timeout after 10 seconds [23:28:47] but I rolled everything back to wmf6 just now (and it's almost done syncing )... [23:28:59] hmmm https://github.com/wikimedia/mediawiki/commit/f37cee996ee9459ae8cd6b23e604517d941daf18 [23:29:33] this would've been the most dangeorus change, but I don't see anything wrong with it [23:29:56] RECOVERY - Varnish HTTP text-backend on cp3040 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.176 second response time [23:31:39] !log twentyafterfour Finished scap: scap, now with 10% less fail (duration: 22m 07s) [23:31:43] Logged the message, Master [23:31:46] RECOVERY - Varnish HTTP text-backend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 9.309 second response time [23:31:47] RECOVERY - Varnish HTTP text-backend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.006 second response time [23:31:57] and just like that...10% less fail [23:32:41] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1316156 (10Dzahn) p:5Triage>3Normal [23:33:00] 6operations, 7database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1316158 (10Dzahn) p:5Triage>3Normal [23:33:08] so 22 minutes to scap one version of l10n cache? [23:33:44] Reedy: yep. it's always quite slow, though today it seems slower than it has been the past few weeks [23:33:57] that isn't too bad imho [23:34:02] considering the more servers etc [23:34:04] well it is 2.5gb [23:34:15] and nearly 500 servers now? [23:34:21] 466 [23:34:47] 1.165TB transferred in 22 minutes? [23:34:50] right, so, the idea that repoinfo long execution times could've been "ok" in the past and it's just a victim is kind of BS [23:35:08] because we still get tons of uncacheable hits for it due to logged-in users hitting it (all their hits are uncacheable in general) [23:35:26] so the long executions there are definitely a real piece of the puzzle [23:35:38] they're kinda random and not often, but they can take for-freaking-ever at times... [23:35:49] which consumes a backend connection and so-on [23:36:11] I logged one earlier which gave up after 5 full minutes with a 503 [23:36:18] geez [23:36:21] (a test from curl, through normal outside-world entry point) [23:36:54] so we need to look at the codepath for repoinfo, and diff between wmf6 and wmf7? [23:37:07] twentyafterfour, it hasn't chenged [23:37:11] oh [23:37:13] well hrm [23:37:21] yeah, the cache backends are still unhealthy, etc [23:37:23] request rate is what increased [23:37:28] twentyafterfour, Reedy anyone want to deploy https://gerrit.wikimedia.org/r/213010 ? 
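A back-of-the-envelope check of the figure above, using the 2.5 GB payload and 466 servers quoted in the discussion:
```
awk 'BEGIN { printf "%.3f TB\n", 2.5 * 466 / 1000 }'   # -> 1.165 TB pushed per full scap
```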
[23:37:53] AaronSchulz: I can [23:37:54] heh [23:38:06] MaxSem: more than would be expected from retries due to slow exe time? [23:38:11] bblack: what do you mean still unhealthy? [23:38:13] or has the avg exe time not really changed much? [23:38:28] bblack, no idea how our retries code works [23:38:45] * AaronSchulz is still impressed by the deep and rich blue of his terminal with his new IPS monitor [23:38:51] http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1432768192.869&target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.calls.rate [23:38:55] twentyafterfour: I mean the affected varnish backends are still unhealthy, because they're getting hammered by whatever this problem is [23:39:00] 1080p terminal or gtfo [23:39:08] Reedy, 4K [23:39:09] bblack, ^_^ is the request rate [23:39:09] bblack: still, even after this rollback? [23:39:17] so far, yeah [23:39:33] so wtf [23:39:50] of course if MaxSem is right that this is a request-rate increase because we're publishing refs to the repoinfo URL in places we never did before [23:39:56] and those places are themselves cacheable pages... [23:40:09] then yeah rolling back doesn't fix much in the short term heh [23:40:18] because the refs are cached? [23:40:34] by refs I mean "links browsers will fetch in normal cacheable pages, that weren't there before" [23:40:41] yeah [23:40:49] so rolling back doesn't remove those links from cache [23:41:00] I've got another test query on repoinfo running now that's been going for minutes [23:41:25] so that problem in particular certainly isn't gone, but perhaps it was always that way and wasn't a huge deal because it was being hit less frequently? [23:41:25] bblack, see what appserver it is, gdb, etc? [23:41:45] I can't really see where it goes, no :/ [23:42:07] PROBLEM - Varnish HTTP text-backend on cp1052 is CRITICAL - Socket timeout after 10 seconds [23:42:16] hell [23:42:27] disable MMV? [23:42:37] this is from earlier, the 5m one from before: [23:42:38] https://phabricator.wikimedia.org/P696 [23:43:50] isn't there a way to force it to a specific back-end? [23:44:05] X-Wikimedia-Debug [23:44:10] sends it to mw1017 [23:44:19] yeah [23:44:22] api too right? [23:44:40] yes [23:44:52] does it just randomly hang but usually runs fast? [23:45:10] oh I got one to run for 1.7s through debug-1017 [23:45:15] that's at least slow-er [23:45:20] yeah kinda random [23:45:26] RECOVERY - Varnish HTTP text-backend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 4.569 second response time [23:45:41] even 1s, if there was many such requests in the non-debug case, could stack up [23:46:57] PROBLEM - Varnish HTTP text-backend on cp1053 is CRITICAL - Socket timeout after 10 seconds [23:47:01] bblack: does the response contain any API warnings? [23:47:38] also, who broke Collection? :P [23:47:50] legoktm: just normal headers + json response data [23:48:19] ErrorException from line 361 of /srv/mediawiki/php-1.26wmf6/vendor/oojs/oojs-ui/php/Tag.php: PHP Fatal: exception 'OOUI\Exception' with message 'Potentially [23:48:19] unsafe 'href' attribute value. Scheme: ''; value: '/wiki/Vikihaber:2009/Ekim/20'.' in /srv/mediawiki/php-1.26wmf6/vendor/oojs/oojs-ui/php/Tag.php:317 [23:48:26] MaxSem: known [23:49:10] bblack: from what I see mmv is setting maxage and smaxage params and the API module is set for public caching, but on mediawiki.org I'm getting cache-control: private [23:49:33] oh you know what [23:49:33] legoktm, wen logged in? 
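One way to see the behaviour legoktm describes above is to fetch the filerepoinfo query and dump only the response headers; a logged-in (cookied) request comes back `Cache-Control: private`, and what an anonymous request gets is exactly what the uselang fix below is about. The URL shape mirrors the one quoted later in this log.
```
curl -s -o /dev/null -D - 'https://www.mediawiki.org/w/api.php?action=query&format=json&meta=filerepoinfo&smaxage=86400&maxage=86400' | grep -i '^cache-control'
```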
[23:49:36] MaxSem: yes [23:49:37] that's kinda normal I think, in the sense that we consume and rewrite cache headers going through varnish [23:49:40] because uselang=user [23:49:46] kinda expeeected [23:49:58] I've got one running forever on mw1017 now, just took enough random tries [23:50:07] RECOVERY - Varnish HTTP text-backend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 1.159 second response time [23:50:08] I'm testing now with: [23:50:09] time curl -v -H 'X-Wikimedia-Debug: 1' 'http://en.wikipedia.org/w/api.php?action=query&format=json&meta=filerepoinfo&smaxage=86400&maxage=86400' [23:50:16] I have one that's just been hung for nearly a minute right now [23:50:30] can you dissect it now? [23:50:38] DB deadlock? [23:50:40] how do I do that with hhvm? [23:50:49] there is a nice interactive debugger in hhvm [23:50:51] 6operations: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1316204 (10Krenair) Please associate projects with tasks so that people actually see them [23:51:03] https://www.mediawiki.org/wiki/Manual:How_to_debug [23:51:28] :) [23:51:46] hhvm -m debug --debug-host localhost --debug-port 8089 [23:52:14] MaxSem: Huh? What was wrong with MMV? [23:52:17] I'm assuming mw1017 has it turned on [23:52:18] Sorry I was afk [23:52:28] We haven't changed anything [23:52:29] marktraceur, it melts teh API [23:52:34] Wuh oh. [23:52:54] MaxSem: https://gerrit.wikimedia.org/r/214269 [23:53:34] legoktm, I think uselang=user was broken since winter [23:53:52] MaxSem: broken? [23:54:05] breaking caching [23:54:06] the debug-port stuff seems wrong [23:54:24] MaxSem: right. but this will make it less worse. [23:54:44] oh I see, I have to restart hhvm in some kind of debug mode first [23:55:12] cræp [23:56:31] marktraceur: https://gerrit.wikimedia.org/r/#/c/214269/ [23:57:34] Uhh [23:57:43] what does uselang:content do there? [23:58:17] legoktm: But we need to get the language that the current user understands...won't this break that? [23:58:42] AaronSchulz, is "1847 JobQueueGroup::__destruct: 1 buffered job(s) never inserted. in /srv/mediawiki/php-1.26wmf6/includes/jobqueue/JobQueueGroup.php on line 419" known? [23:59:12] marktraceur: what in meta=filerepoinfo requires the current user's language? [23:59:23] legoktm: Oh...erm [23:59:26] Maybe misread [23:59:40] I guess that's fine then
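A hedged illustration of what the uselang change in https://gerrit.wikimedia.org/r/214269 means on the wire: the same filerepoinfo query with the requester's interface language versus the wiki's content language. With uselang=content the URL no longer varies per user, so the smaxage hint can actually be shared in the caches. The exact parameter set MMV sends is an assumption based on this discussion.
```
curl -s 'https://en.wikipedia.org/w/api.php?action=query&format=json&meta=filerepoinfo&uselang=user&smaxage=86400&maxage=86400'      # varies per user, private when logged in
curl -s 'https://en.wikipedia.org/w/api.php?action=query&format=json&meta=filerepoinfo&uselang=content&smaxage=86400&maxage=86400'   # content language, shareable across users
```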