[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151124T0000). [00:00:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 2 failures [00:00:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 2 failures [00:00:51] andrewbogott: did you know apt::pin in puppet exists? [00:01:09] yes… trying to do this by hand first so I know what I want [00:01:17] maybe that’s pointless :) [00:02:03] am going to deploy https://gerrit.wikimedia.org/r/#/c/255046/ [00:02:21] andrewbogott: does it not work? what you have there seems right to me [00:02:31] idk how Package: * works out tho [00:02:34] If I apt-get install --dry-run designate [00:02:41] it seems to want to downgrade everything [00:04:49] chasemp: maybe I don’t understand what’s happening… I feel like it should just say ‘the latest package is installed’ since… it is [00:05:09] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [00:05:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 2 failures [00:05:26] andrewbogott: "This won't control the version, but the source preference if both packages has the same version. If you need to assign highest priority for the same package version in your local repo, list them at the top to the /etc/apt/sources.list file" [00:05:58] …ok then why is a naked apt-get install trying to downgrade? 
[00:07:09] I...think it's saying that the pinning has created a bad situation [00:08:22] andrewbogott: I think for what you want changing the order of sources.list is the 'right' thing [00:09:09] but also probably the best thing is to specify packages using apt::pin [00:09:16] What I would expect it to do is… 1) compile a list of available packages and versions 2) prefer the latest version 3) use pinning if a given package+version is available in two places [00:10:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 2 failures [00:10:52] chasemp: I should explain that I just now added ubuntucloud.pref and it changed the behavior not a whit [00:12:32] "Several instances of the same version of a package may be available when the sources.list(5) file contains references to more than one source. In this case apt-get(8) downloads the instance listed earliest in the sources.list(5) file. The APT preferences file does not affect the choice of instance, only the choice of version." [00:12:45] so I take that to mean what you are doing doesn't work the way you visualize [00:13:14] (03PS1) 10EBernhardson: Enable wikidatawiki for es labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255051 [00:13:35] andrewbogott: I gotta head out man, I'll kick this around w/ you later! sorry I couldn't be more helpful [00:13:46] I’ll poke at it more, thanks [00:14:24] (03CR) 10EBernhardson: [C: 032] Enable wikidatawiki for es labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255051 (owner: 10EBernhardson) [00:14:46] (03Merged) 10jenkins-bot: Enable wikidatawiki for es labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255051 (owner: 10EBernhardson) [00:15:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 199 seconds ago with 0 failures [00:15:15] Coren, interested in taking a “why is apt doing this?” shift? 
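The downgrade behaviour puzzled over above can be modelled roughly as follows. This is a minimal sketch of the candidate-selection rules as described in apt_preferences(5), not APT's actual code: versions are plain integer tuples rather than real Debian version strings, the repository names and priorities are hypothetical, and the log does not confirm what the pin file actually contained. The sketch also does not cover the same-version case quoted above, where the order of sources.list decides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Simplification: real Debian version comparison (epochs, tildes, revisions)
# is more involved than comparing integer tuples.
Version = Tuple[int, ...]

# Per apt_preferences(5), a priority at or above 1000 allows a downgrade.
DOWNGRADE_THRESHOLD = 1000


@dataclass(frozen=True)
class Candidate:
    source: str      # hypothetical repository label
    version: Version
    priority: int    # pin priority assigned to this version


def pick_candidate(candidates: Sequence[Candidate],
                   installed: Optional[Version] = None) -> Optional[Candidate]:
    """Return the version this simplified model would install, or None to keep the installed one."""
    allowed = [
        c for c in candidates
        # A downgrade is only considered when the pin priority is high enough.
        if installed is None
        or c.version >= installed
        or c.priority >= DOWNGRADE_THRESHOLD
    ]
    if not allowed:
        return None
    # Highest priority wins; the newer version only breaks ties.
    return max(allowed, key=lambda c: (c.priority, c.version))


installed = (2, 0)
strong_pin = [Candidate("distro", (2, 0), 500),
              Candidate("cloud-archive", (1, 5), 1001)]
weak_pin = [Candidate("distro", (2, 0), 500),
            Candidate("cloud-archive", (1, 5), 900)]

print(pick_candidate(strong_pin, installed))  # the older, strongly pinned version
print(pick_candidate(weak_pin, installed))    # the already-installed distro version
```

Under these assumptions a pin above the downgrade threshold on a repository carrying older versions would make a plain install want to downgrade, while a weaker pin would not. On a real host, `apt-cache policy <package>` shows the priorities APT has actually computed.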
[00:15:30] * andrewbogott should really be bugging people in California who aren’t eating dinner [00:15:37] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable wikidatawiki for elasticsearch labs replica (duration: 00m 27s) [00:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:02] Hi ops, does anyone here know why I can't get vpnc to tunnel all traffic? /etc/vpnc/default.conf is configured with group name from https://office.wikimedia.org/wiki/VPN_Setup#Full_Tunnel , but whatismyip.com still shows me with my ISP's IP addy [00:19:27] (and more to the point, I can't test payment API locally) [00:19:28] ejegg: OIT manages the vpn, not operations [00:19:34] Ah, thanks! [00:19:37] np :) [00:24:47] (03PS1) 10EBernhardson: Enable commonswiki writes for ES labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255055 [00:24:58] (03CR) 10EBernhardson: [C: 032] Enable commonswiki writes for ES labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255055 (owner: 10EBernhardson) [00:25:20] (03Merged) 10jenkins-bot: Enable commonswiki writes for ES labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255055 (owner: 10EBernhardson) [00:26:43] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable commonswiki for elasticsearch labs replica writes (duration: 00m 28s) [00:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:02] woah.. [00:36:07] !log krenair@tin Synchronized php-1.27.0-wmf.7/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: https://gerrit.wikimedia.org/r/#/c/255053/ (duration: 00m 28s) [00:36:11] bd808, some strange errors from sync-file [00:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:36] Krenair: about something other than mira? 
[00:37:06] no, I think it's all about mira actually
[00:37:21] I recall reading a task about this
[00:37:27] we still have unresolved permissions things there
[00:37:54] https://phabricator.wikimedia.org/T119165
[01:32:28] PROBLEM - very high load average likely xfs on ms-be1010 is CRITICAL: CRITICAL - load average: 274.47, 172.74, 83.18
[01:40:15] operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1827329 (Dzahn) Hi, what would you like us to do with aliases that had pbeaudette as the only user? one is "wikivoyage-announce".
[01:42:45] operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1744433 (Dzahn) removed from **web-tools@** - now just jalexander by himself removed from **wikiguides@** - now just jalexander by himself removed from **box6699@** - mdennis, gbyrd, archive01 remain (n...
[01:45:25] operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1827336 (Dzahn) done. i think another offboarding ticket for Garfield is needed
[01:45:43] operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1827337 (Dzahn) a:JGulingan
[01:45:49] operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1827338 (Dzahn) Open>Resolved
[01:51:38] (PS7) Dzahn: sudo journalctl: make missing restrictions obvious [puppet] - https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: JanZerebecki)
[01:54:02] (CR) Dzahn: [C: 2] "while this changes sudo lines, it's true that this never was a working limitation and it's better to make that obvious and also consistent" [puppet] - https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: JanZerebecki)
[01:57:28] (CR) Dzahn: "would you mind splitting this up into 2 things and first just add the framework/bootstrap with a short reason why and what's cool about it " [debs/wikistats] - https://gerrit.wikimedia.org/r/252249 (owner: Southparkfan)
[02:02:17] What: A relaxed hack day. Puppet community members and employees alike will gather online, get to know each other, and collaborate on pull requests for modules and other Puppet code.
[02:02:25] When: 4:00am - 4:00pm PST | Tuesday, December 15th, 2015
[02:02:32] Where: Online, in the #puppethack IRC channel on Freenode
[02:02:59] Who: Intermediate and advanced Puppet users, contributors, developers, module authors, docs writers, and anyone else contributing to the Puppet community
[02:26:02] !log l10nupdate@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 05m 34s)
[02:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:55] mutante: you could probably put that on the calendar in phab
[02:55:13] operations, Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1827371 (Peachey88)
[02:59:28] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: puppet fail
[03:17:22] (PS1) Aude: Enable data access for wikinews, meta-wiki, mediawiki.org and wikispecies [mediawiki-config] - https://gerrit.wikimedia.org/r/255063
[03:17:38] (CR) Aude: [C: -2] "not yet..."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/255063 (owner: 10Aude) [03:24:40] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds [03:26:52] (03PS3) 10Yuvipanda: Move debdeploy role into autolayout compatible names [puppet] - 10https://gerrit.wikimedia.org/r/254625 [03:27:34] (03CR) 10Yuvipanda: [C: 032 V: 032] "pcc says go" [puppet] - 10https://gerrit.wikimedia.org/r/254625 (owner: 10Yuvipanda) [03:27:51] (03PS2) 10Yuvipanda: extdist: Make role name autolayout compatible [puppet] - 10https://gerrit.wikimedia.org/r/254626 [03:30:04] (03CR) 10Yuvipanda: [C: 032] extdist: Make role name autolayout compatible [puppet] - 10https://gerrit.wikimedia.org/r/254626 (owner: 10Yuvipanda) [03:33:32] (03PS3) 10Yuvipanda: Make redisdb role conform to autolayout [puppet] - 10https://gerrit.wikimedia.org/r/254627 [03:38:18] (03CR) 10Yuvipanda: [C: 032] "pcc says go" [puppet] - 10https://gerrit.wikimedia.org/r/254627 (owner: 10Yuvipanda) [03:39:50] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [03:40:39] (03PS1) 10Yuvipanda: Rename phragile role and mark it as labs-only [puppet] - 10https://gerrit.wikimedia.org/r/255066 [03:41:03] (03Abandoned) 10Yuvipanda: admin: Provision all stat** users on bastion too [puppet] - 10https://gerrit.wikimedia.org/r/242779 (owner: 10Yuvipanda) [03:41:32] (03Abandoned) 10Yuvipanda: labs: Rename and move DNS roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/254495 (owner: 10Yuvipanda) [03:49:28] (03CR) 10Yuvipanda: [C: 032] "There seem to be nobody using this at the moment, since" [puppet] - 10https://gerrit.wikimedia.org/r/255066 (owner: 10Yuvipanda) [03:54:23] (03PS1) 10Yuvipanda: labs: Move labmon role to labs::graphite role [puppet] - 10https://gerrit.wikimedia.org/r/255067 [03:55:18] (03CR) 10jenkins-bot: [V: 04-1] labs: Move labmon role 
to labs::graphite role [puppet] - 10https://gerrit.wikimedia.org/r/255067 (owner: 10Yuvipanda) [04:00:38] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [04:00:42] (03PS2) 10Yuvipanda: labs: Move labmon role to labs::graphite role [puppet] - 10https://gerrit.wikimedia.org/r/255067 [04:00:44] (03PS1) 10Yuvipanda: Move labs::dns role's hiera file to proper location [puppet] - 10https://gerrit.wikimedia.org/r/255068 [04:02:20] (03CR) 10Yuvipanda: [C: 032] "pcc says go" [puppet] - 10https://gerrit.wikimedia.org/r/255067 (owner: 10Yuvipanda) [04:03:28] (03PS1) 10KartikMistry: CX: Fix article-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255069 [04:07:04] (03CR) 10Yuvipanda: [C: 032] Move labs::dns role's hiera file to proper location [puppet] - 10https://gerrit.wikimedia.org/r/255068 (owner: 10Yuvipanda) [04:08:00] (03PS1) 10Yuvipanda: labs: Move db_aliases template into role module [puppet] - 10https://gerrit.wikimedia.org/r/255070 [04:09:54] (03CR) 10Yuvipanda: [C: 032] labs: Move db_aliases template into role module [puppet] - 10https://gerrit.wikimedia.org/r/255070 (owner: 10Yuvipanda) [04:16:28] (03PS1) 10Yuvipanda: Fixup and move 'mediawiki_singlenode' role to role/deprecated [puppet] - 10https://gerrit.wikimedia.org/r/255071 [04:17:54] (03CR) 10Yuvipanda: [C: 032] Fixup and move 'mediawiki_singlenode' role to role/deprecated [puppet] - 10https://gerrit.wikimedia.org/r/255071 (owner: 10Yuvipanda) [04:22:08] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0] [04:26:08] (03PS1) 10Yuvipanda: wdq: Rename role to fit with autolayout [puppet] - 10https://gerrit.wikimedia.org/r/255072 [04:29:28] (03CR) 10Yuvipanda: [C: 032] wdq: Rename role to fit with autolayout [puppet] - 10https://gerrit.wikimedia.org/r/255072 (owner: 10Yuvipanda) [04:31:08] (03Abandoned) 
KartikMistry: CX should default to using rest.wm.o, not parsoid-lb [puppet] - https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: KartikMistry)
[04:32:43] (PS2) KartikMistry: CX: Fix article-recommender-1 campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/255069 (https://phabricator.wikimedia.org/T118033)
[04:33:13] (PS1) Yuvipanda: labsvagrant: Move role to deprecated [puppet] - https://gerrit.wikimedia.org/r/255073
[04:34:30] I wonder if I should let that stew
[04:34:39] RECOVERY - HTTP 5xx reqs/min anomaly on graphite1001 is OK: OK: No anomaly detected
[04:35:10] I don't really have the energy to redo all the tools roles though :|
[04:47:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:47:18] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:47:41] (fixed)
[04:49:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:49:08] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[04:52:18] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:01:08] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:04:18] PROBLEM - puppet last run on wtp2003 is CRITICAL: CRITICAL: puppet fail
[05:23:59] operations, WMF-Legal, domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1827435 (MZMcBride) >>! In T88861#1693869, @Slaporte wrote: > We now have wikipedia.lol registered. Why?
[05:24:28] operations, WMF-Legal, domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1827437 (ori) >>!
In T88861#1827435, @MZMcBride wrote: > Why? For the lulz. [05:25:28] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [05:30:39] RECOVERY - puppet last run on wtp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:40:00] PROBLEM - Disk space on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:08] PROBLEM - swift-object-updater on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:09] PROBLEM - Check size of conntrack table on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:19] PROBLEM - swift-container-server on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:28] PROBLEM - swift-account-server on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:38] PROBLEM - DPKG on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:38] PROBLEM - swift-account-replicator on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:39] PROBLEM - RAID on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:49] PROBLEM - swift-container-auditor on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:58] PROBLEM - configured eth on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:59] PROBLEM - swift-container-replicator on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:10] PROBLEM - salt-minion processes on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:18] PROBLEM - swift-object-auditor on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:18] PROBLEM - swift-container-updater on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:41:29] PROBLEM - dhclient process on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:38] PROBLEM - swift-account-auditor on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:38] PROBLEM - swift-object-replicator on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:49] PROBLEM - swift-object-server on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:41:49] PROBLEM - swift-account-reaper on ms-be1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:57:18] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [05:58:17] <_joe_> !log powercycling ms-be1010, xfs kernel lockups [05:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:00:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [5000000.0] [06:01:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [5000000.0] [06:01:20] RECOVERY - DPKG on ms-be1010 is OK: All packages OK [06:01:20] RECOVERY - swift-account-replicator on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:01:20] RECOVERY - RAID on ms-be1010 is OK: OK: optimal, 14 logical, 14 physical [06:01:39] RECOVERY - swift-container-auditor on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:01:40] RECOVERY - configured eth on ms-be1010 is OK: OK - interfaces up [06:01:48] RECOVERY - swift-container-replicator on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:01:59] RECOVERY - salt-minion processes on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:01:59] RECOVERY - 
swift-object-auditor on ms-be1010 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:01:59] RECOVERY - swift-container-updater on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:02:18] RECOVERY - dhclient process on ms-be1010 is OK: PROCS OK: 0 processes with command name dhclient [06:02:19] RECOVERY - swift-object-replicator on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:02:19] RECOVERY - swift-account-auditor on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:02:30] RECOVERY - swift-object-server on ms-be1010 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:02:30] RECOVERY - swift-account-reaper on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:02:39] RECOVERY - Disk space on ms-be1010 is OK: DISK OK [06:02:39] RECOVERY - very high load average likely xfs on ms-be1010 is OK: OK - load average: 11.50, 4.20, 1.53 [06:02:39] RECOVERY - swift-object-updater on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:02:40] _joe_: I see those happen from time to time. Is that a known bug or just something we have? [06:02:49] RECOVERY - Check size of conntrack table on ms-be1010 is OK: OK: nf_conntrack is 8 % full [06:02:50] RECOVERY - swift-container-server on ms-be1010 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:02:59] RECOVERY - swift-account-server on ms-be1010 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:03:09] <_joe_> AaronSchulz: what do you mean? 
it's a known bug with xfs we have :)
[06:03:34] heh
[06:03:46] <_joe_> since those swift machines are still on precise, I doubt any kernel bug would be useful to report at this point
[06:06:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[06:12:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5000000.0]
[06:14:56] (PS1) Yuvipanda: archiva: Move nginx proxy template into module [puppet] - https://gerrit.wikimedia.org/r/255075
[06:16:28] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[06:17:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[06:17:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0]
[06:17:59] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[06:21:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [5000000.0]
[06:28:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [5000000.0]
[06:30:18] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:49] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:49] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:09] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:29] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:49] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 5 failures
[06:31:59] PROBLEM -
puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [06:39:09] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [06:46:07] (03PS1) 10Aaron Schulz: Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 [06:46:23] (03CR) 10jenkins-bot: [V: 04-1] Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 (owner: 10Aaron Schulz) [06:55:59] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:56:18] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:29] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:56:29] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:08] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:09] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:09] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:24] (03PS1) 10Mdann52: Add rights to CU+OS groups on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255078 (https://phabricator.wikimedia.org/T119446) [06:57:29] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:57:42] (03CR) 10jenkins-bot: [V: 04-1] Add rights to CU+OS 
groups on en.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255078 (https://phabricator.wikimedia.org/T119446) (owner: Mdann52)
[07:05:38] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds
[07:05:53] (PS1) Yuvipanda: Move all monitoring groups to one file [puppet] - https://gerrit.wikimedia.org/r/255080
[07:06:14] _joe_: ^ (for monitoring_groups)
[07:06:19] I'll merge a bunch of these tomorrow I guess
[07:06:47] <_joe_> YuviPanda: add me as a reviewer if you'd like a CR
[07:07:28] _joe_: sure, although I don't want to distract you from etcd :D
[07:07:43] <_joe_> YuviPanda: that is actually very, very boring work
[07:07:58] _joe_: so are the patches I'm doing :D
[07:08:17] <_joe_> YuviPanda: as boring as backporting 25 packages?
[07:08:26] oh definitely not
[07:08:31] I can't do that as a coping mechanism no
[07:08:40] * YuviPanda isn't really sure debs are worth it for go packages
[07:08:47] * YuviPanda doesn't have a legit alternative yet though
[07:08:53] maybe they are for things like etcd.
[07:09:09] <_joe_> YuviPanda: I think Stapelberg really devised a braindead scheme for go packaging, but maybe it's me...
[07:09:27] Stapelberg?
[07:09:58] <_joe_> https://people.debian.org/~stapelberg/
[07:10:01] also one of the biggest advantages of debian packaging (security updates for linked libraries) is lost since it's all static...
[07:10:09] so you gotta rebuild your binary anyway
[07:10:16] <_joe_> YuviPanda: yeah, that's my point
[07:11:24] so IMO long term we should use something like scap3 instead of debian packages
[07:11:49] <_joe_> for go apps? yeah I agree
[07:11:59] <_joe_> or even simply git-fat and archiva ;)
[07:12:04] yeah
[07:12:12] I should move kubernetes to that some time
[07:12:53] what is archiva?
[07:13:07] <_joe_> archiva is actually a floss version of maven central
[07:13:21] actually yeah, why archiva?
[07:13:24] <_joe_> it's meant for java artifacts
[07:13:28] archiva makes sense for jars...
[07:13:39] i get it now, i watched one of the short videos
[07:13:42] <_joe_> YuviPanda: well, it's what we're using with git-fat as a provider atm
[07:14:51] ah
[07:14:53] ok
[07:14:54] fair enough
[07:15:03] _joe_: do we have puppet abstractions for this?
[07:15:51] <_joe_> YuviPanda: we use... trebuchet and git-fat atm
[07:16:20] ah
[07:16:26] so... no abstractions I can use yet
[07:16:28] ok
[07:17:34] <_joe_> but well, it's workable
[07:18:08] well, in prod, since you already have a deployment host.
[07:18:20] I can set one up in tools but I guess I can wait for scap3
[07:18:44] I'll play with different things in the mattermost module
[07:18:55] (Abandoned) Mdann52: Add rights to CU+OS groups on en.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255078 (https://phabricator.wikimedia.org/T119446) (owner: Mdann52)
[07:20:38] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds
[07:22:03] (PS1) Yuvipanda: k8s: Remove standalone role [puppet] - https://gerrit.wikimedia.org/r/255081
[07:26:18] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[07:27:29] (PS1) Yuvipanda: role: Move quarry to use autolayout [puppet] - https://gerrit.wikimedia.org/r/255082
[07:35:49] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[07:41:28] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[07:43:07] (CR) Nikerabbit: [C: 1] CX: Fix article-recommender-1 campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/255069 (https://phabricator.wikimedia.org/T118033) (owner: KartikMistry)
[07:45:09] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds [07:50:49] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 7 below the confidence bounds [08:00:18] RECOVERY - HTTP 5xx reqs/min anomaly on graphite1001 is OK: OK: No anomaly detected [08:07:38] (03PS2) 10Aaron Schulz: Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 [08:24:44] !log dbstore1002 (analytics-slave) mysql is not responding, forcing restart [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:11] [10387456.618957] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [08:51:40] 6operations, 10ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#1827597 (10jcrespo) 3NEW [08:56:50] (03PS6) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [09:01:18] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [09:12:56] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Backport etcd 2.2 to jessie - https://phabricator.wikimedia.org/T118830#1827628 (10Joe) The easiest way to do this is to just take the etcd package in stretch and include it in our repository as-is. This will mean that we should, in order: 1)... 
[09:13:15] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1827629 (10Joe) p:5Normal>3High [09:28:20] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:46] (03PS3) 10Filippo Giunchedi: monitoring/graphite: Add until time limit argument [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [09:32:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] monitoring/graphite: Add until time limit argument [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [09:34:14] apergos: what do you think re: https://gerrit.wikimedia.org/r/#/c/254128/1 btw? [09:35:42] godog: yep [09:35:59] (03CR) 10ArielGlenn: [C: 031] deployment: add redis socket_connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/254128 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [09:36:50] kk, I'll followup with another review to actually set the timeout [09:37:06] (03PS2) 10Filippo Giunchedi: deployment: add redis socket_connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/254128 (https://phabricator.wikimedia.org/T118380) [09:37:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment: add redis socket_connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/254128 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [09:38:19] (03PS1) 10Giuseppe Lavagetto: etcd: remove package etcdctl [puppet] - 10https://gerrit.wikimedia.org/r/255088 (https://phabricator.wikimedia.org/T118830) [09:42:52] (03PS1) 10Filippo Giunchedi: deployment: set socket_connect_timeout to 2s [puppet] - 10https://gerrit.wikimedia.org/r/255090 (https://phabricator.wikimedia.org/T118380) [09:42:57] apergos: ^ [09:43:31] also _joe_ if interested [09:43:41] (03CR) 10ArielGlenn: [C: 031] deployment: set 
socket_connect_timeout to 2s [puppet] - 10https://gerrit.wikimedia.org/r/255090 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [09:44:23] (03CR) 10Giuseppe Lavagetto: [C: 031] "JDI!" [puppet] - 10https://gerrit.wikimedia.org/r/255090 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [09:45:22] haha thanks [09:45:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment: set socket_connect_timeout to 2s [puppet] - 10https://gerrit.wikimedia.org/r/255090 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [09:45:53] now let's see how much stuff I broke [09:47:43] heh [09:55:29] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:29] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:21] that's me, fixing shortly [09:56:53] rookie mistake, the pyredis option we're using has been introduced in 2014 [09:59:45] (03PS1) 10Filippo Giunchedi: deployment: fix pyredis timeout argument and timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/255092 (https://phabricator.wikimedia.org/T118380) [10:00:41] 6operations, 10Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1827688 (10hashar) 5Open>3declined a:3hashar As I said on {T118156} >>! In T118156#1813516, @hashar wrote: > gitblit is being phased out in favor of Phabricator Diffusion. We have no plan to upgrade gitblit so no... [10:01:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: (null) [10:02:28] oh ffs, expect a shower of alerts [10:03:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[10:03:34] (03PS1) 10Filippo Giunchedi: Revert "monitoring/graphite: Add until time limit argument" [puppet] - 10https://gerrit.wikimedia.org/r/255094 [10:03:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp1057 is CRITICAL: (null) [10:03:45] (03PS2) 10Filippo Giunchedi: Revert "monitoring/graphite: Add until time limit argument" [puppet] - 10https://gerrit.wikimedia.org/r/255094 [10:03:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL: (null) [10:03:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4020 is CRITICAL: (null) [10:03:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp2004 is CRITICAL: (null) [10:03:47] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: (null) [10:03:47] PROBLEM - carbon-relay queue full on graphite1001 is CRITICAL: (null) [10:03:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4018 is CRITICAL: (null) [10:03:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "monitoring/graphite: Add until time limit argument" [puppet] - 10https://gerrit.wikimedia.org/r/255094 (owner: 10Filippo Giunchedi) [10:04:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp2024 is CRITICAL: (null) [10:04:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL: (null) [10:04:17] PROBLEM - MediaWiki jobs not being inserted on graphite1001 is CRITICAL: (null) [10:04:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: (null) [10:04:18] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL: (null) [10:04:18] PROBLEM - HTTP 5xx reqs/min threshold on graphite1001 is CRITICAL: (null) [10:04:36] PROBLEM - swift eqiad-prod object availability on graphite1001 is CRITICAL: (null) [10:04:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: (null) [10:04:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp1059 is CRITICAL: (null) [10:04:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is 
CRITICAL: (null) [10:04:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp3049 is CRITICAL: (null) [10:04:47] PROBLEM - MediaWiki jobs not dequeued on graphite1001 is CRITICAL: (null) [10:04:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: (null) [10:05:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: (null) [10:05:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp1052 is CRITICAL: (null) [10:05:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp1060 is CRITICAL: (null) [10:05:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp1048 is CRITICAL: (null) [10:05:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:05:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp1047 is CRITICAL: (null) [10:05:37] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: (null) [10:05:45] sigh and can't kick icinga-wm, sorry for the noise [10:05:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp2009 is CRITICAL: (null) [10:05:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp3032 is CRITICAL: (null) [10:06:06] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: (null) [10:07:08] !log stop ircecho on neon temporarily while https://gerrit.wikimedia.org/r/#/c/255094/ applies [10:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:55] RECOVERY - HTTP 5xx reqs/min threshold on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:21:55] RECOVERY - Varnishkafka Delivery Errors per minute on cp3010 is OK: OK: Less than 80.00% above the threshold [0.0] [10:21:56] RECOVERY - swift eqiad-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [10:21:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp2003 is OK: OK: Less than 80.00% above the threshold [0.0] 
[10:21:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp3040 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:05] RECOVERY - Varnishkafka Delivery Errors per minute on cp3047 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:05] RECOVERY - Incoming network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [10:22:07] RECOVERY - carbon-cache write error on graphite1001 is OK: OK: Less than 1.00% above the threshold [1.0] [10:22:07] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [10:22:15] RECOVERY - Varnishkafka Delivery Errors per minute on cp4019 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp3042 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:16] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [10:22:17] RECOVERY - MediaWiki jobs not being inserted on graphite1001 is OK: OK: Less than 1.00% under the threshold [1.0] [10:22:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp2023 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp2014 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp1065 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK: OK: Less than 80.00% above the threshold [0.0] [10:22:37] RECOVERY - MediaWiki jobs not dequeued on graphite1001 is OK: OK: Less than 1.00% under the threshold [1.0] [10:22:37] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less 
than 10.00% above the threshold [75000000.0] [10:27:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:27:13] (03PS2) 10Filippo Giunchedi: deployment: fix pyredis timeout argument and timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/255092 (https://phabricator.wikimedia.org/T118380) [10:27:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment: fix pyredis timeout argument and timeout to 5s [puppet] - 10https://gerrit.wikimedia.org/r/255092 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [10:27:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:28:16] !log dropping user_daily_contribs view from labsdb hosts [10:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:30:25] we should merge shard s5 and s6 into one [10:49:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks pretty much ready, minor comments about using hiera to get implicit lookups and setting up master and serverIDs" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff) [10:51:24] RECOVERY - Varnishkafka Delivery Errors per minute on cp2020 is OK: OK: Less than 80.00% above the threshold [0.0] [10:51:59] 6operations, 10Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1827754 (10Paladox) Yes but you carn't view the raw file in phabricator if the patch is merged. This update should fix what was broken when we updated gitblit to 1.6.2. [10:57:59] 6operations, 10Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1827770 (10Aklapper) >>! In T119409#1827754, @Paladox wrote: > Yes but you carn't view the raw file in phabricator if the patch is merged. Yes but see T118156#1813516 [11:00:45] 6operations, 10Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1827772 (10Paladox) Yes. 
But why was gitblit updated when it was deprecated. Yes I know it is being deprecated it was updated which broke it. You carn't view a raw file when a patch still has to be merged: [11:08:25] 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1827773 (10jcrespo) Finally, I have deleted all tables and view referencing this table and I am ready to close. One last thing, @ori, I saw the table bing recreat... [11:09:52] (03CR) 10Filippo Giunchedi: "I've reverted this change, it caused all graphite checks to report (null) in icinga and go CRITICAL with consequent alarm storm, it'll nee" [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [11:15:40] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1827787 (10Luke081515) 5Open>3Resolved Replag is gone: http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag Thanks for this quick fix. [11:16:39] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1827790 (10jcrespo) p:5Unbreak!>3High The previous tables have been checked. Things seem back to normal. [11:18:49] 6operations, 10Salt, 5Patch-For-Review: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1827793 (10fgiunchedi) 5Open>3stalled "fixed" as in the `socket_connect_timeout` option wasn't introduced until pyredis 2.10 (that means jessie) so we are passing `socket_timeout` to s... [11:19:16] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1827795 (10jcrespo) This is not 100% fixed to me, some checks and actionables I mentioned are pending, although lower priority but I suppose we can track those on... 
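Note on the pyredis fix discussed above (T118380): the log says `socket_connect_timeout` was not introduced until pyredis 2.10, so older hosts were given `socket_timeout` instead. A minimal sketch of that version gate follows. The helper name `redis_timeout_kwargs`, the version-string parsing, and the 5 s default are assumptions for illustration, not the contents of the actual puppet change.

```python
def redis_timeout_kwargs(redis_version, timeout_s=5.0):
    """Return the timeout keyword argument a given redis-py version accepts.

    Hypothetical helper: per the log, socket_connect_timeout only exists
    from redis-py 2.10 onwards, so older versions fall back to socket_timeout.
    """
    # Compare only major.minor, e.g. "2.10.3" -> (2, 10).
    major_minor = tuple(int(part) for part in redis_version.split(".")[:2])
    if major_minor >= (2, 10):
        return {"socket_connect_timeout": timeout_s}
    return {"socket_timeout": timeout_s}


if __name__ == "__main__":
    # Illustrative version strings only.
    print(redis_timeout_kwargs("2.7.2"))
    print(redis_timeout_kwargs("2.10.3", timeout_s=2.0))
```

The returned dict could be splatted into a client constructor call. Note that `socket_timeout` also bounds reads on an established connection, so the fallback is broader than a connect-only timeout.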
[11:39:14] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827831 (10IKhitron) Yesterday was another run day. Still nothing on hewiki. [11:46:12] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827837 (10Krenair) It's still the same error as last time, @IKhitron. [11:48:09] !log rebooting lvs3003 with linux 4.2.6-1~bpo8+wmf1 [11:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:21] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827842 (10IKhitron) I thought it was fixed from "I think the issue has improved". [11:48:24] Hey godog, just backloged the deploy thing [11:48:57] joal: you mean the graphite check? [11:49:01] yessir [11:49:13] Doesn't make sense to me :( [11:49:29] I'm gonna investigate [11:50:20] joal: thanks! let me know if you get stuck somewhere [11:50:41] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827843 (10Krenair) If we thought the issue was fully fixed again, this would have been set back to resolved fixed. [11:51:58] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827845 (10IKhitron) Again, I thought it wasn't closed because of enwiki problem only. 
[11:56:34] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:46] RECOVERY - Host lvs3003 is UP: PING WARNING - Packet loss = 61%, RTA = 89.48 ms [12:07:35] PROBLEM - pybal on lvs3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [12:09:26] RECOVERY - pybal on lvs3003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:13:03] rebooting it again, ignore lvs3003 warnings until I !log again [12:14:23] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827896 (10jcrespo) No, I've been monitoring the errors to understand the impact (it is impacting several wikis).... [12:14:55] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:05] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 88.30 ms [12:18:40] !log stopping pybal on lvs3001, switching its traffic to lvs3003 [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:55] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [12:24:54] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:28:21] <_joe_> I know you said to ignore those messages, it's still scary every time I read them :P [12:29:31] heh [12:29:38] oh puppet started pybal dammit :) [12:30:46] * paravoid patiently awaits for tunable MED in pybal [12:30:49] this friday, right mark? :P [12:31:23] if y'all leave me alone, perhaps :P [12:34:36] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [12:34:47] 4.3.0 seems stable [12:39:09] Hey godog, do you have a minute? 
[12:39:17] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1827929 (10IKhitron) I see. Thanks. [12:39:28] (03PS2) 10DCausse: Add 2 payloads map fields to CirrusSearchRequestSet avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252957 (https://phabricator.wikimedia.org/T118570) [12:39:53] (03CR) 10jenkins-bot: [V: 04-1] Add 2 payloads map fields to CirrusSearchRequestSet avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252957 (https://phabricator.wikimedia.org/T118570) (owner: 10DCausse) [12:41:47] (03PS1) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 [12:41:57] (03PS3) 10DCausse: Add 2 payloads map fields to CirrusSearchRequestSet avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252957 (https://phabricator.wikimedia.org/T118570) [12:42:59] (03PS2) 10KartikMistry: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) [12:44:11] (03PS4) 10DCausse: Add 2 payloads map fields to CirrusSearchRequestSet avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252957 (https://phabricator.wikimedia.org/T118570) [12:45:03] ok, starting pybal on 3001 again, wanna go grab something to eat [12:45:08] (03CR) 10DCausse: [C: 04-1] "Depends on : Ia57d53c and Icc0f92b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252957 (https://phabricator.wikimedia.org/T118570) (owner: 10DCausse) [12:46:05] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:03:25] PROBLEM - Disk space on restbase1007 is CRITICAL: DISK CRITICAL - free space: /var 70540 MB (3% inode=99%) [13:05:36] <_joe_> godog, mobrovac ^^ [13:19:48] thanks, looking [13:25:03] stopping pybal on lvs3001 again [13:28:30] 
joal: yup, what's up? [13:30:24] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [13:38:43] (03PS1) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [13:38:45] (03PS1) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [13:38:47] (03PS1) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [13:38:49] (03PS1) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [13:39:19] ACKNOWLEDGEMENT - Disk space on restbase1007 is CRITICAL: DISK CRITICAL - free space: /var 53253 MB (3% inode=99%): Filippo Giunchedi restbase1001 bootstrapping [13:40:12] (03CR) 10jenkins-bot: [V: 04-1] varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [13:40:26] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:40:36] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:40:43] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:47:00] jenkins-bot: :P [13:48:19] (03PS2) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [13:48:21] (03PS2) 10BBlack: misc-cluster 2layer 
refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [13:48:23] (03PS2) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [13:48:24] PROBLEM - Disk space on lvs3003 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied [13:48:25] (03PS2) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [13:50:11] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:51:10] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:53:34] /sys/kernel/debug/tracing? lol [13:54:27] debugfs [13:54:41] hey godog [13:55:08] I wonder if the errors could not come from deployment order based on merge [13:55:32] godog: --^ [13:55:55] RECOVERY - Disk space on lvs3003 is OK: DISK OK [13:56:01] godog: I have tested and re-tested the python with various metrics, and I didn't have any error [13:56:56] !log upgrade diamond to 3.5-4 on precise hosts in esams [13:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:24] esams precise hosts? [13:57:26] that's just hooft, right? 
[13:57:43] waiting on a port of the ganglia stuff to systemd, iirc [13:58:16] paravoid: in production only hooft yes, there's also ms-be hosts there [13:58:22] oh right [13:58:47] joal: let me check again the logs on neon (the icinga host) [13:58:57] sure [13:59:11] Can you show in the mean time godog, trying to better understand : [14:00:50] (03PS3) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [14:00:52] (03PS3) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [14:01:29] hooft should not be hard to upgrade I think? [14:04:11] lol [14:04:19] "If you have questions on this topic that are not answered by the documentation, you may wish to go on #wikimedia-databases and talk to meta:User:Jynus, who is WMF's Senior Database Administrator and an expert on MySQL performance" [14:05:51] joal: this is the part of the diff applied by puppet on neon, looks correct to me though https://phabricator.wikimedia.org/P2351 [14:06:49] godog: these changes are dependant on the change on python [14:07:57] yeah the python changes were applied too together by python [14:08:03] by puppet... [14:08:39] hm :( [14:09:51] godog: icinga logs tells you a bit more or nothing ? [14:13:15] joal: not a whole lot, I've commented on the paste, I'm surprised too by just "(null)" from icinga, the docs apparently say "The internal call to execvp didn't return anything." [14:16:41] akosiaris: did you by any chance disable puppet on the maps-test hosts in late Oct for something related to postgres? and if so, can they be re-enabled now? 
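Note on the "(null)" check output discussed above: one way to narrow this down is to run the check command by hand, with the same arguments the monitoring host would pass, and look at the exit code and first output line. The sketch below assumes only the standard Nagios plugin convention (exit codes 0 to 3). The function name `run_check` is hypothetical, and this does not reproduce how Icinga itself spawns plugins.

```python
import subprocess
import sys

# Standard Nagios/Icinga plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout_s=30):
    """Run a check command and return (state, first line of output).

    Hypothetical debugging helper: a command that cannot be executed at
    all is reported as UNKNOWN together with the OS error, rather than
    as an empty result.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except OSError as exc:
        return "UNKNOWN", "could not execute {}: {}".format(argv[0], exc)
    except subprocess.TimeoutExpired:
        return "UNKNOWN", "timed out after {}s".format(timeout_s)

    lines = proc.stdout.strip().splitlines()
    first_line = lines[0] if lines else "(no output)"
    return NAGIOS_STATES.get(proc.returncode, "UNKNOWN"), first_line


if __name__ == "__main__":
    # Pass the real check command line as arguments, e.g. the plugin path
    # followed by the options from the Icinga command definition.
    state, message = run_check(sys.argv[1:] or [sys.executable, "-c", "print('OK: demo')"])
    print(state, "-", message)
```

If the hand-run command returns a sensible state while Icinga still shows "(null)", that would point at the command definition or the Icinga process rather than at the Python check itself.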
[14:16:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:42] joal: could be also that icinga doesn't like the command definition being changed without a restart [14:19:40] akosiaris, could you comment on https://gerrit.wikimedia.org/r/#/c/254490/ [14:22:28] !log upgrade diamond to 3.5-4 on precise hosts in ulsfo [14:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:53] 6operations: Incomplete puppet fact generation for packages help back for adding new binary packages - https://phabricator.wikimedia.org/T119503#1828019 (10MoritzMuehlenhoff) 3NEW [14:24:03] !log upgrade diamond to 3.5-4 on precise hosts in eqiad [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:27] godog: what's the state on rb1007? [14:26:30] * mobrovac looking [14:26:34] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 47663 bytes in 9.787 second response time [14:26:35] should be ok though [14:26:41] compaction etc [14:26:47] happened this weekend on rb1008 [14:27:13] 6operations: Incomplete puppet fact generation for packages held back for adding new binary packages - https://phabricator.wikimedia.org/T119503#1828026 (10MoritzMuehlenhoff) [14:27:15] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Puppet has 1 failures [14:27:25] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures [14:27:50] mobrovac: running short on disk space on 1007 in this case is due to decommissioning 1001 as it is moving its data in rack 'a' [14:28:01] ah damn [14:29:20] godog: it's gonna run out of space [14:30:28] godog: I'm ask for ottomata help here [14:30:36] godog: https://phabricator.wikimedia.org/P2352 [14:30:57] we need to do something about this [14:31:01] asap [14:31:04] Can't figure out what the stuff is (python returns a result o nteh few examples I have tested) [14:31:15] RECOVERY - 
puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:31:24] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:31:32] mobrovac: *nod* I've seen cassandra trying to free up disk space if it running short [14:33:02] hey joal, any particular reason you made --until $ARG7$instead of just appending it to the CLI as $ARG9$? if it was ARG9, the others wouldn't have to be reordered (not that I see why it makes a difference anyway) [14:33:14] mobrovac: one viable option I think might be to shut cassandra on 1007 [14:33:44] lemme check the other nodes godog [14:35:47] mobrovac: other nodes are fine afaics, it is 1002 and 1007 that are receiving data from 1001 [14:39:14] godog: rb1007 seems to be freeing up space, it's down to 95% now [14:39:52] godog: let's keep it up and monitor the situation [14:40:14] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [14:40:29] ottomata: just to have from / until in a visible order [14:40:37] ah ok [14:40:38] yeah ok [14:40:45] joal, i'll try to look into this today, mayyyybe tomorrow [14:41:03] ottomata: thanks man, I really don't get it [14:41:14] should be able to run the actual check_graphite command with the options that puppet will pass down the line and see what happens [14:41:30] mobrovac: what happened is that streaming from 1001 failed and it is continuing to stream to 1002 which has enough disk [14:41:59] ah ok [14:42:19] godog: we're still not out of the woods, as compaction may build it up again for rb1007 [14:42:26] ottomata: yeah the change seems fine, it might be icinga though [14:42:35] yeah that's true [14:42:51] Thanks ottomata [14:43:28] (03PS7) 10Muehlenhoff: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [14:47:10] (03PS1) 10coren: Add nsswitch_conf_source as parameter to 
ldap::client::nss [puppet] - 10https://gerrit.wikimedia.org/r/255113 [14:47:21] mutante: ^^ new approach. [14:49:54] 6operations, 7Database, 7Wikimedia-log-errors: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1828049 (10jcrespo) p:5Normal>3Low Nah, it is still happe... [14:56:55] apergos: When you have a minute, can you check https://gerrit.wikimedia.org/r/255113 please? [14:58:44] lookin. [15:00:42] (03PS6) 10Muehlenhoff: Exclude apport from toollabs genpp python list [puppet] - 10https://gerrit.wikimedia.org/r/254156 [15:01:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Exclude apport from toollabs genpp python list [puppet] - 10https://gerrit.wikimedia.org/r/254156 (owner: 10Muehlenhoff) [15:02:51] (03PS3) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [15:02:53] (03PS4) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [15:02:55] (03PS4) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [15:02:57] (03PS3) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [15:03:51] (03CR) 10Muehlenhoff: "python-apport as installed by toollabs has been cleared in" [puppet] - 10https://gerrit.wikimedia.org/r/253593 (owner: 10Muehlenhoff) [15:04:30] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:04:37] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 2/3 [puppet] - 
10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:04:52] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:05:26] (03CR) 10jenkins-bot: [V: 04-1] varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [15:05:31] blah [15:08:03] (03PS4) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [15:08:05] (03PS5) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [15:08:07] (03PS5) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [15:08:09] (03PS4) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [15:08:20] (03CR) 10Filippo Giunchedi: [C: 04-1] "minor nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255080 (owner: 10Yuvipanda) [15:08:50] -256 [15:08:55] oops [15:09:33] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:09:35] (03PS1) 10Muehlenhoff: Add comment on server_id parameter in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255115 [15:09:45] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:09:48] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 3/3 [puppet] - 
10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:10:30] (03CR) 10jenkins-bot: [V: 04-1] varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [15:10:37] (03CR) 10ArielGlenn: [C: 031] Add nsswitch_conf_source as parameter to ldap::client::nss [puppet] - 10https://gerrit.wikimedia.org/r/255113 (owner: 10coren) [15:11:08] (03PS2) 10coren: Add nsswitch_conf_source as parameter to ldap::client::nss [puppet] - 10https://gerrit.wikimedia.org/r/255113 [15:12:35] (03CR) 10coren: [V: 032] Add nsswitch_conf_source as parameter to ldap::client::nss [puppet] - 10https://gerrit.wikimedia.org/r/255113 (owner: 10coren) [15:12:46] (03CR) 10coren: [C: 032] Add nsswitch_conf_source as parameter to ldap::client::nss [puppet] - 10https://gerrit.wikimedia.org/r/255113 (owner: 10coren) [15:14:31] I kan haz suxess! [15:18:01] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1828101 (10fgiunchedi) thanks @bd808 for the investigation! indeed as @mmodell points out it seems simpler to reserve an uid and have those matching on tin/mira given the above [15:22:04] (03PS1) 10coren: Labs: remove has_admin override for labs::nfs::fileserver [puppet] - 10https://gerrit.wikimedia.org/r/255118 (https://phabricator.wikimedia.org/T87870) [15:22:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:18] paravoid: Do you want to do the honors? 
^ :-) [15:22:29] (03PS2) 10Filippo Giunchedi: diamond: use upstream StatsdHandler [puppet] - 10https://gerrit.wikimedia.org/r/254872 (https://phabricator.wikimedia.org/T116033) [15:22:34] sorry, busy [15:22:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: use upstream StatsdHandler [puppet] - 10https://gerrit.wikimedia.org/r/254872 (https://phabricator.wikimedia.org/T116033) (owner: 10Filippo Giunchedi) [15:22:52] paravoid: kk. Just thought you'd enjoy it - there is no need for you to be the one doing it. :-P [15:27:43] !log disabling puppet on both gallium and scandium [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:50] (03PS2) 10Andrew Bogott: zuul: support for zuul-merger gerrit ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:29:48] (03CR) 10Andrew Bogott: [C: 032] zuul: support for zuul-merger gerrit ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:30:30] (03PS5) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [15:30:32] (03PS6) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [15:30:34] (03PS6) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [15:30:36] (03PS5) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [15:32:11] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:32:17] (03CR) 10jenkins-bot: [V: 04-1] varnish: refactor instance 
parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [15:32:20] (03PS2) 10coren: Labs: remove has_admin override for labs::nfs::fileserver [puppet] - 10https://gerrit.wikimedia.org/r/255118 (https://phabricator.wikimedia.org/T87870) [15:32:22] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:32:32] apergos: ^^ w/ extra too [15:32:47] (03CR) 10jenkins-bot: [V: 04-1] misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [15:33:13] (03CR) 10ArielGlenn: [C: 031] Labs: remove has_admin override for labs::nfs::fileserver [puppet] - 10https://gerrit.wikimedia.org/r/255118 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [15:33:31] yay [15:33:40] !log enabled puppet on both gallium and scandium [15:33:41] (03PS3) 10coren: Labs: remove has_admin override for labs::nfs::fileserver [puppet] - 10https://gerrit.wikimedia.org/r/255118 (https://phabricator.wikimedia.org/T87870) [15:33:43] (03PS6) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [15:33:45] (03PS7) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:47] (03PS7) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [15:33:49] (03PS6) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [15:35:18] (03CR) 10coren: [C: 032] "Praise 
$deity!" [puppet] - 10https://gerrit.wikimedia.org/r/255118 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [15:37:50] apergos: Huh. Created the users and group (ops) as expected, but: [15:37:56] Error: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [15:37:56] Error: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]/returns: change from notrun to 0 failed: /usr/local/sbin/enforce-users-groups returned 1 instead of one of [0] [15:38:13] !log restarted analytics1030 to pick up openjdk security update (plus updates for libpng, nss, nspr, pixbuf and libxml as used by openjdk) [15:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:19] some user in there with a dup or bad uid I guess [15:38:49] I wonder which user it doesn't like [15:39:17] '/usr/local/sbin/enforce-users-groups dryrun' unhelpfully exits with 1 but no message. [15:39:29] what host is that? [15:39:37] labstore1001 [15:39:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 47677 bytes in 8.198 second response time [15:40:45] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:41:45] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [15:42:03] if [[ `hostname -s` =~ ^labstore100 ]]; then [15:42:03] exit 1 [15:42:03] fi [15:42:07] that would be the problem [15:42:13] Oh lulz. [15:42:20] Special case for the loss. [15:42:26] yup [15:42:41] * Coren fixies. [15:43:15] PROBLEM - puppet last run on labstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:43:16] PROBLEM - RAID on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
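The failure discussed above came from a hostname guard that exits 1 without saying why, which is what made the dry run so hard to read. A minimal sketch of the same kind of guard that reports its reason on stderr follows. It assumes a hypothetical function name `guard_host`, takes the hostname as an optional argument so it can be tried on any machine, and uses a POSIX `case` in place of the bash-only `[[ =~ ]]` from the quoted lines. Nothing here is taken from the real enforce-users-groups script beyond the `labstore100` prefix.

```shell
#!/bin/sh
# Hypothetical sketch: a hostname guard that explains itself before bailing out.
# guard_host is an illustrative name, not part of enforce-users-groups.
guard_host() {
    # Use the first argument if given, otherwise the short hostname.
    host="${1:-$(hostname -s)}"
    case "$host" in
        labstore100*)
            # Report the reason on stderr instead of exiting silently.
            echo "refusing to run on ${host}: host is special-cased" >&2
            return 1
            ;;
    esac
    return 0
}

guard_host labstore1001 || echo "guard tripped with status $?"
guard_host mw1147 && echo "guard passed"
```

With a message like this, a non-zero exit in a puppet run would at least point at the guard rather than at the user list.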
[15:43:34] (03PS2) 10Jhobs: Third QuickSurveys external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254908 (https://phabricator.wikimedia.org/T116433) [15:43:34] PROBLEM - Disk space on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:43:45] PROBLEM - configured eth on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:43:53] (03CR) 10Zfilipin: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:43:54] PROBLEM - Check size of conntrack table on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:43:57] (03CR) 10Zfilipin: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:44:00] (03CR) 10Zfilipin: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/254855 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:44:11] (03CR) 10jenkins-bot: [V: 04-1] RuboCop: fixed Lint/UnusedMethodArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:44:14] PROBLEM - dhclient process on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:44:15] PROBLEM - DPKG on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:44:29] (03CR) 10jenkins-bot: [V: 04-1] RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:44:46] PROBLEM - salt-minion processes on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:44:54] PROBLEM - puppet last run on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:45:27] (03PS2) 10Andrew Bogott: Remove lanthanum.eqiad.wmnet hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/254132 (https://phabricator.wikimedia.org/T86658) (owner: 10Hashar) [15:46:10] (03PS1) 10coren: Labs: remove labstore special case from enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/255119 (https://phabricator.wikimedia.org/T87870) [15:46:22] apergos: ^^ [15:46:41] (03CR) 10Andrew Bogott: [C: 032] Remove lanthanum.eqiad.wmnet hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/254132 (https://phabricator.wikimedia.org/T86658) (owner: 10Hashar) [15:46:50] (03CR) 10ArielGlenn: [C: 031] Labs: remove labstore special case from enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/255119 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [15:47:28] (03PS2) 10coren: Labs: remove labstore special case from enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/255119 (https://phabricator.wikimedia.org/T87870) [15:48:59] (03CR) 10coren: [C: 032] "Special cases are bad, mmmkay?" [puppet] - 10https://gerrit.wikimedia.org/r/255119 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [15:49:41] hey coren just to confirm, the labvirt* and the labnet/labnodepool hosts have a special stanza on the routers for network isolation right? [15:50:08] They should indeed. [15:50:16] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:50:28] ok, gotta get the new saltmaster added to that too. thanks [15:50:57] apergos: There should be two stanzas, actually. "labs hosts" and "labs support". [15:51:15] Coren: did you remember to remove the exception from the enforce-user-groups script? 
[15:51:16] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review, 7Ruby: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1828231 (10zeljkofilipin) 5Open>3Resolved [15:51:18] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review, 7WorkType-Maintenance: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1828232 (10zeljkofilipin) [15:51:36] paravoid: No, but that was quickly found and fixed. :-) [15:51:38] (03PS1) 10EBernhardson: Enable es labs replica writes for nlwiki, frwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255122 [15:52:10] cool :) [15:52:30] paravoid: The deed is done, btw. labstore* are now normal hosts. [15:52:40] yay [15:52:46] RECOVERY - puppet last run on labstore1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:53:04] paravoid: That LD_PRELOAD trickery was outright inspired. [15:53:37] the other option would be running it in a different namespace, but it'd be more complicated [15:53:42] this one is simple enough I think [15:54:19] nice work on the implementation [15:54:27] paravoid: Interestingly enough, it's usable for other tools (like getent, etc) so you can still debug the LDAP-ized view. I should probably make a small generic 'useldap ' and put it in the package. [15:54:33] it took a little while longer than I would have liked, but it's nice to finally see it done :) [15:54:52] yeah, that doesn't sound like a bad idea at all [15:55:10] it should work for even things like "ls" [15:55:15] * Coren does so before he forgets since it's short-and-sweet.
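A wrapper of the kind described here, one that runs an arbitrary command with an LD_PRELOAD shim in place, could be sketched roughly as below. This is an illustration only: the function name `run_with_preload` and the `PRELOAD_LIB` path are placeholders, since the log does not show the real script or the library it preloads.

```shell
#!/bin/sh
# Hypothetical sketch of a generic "run a command under an LD_PRELOAD shim" wrapper.
# PRELOAD_LIB is a placeholder; the real library path is not given in the log.
PRELOAD_LIB="${PRELOAD_LIB:-/path/to/shim.so}"

run_with_preload() {
    if [ "$#" -eq 0 ]; then
        # Double quotes so $0 expands; in single quotes it would print literally.
        echo "usage: $0 command [args...]" >&2
        return 2
    fi
    # "$@" passes every argument through as its own word, spaces included.
    LD_PRELOAD="$PRELOAD_LIB" "$@"
}

# printf is a shell builtin here, so this only demonstrates argument passing.
run_with_preload printf '%s\n' "two words" one
```

The preload itself only takes effect for programs the shell actually executes, such as `getent` or `ls`. With the placeholder path above, the dynamic linker would just warn that the library cannot be found and run the command unchanged.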
[15:56:27] 6operations, 10netops: add new saltmaster (neodymium) to network exceptions for labvirt* etc hosts - https://phabricator.wikimedia.org/T119512#1828250 (10ArielGlenn) 3NEW [15:59:45] (03PS8) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151124T1600). Please do the needful. [16:00:04] James_F jhobs MatmaRex kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] * James_F waves. [16:00:21] yo [16:00:22] sup [16:01:01] I can SWAT today, I'll run through config first and then move on to the core patches. James_F you're up :) [16:01:13] Hey. [16:02:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:02:15] (03PS1) 10coren: Add 'useldap' command line utility [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/255125 [16:02:18] here [16:02:27] paravoid: ^^ [16:02:49] (03Merged) 10jenkins-bot: Enable VisualEditor for all new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:03:08] Gosh. [16:03:57] Coren: not "$@" I think? [16:04:07] oh hrm, that will be tricky [16:04:14] hmm, looks like portals has a dirty workdir, anyone know anything there? [16:04:23] paravoid: Yeah, "$@" has the right magic. [16:04:31] yeah I guess so, nevermind [16:04:35] lgtm [16:04:50] But that $0 in single quotes is not going to be so useful. 
;_0 [16:04:56] (03PS2) 10coren: Add 'useldap' command line utility [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/255125 [16:05:00] heh true [16:06:21] (03CR) 10coren: [C: 032 V: 032] " lgtm" [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/255125 (owner: 10coren) [16:06:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for all new accounts on eswiki [[gerrit:250473]] (duration: 00m 46s) [16:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:50] ^ James_F sync'd! [16:06:51] * Coren builds, puts in the repo, and *finally* ends working on that. [16:06:55] thcipriani: Whee. [16:07:32] 6operations: Incomplete puppet fact generation for packages held back for adding new binary packages - https://phabricator.wikimedia.org/T119503#1828272 (10akosiaris) [16:08:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254908 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:08:39] (03Merged) 10jenkins-bot: Third QuickSurveys external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254908 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [16:08:45] !log starting pybal on lvs3001 again, testing is over [16:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:25] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:10:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Third QuickSurveys external survey [[gerrit:254908]] (duration: 00m 30s) [16:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:56] ^ jhobs check please [16:12:00] thcipriani: looks good thanks! [16:12:08] jhobs: thanks for checking! 
[16:12:26] (03PS2) 10Jforrester: Enable VisualEditor for all accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) [16:12:34] MatmaRex: I'm going to come back to yours because Zend tests. [16:13:00] (03CR) 10Jforrester: "Scheduled for Dec 1st." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:13:06] sure. [16:13:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255069 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [16:14:35] (03Merged) 10jenkins-bot: CX: Fix article-recommender-1 campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255069 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [16:15:36] (03PS2) 10ArielGlenn: Add jgirault and jdrewniak to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/254409 (owner: 10Andrew Bogott) [16:16:00] 6operations: Incomplete puppet fact generation for packages held back for adding new binary packages - https://phabricator.wikimedia.org/T119503#1828294 (10akosiaris) That specific behavior was added in https://github.com/servermon/servermon/commit/d8a25f0c2dda2a5465096b2af67c65dd204bf989 when dist_upgrade was... [16:16:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Fix article-recommender-1 campaign [[gerrit:255069]] (duration: 00m 27s) [16:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:48] ^ kart_ check please [16:17:03] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1828296 (10faidon) 3NEW [16:17:13] (03CR) 10ArielGlenn: [C: 032] Add jgirault and jdrewniak to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/254409 (owner: 10Andrew Bogott) [16:17:59] thcipriani: good now. [16:18:03] kart_: thanks! 
[16:18:08] thcipriani: var_dump( $wgContentTranslationCampaigns ); works as expected. [16:19:02] ebernhardson: ping for SWAT, you around? [16:19:24] thcipriani: yup [16:19:36] kk, you're up :) [16:19:51] :) [16:20:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255122 (owner: 10EBernhardson) [16:20:46] (03Merged) 10jenkins-bot: Enable es labs replica writes for nlwiki, frwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255122 (owner: 10EBernhardson) [16:21:06] ebernhardson: anything special needed before sync? [16:21:48] thcipriani: nope, just sync [16:21:51] kk [16:23:13] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable es labs replica writes for nlwiki, frwiki and eswiki [[gerrit:255122]] (duration: 00m 27s) [16:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:20] ^ ebernhardson check please [16:24:01] thcipriani: writes look to be going through, replica server generally still looks happy [16:24:12] mostly i just have to monitor it over the next hour for things to start backing up [16:24:21] ebernhardson: okie doke, thanks! [16:24:22] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1828322 (10ArielGlenn) We may still need to give you the right access on stat1003; your accounts are up on s... 
[16:25:19] 6operations, 10Traffic: upgrade lvs1001-3 to jessie - https://phabricator.wikimedia.org/T119517#1828325 (10BBlack) 3NEW a:3BBlack [16:25:34] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: No response from NTP server [16:25:47] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#1828335 (10BBlack) [16:25:48] 6operations, 10Traffic: upgrade lvs1001-3 to jessie - https://phabricator.wikimedia.org/T119517#1828325 (10BBlack) [16:27:52] (03PS2) 10ArielGlenn: adding ejegg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252778 (https://phabricator.wikimedia.org/T118320) (owner: 10RobH) [16:29:20] (03CR) 10ArielGlenn: [C: 032] adding ejegg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252778 (https://phabricator.wikimedia.org/T118320) (owner: 10RobH) [16:29:23] 6operations, 10Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1828339 (10demon) >>! In T119409#1827772, @Paladox wrote: > Yes. But why was gitblit updated when it was deprecated. > Don't ask me, I didn't upgrade it. [16:29:37] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1828342 (10Ottomata) If one has stat1002 access via the `analytics-privatedata-users` group, they can read th... [16:30:54] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1828344 (10ArielGlenn) p:5Triage>3Normal a:3ArielGlenn [16:33:24] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1828347 (10ArielGlenn) Robh already had a patch, it's merged, your account is created and ready on stat1002. Please verify that access works for you and I'll close this task. 
[16:36:14] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1828355 (10faidon) 3NEW [16:36:59] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1828361 (10Ejegg) 5Open>3Resolved Just logged in with no problems. Thanks @ArielGlenn and @RobH! [16:38:07] godog: got a minute to help with an apt mystery? [16:38:29] (03PS1) 10Mdann52: Add rights to CU+OS groups on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255129 (https://phabricator.wikimedia.org/T119446) [16:38:48] (03PS1) 10EBernhardson: Optimize elasticsearch on nobelium for spinning disks [puppet] - 10https://gerrit.wikimedia.org/r/255130 [16:40:08] !log thcipriani@tin Synchronized php-1.27.0-wmf.7/resources/lib/oojs-ui/oojs-ui.js: SWAT: OOjs UI: Backport 4fbbc737c86b500c11bbb471ec1001c50ab8853c [[gerrit:255032]] (duration: 00m 28s) [16:40:10] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1828364 (10ArielGlenn) It looks like this is the first access request for Bartosz. Please have him go through the steps here https://wikitech.wi... [16:40:12] (03CR) 10Chad: [C: 031] Optimize elasticsearch on nobelium for spinning disks [puppet] - 10https://gerrit.wikimedia.org/r/255130 (owner: 10EBernhardson) [16:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:25] MatmaRex check oojs backport please [16:40:57] ebernhardson: it's 2015 who uses spinning metal geez :p [16:41:05] doing [16:41:08] ostriches: we do :P [16:42:01] although with any luck...when we get proper hardware for labs es replicas they will have ssd's :) [16:42:36] thcipriani: like a charm. thanks! [16:42:44] MatmaRex: thanks for checking! [16:42:56] ebernhardson: mmmm, ssds. 
and the good ones, too :) [16:44:03] andrewbogott: sure [16:44:19] godog: log into holmium.wikimedia.org [16:44:35] then $apt-cache show designate [16:44:45] and then $ apt-get install --dry-run designate [16:44:53] and explain why it’s trying to install 4.x when 5.x is available? [16:47:29] 6operations: Weird message from the Facebook team to list admins - https://phabricator.wikimedia.org/T119232#1828377 (10ArielGlenn) This looks to me like someone sent mail to the Education list from a possibly forged facebook address that ends up at that blackhole of noreply@. I expect it was spam. But if you c... [16:47:44] andrewbogott: apt-cache policy designate has the answer, 2014.1-18 comes from "us" which has priority 1000 [16:48:11] godog: ok… I thought that pinning only mattered when the same version was available from two places [16:48:47] anybody here taking care of puppet swat? [16:49:41] andrewbogott: odd though, see /etc/apt/preferences.d/ubuntucloud.pref [16:50:03] godog: I just now added that [16:50:10] godog: but note that it makes no difference [16:50:17] (this is now back to where I got yesterday) [16:50:30] maybe I just misnamed something in that file [16:51:27] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [16:51:42] 6operations: apt-get update partial failure lots of places - https://phabricator.wikimedia.org/T119242#1828384 (10ArielGlenn) did you do the 'fix', i.e. remove the apt lists cache? I've seen apt in puppet fail numerous times because of either this sort of race condition or something being interrupted; either way... [16:52:02] SMalyshev: I think it's apergos and mutante fyi this week [16:52:14] puppet swat? [16:52:23] yes that's me and mutante indeed [16:52:28] chasemp: thanks, the Deployments page doesn't have anybody named [16:52:29] are you signed up for the slot today? 
[16:52:39] apergos: yes [16:52:42] sweet [16:52:52] I was afraid after looking yesterday we would have no takers [16:53:25] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [16:53:28] chasemp: apergos mutante: whoever does puppet swat, please update [[wikitech:Deployments]] before :) [16:53:54] ah. sorry about that! first time on this for me :-) [16:54:01] apergos, mutante, y’all should add yourselves to the deployment calendar for puppetswat. https://wikitech.wikimedia.org/wiki/Deployments#Week_of_November_23rd [16:54:16] I'm going to do that right now [16:54:22] apergos: no worries, it's annoying, two places to track the info (I assume you all have it noted in your weekly ops meeting notes as well/primarily) [16:54:27] we do [16:54:36] * greg-g nods [16:54:36] usually a few weeks ahead even [16:54:38] anyways [16:54:51] but no one reads the ops notes :) [16:55:00] (office wiki and all) [16:56:04] 6operations, 7Graphite: provide aggregated cluster data with graphite, similar to ganglia - https://phabricator.wikimedia.org/T119520#1828394 (10fgiunchedi) 3NEW [16:56:19] andrewbogott: I'm checking [16:56:21] I tried to designate an official person to communicate puppet swat and clinic duty, and the list spoke as one that whoever is on duty should communicate it themselves :) [16:56:24] godog: thanks! 
[16:56:30] (03PS2) 10Smalyshev: Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) [16:59:15] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [16:59:27] andrewbogott: the problem is that the preference file doesn't match o=ubuntucloud, you probably want o=Canonical or more likely n=trusty-updates/kilo to pin it even further [16:59:40] andrewbogott: head /var/lib/apt/lists/ubuntu-cloud.archive.canonical.com_ubuntu_dists_trusty-updates_kilo_Release [17:00:04] Deploy window Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151124T1700) [17:00:05] Smalyshev ebernhardson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:26] heh updated the page just in time [17:00:27] I'm here :) [17:00:47] all right, lemme have a look at those patches! [17:00:52] so https://gerrit.wikimedia.org/r/#/c/254378/ is very simple, just adding a counter [17:01:16] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:01:16] godog: so what exactly is named in the preferences file? It doesn’t relate to the sources.list.d entry I take it? 
[17:02:43] (03PS3) 10Smalyshev: Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) [17:02:45] yup and it lgtm so I'll rebase and merge it [17:03:13] (03PS3) 10ArielGlenn: WDQS Also use queryStartCount counter [puppet] - 10https://gerrit.wikimedia.org/r/254378 (https://phabricator.wikimedia.org/T119178) (owner: 10Addshore) [17:03:17] (03PS1) 10EBernhardson: Use event-schemas repository for avro schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 [17:03:39] (03PS4) 10Smalyshev: Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) [17:03:42] (03CR) 10jenkins-bot: [V: 04-1] Use event-schemas repository for avro schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 (owner: 10EBernhardson) [17:03:52] andrewbogott: no, you have selectors based on the Release file downloaded from sources.list entries, the gory details are in apt_preferences(5) [17:04:01] https://gerrit.wikimedia.org/r/#/c/254497 is simple too - just adding endpoint alias. I'm cleaning some whitespace and it's ready [17:04:11] godog: ok. Surely there’s puppet to do this already, looking... [17:04:23] ok, ready [17:06:34] (03CR) 10EBernhardson: "the test will fail because the schemas havn't been merged yet, but also it looks like we will need to init submodules in the test run (i'm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 (owner: 10EBernhardson) [17:08:13] ok that first patch is live on the wdqs hosts now [17:08:17] checking the next patch [17:10:27] that also lgtm, same routine, rebase, merge, run [17:10:39] (03PS5) 10ArielGlenn: Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) (owner: 10Smalyshev) [17:10:57] godog: ok, I think I understand now. Thank you! 
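The point godog makes above is that the selectors in a preferences file match fields from the repository's Release file, not the name of the sources.list.d entry. A minimal sketch of a pin stanza using the `n=trusty-updates/kilo` selector he suggests is shown below, written to a scratch file rather than to /etc/apt/preferences.d/. The priority 1001 is an assumed value, picked only so that it outranks the 1000 reported by `apt-cache policy`, and `Package: *` follows the stanza mentioned earlier in the discussion; neither is confirmed as the final configuration.

```shell
#!/bin/sh
# Sketch only: writes an apt preferences stanza to a temp file for inspection.
# The priority value is an assumption chosen for illustration.
pref_file="$(mktemp)"
cat > "$pref_file" <<'EOF'
Package: *
Pin: release n=trusty-updates/kilo
Pin-Priority: 1001
EOF
cat "$pref_file"
```

Once a stanza like this is in place, `apt-cache policy designate` (the command used above) is the way to confirm which priority each candidate version actually received.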
[17:11:46] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1828421 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [17:11:53] (03CR) 10ArielGlenn: [C: 032] Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) (owner: 10Smalyshev) [17:12:06] apergos: thank you [17:12:47] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1828355 (10MoritzMuehlenhoff) And mid-term we'll mostly likely move to 4.4.x as the next LTS kernel (replacing 3.19 as our jessie kernel). [17:12:56] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [17:14:54] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:15:32] SMalyshev: live on both wdqs hosts, please check that it's all working as you expect [17:16:07] ebernhardson: you're up now. [17:16:42] apergos: sweet [17:17:16] apergos: seems to be working fine, thanks [17:17:26] great. [17:17:37] apergos: i have the necessary perms to reboot elasticsearch on nobelium, so just the deploy and eventual puppet run is enough [17:17:46] okey dokey [17:17:51] just looking at the tuning doc now [17:18:15] <_joe_> SMalyshev: wdqs1002 has almost a full disk [17:18:47] <_joe_> SMalyshev: is there a way to compact the jnl file on disk? [17:18:48] _joe_: I know. It's because of https://phabricator.wikimedia.org/T119398 [17:19:04] _joe_: yes, but it needs another 100G for that, unfortunately [17:19:25] _joe_: I was about to ask if we can add bigger non-ssd disk there for stuff like that [17:19:35] <_joe_> SMalyshev: I don't think we have 100 more gigs [17:19:37] <_joe_> :/ [17:19:43] _joe_: compacting is not in-place unfortunately [17:19:50] <_joe_> oh ok [17:20:08] <_joe_> SMalyshev: please file tasks, I definitely don't have time to tackle this right now [17:20:09] _joe_: can't we just install there a regular SATA drive? 
those are huge nowadays [17:20:26] <_joe_> I just enlarged the LV in an emergency the other day [17:20:32] !log stop decommissioning restbase1001, bounce cassandra [17:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:39] <_joe_> SMalyshev: I'm not sure [17:20:53] ebernhardson: looks good. starting the rebase merge run train [17:20:58] _joe_: yes, I will. I just discovered that yesterday and so far I was trying to find out why it doubled suddenly. But I'll reload it over the holidays and will request more space there [17:21:03] (03PS2) 10ArielGlenn: Optimize elasticsearch on nobelium for spinning disks [puppet] - 10https://gerrit.wikimedia.org/r/255130 (owner: 10EBernhardson) [17:22:20] (03CR) 10ArielGlenn: [C: 032] Optimize elasticsearch on nobelium for spinning disks [puppet] - 10https://gerrit.wikimedia.org/r/255130 (owner: 10EBernhardson) [17:23:01] (03Abandoned) 10EBernhardson: Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 (owner: 10EBernhardson) [17:26:36] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [17:28:26] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:28:37] ebernhardson: seems the count might have already been 1 in the config file on nobelium [17:28:46] anyways, after a puppet run it's still 1 so there you go [17:28:48] 6operations: Weird message from the Facebook team to list admins - https://phabricator.wikimedia.org/T119232#1828488 (10Selsharbaty-WMF) Thanks for your reply, @ArielGlenn. I do the moderator queue clean-up regularly. There is nothing related and there are currently no moderation requests at all! [17:29:06] apergos: hmm, i suppose i didn't look directly was just reviewing the puppet.
thanks [17:29:25] sure [17:29:45] RECOVERY - DPKG on labservices1001 is OK: All packages OK [17:29:51] no one else listed for puppet swat but I'm here til the window ends [17:31:27] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1828490 (10fgiunchedi) I've started decommissioning `restbase1001` on monday, though there isn't enough space on `restbase1007` to hold half data... [17:34:12] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1828504 (10jcrespo) This is the physical view: ``` root@db1046:/srv$ df -h | grep /srv /dev/mapper/tank-data 1.4T 1.3T 136G 91% /srv root@db1046:/srv$ du -h --max-depth=2 663G ./sqlda... [17:34:35] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [17:35:04] (03Abandoned) 10Chad: Remove myself from udp2log-users, don't need [puppet] - 10https://gerrit.wikimedia.org/r/254408 (owner: 10Chad) [17:36:25] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:37:48] 10Ops-Access-Requests, 6operations: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1828516 (10Andrew) 3NEW [17:38:06] 10Ops-Access-Requests, 6operations: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1828525 (10Andrew) [17:38:46] (03PS2) 10Andrew Bogott: contint: grant zuul-merger sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [17:39:21] (03CR) 10Andrew Bogott: [C: 031] "this needs to go through Ops access review, but looks fine to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [17:40:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1828538 (10ArielGlenn) I suppose as this is a sudo rights ticket it will have to be discussed at the ops meeting. However I'm for it, yesterday even.... [17:44:55] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:54] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:15] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:35] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:54] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:54] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:14] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:14] PROBLEM - HHVM processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:16] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:16] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:35] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:55] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:55] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:49:10] ottomata: http://www.confluent.io/blog/apache-kafka-0.9-is-released [17:49:32] woobooy [17:51:54] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:52:09] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#1828624 (10Andrew) 3NEW a:3Andrew [17:52:49] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#1828624 (10Andrew) um... not yet, though, I have to merge some other stuff. [17:55:34] lots to learn there [17:55:36] lunchtime! [17:56:36] jouncebot, next [17:56:37] In 6 hour(s) and 3 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151125T0000) [17:56:42] jouncebot, help [17:56:46] jouncebot, refresh [17:56:48] I refreshed my knowledge about deployments. [17:57:21] grr, it still uses notices? [17:57:25] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:57:35] RECOVERY - Disk space on mw1147 is OK: DISK OK [17:57:36] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:57:36] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 45 minutes ago with 0 failures [17:57:54] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full [17:58:15] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [17:58:34] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:58:35] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [17:58:45] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm [17:58:46] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [17:58:50] (03CR) 10Luke081515: [C: 031] Add rights to CU+OS groups on en.wikipedia.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/255129 (https://phabricator.wikimedia.org/T119446) (owner: 10Mdann52) [17:58:55] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [17:58:55] RECOVERY - DPKG on mw1147 is OK: All packages OK [17:59:13] !log nodetool cleanup on restbase1007 [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:25] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.247 second response time [18:00:44] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 66511 bytes in 1.979 second response time [18:03:00] 6operations: Incomplete puppet fact generation for packages held back for adding new binary packages - https://phabricator.wikimedia.org/T119503#1828698 (10faidon) I think we should keep dist_upgrade=True. The linux-meta case -or the equivalent in Debian upstream, linux-image-amd64- is actually a good example ca... [18:06:48] 6operations, 10Beta-Cluster-Infrastructure, 10netops, 7Database: Evaluate security concerns of logging beta cluster db queries on tendril - https://phabricator.wikimedia.org/T119461#1828715 (10faidon) This isn't a strictly #netops concern but I'll give my opinion anyway: I think it's a very bad idea to hav... [18:06:57] 6operations, 10Beta-Cluster-Infrastructure, 7Database: Evaluate security concerns of logging beta cluster db queries on tendril - https://phabricator.wikimedia.org/T119461#1828717 (10faidon) [18:08:16] 6operations, 10Beta-Cluster-Infrastructure, 7Database: Evaluate security concerns of logging beta cluster db queries on tendril - https://phabricator.wikimedia.org/T119461#1828724 (10jcrespo) 5Open>3Resolved a:3jcrespo [18:18:53] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1828797 (10Deskana) p:5High>3Low Lowering priority to reflect the reality of the team's prioritisation. 
[18:20:08] (03PS1) 10JanZerebecki: Fix redirect that come in via https to target https [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) [18:21:15] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [18:21:27] (03CR) 10Andrew Bogott: [C: 031] "seems better!" [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) (owner: 10coren) [18:22:05] (03PS2) 10coren: Tool labs: start gridengine-master by default [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) [18:22:50] (03PS7) 10BBlack: varnish: refactor instance parameters (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) [18:23:12] (03CR) 10BBlack: [C: 032 V: 032] "Compiler verified no-op (other than monitoring descriptions)" [puppet] - 10https://gerrit.wikimedia.org/r/255107 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [18:23:15] (03CR) 10coren: [C: 032] Tool labs: start gridengine-master by default [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) (owner: 10coren) [18:24:45] (03PS3) 10coren: Tool labs: start gridengine-master by default [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) [18:24:47] (03PS7) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [18:24:49] (03PS2) 10JanZerebecki: Fix wikidata redirect that come in via https to target https [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) [18:25:34] (03CR) 10coren: [V: 032] "(rebased)" [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) (owner: 10coren) [18:27:28] 6operations: Incomplete puppet fact generation for packages held back for adding new binary packages - 
https://phabricator.wikimedia.org/T119503#1828840 (10MoritzMuehlenhoff) 5Open>3declined a:3MoritzMuehlenhoff Ok, I'm convinced. Let's close this one, no need to keep it open for the servermon tooltip (tra... [18:27:32] (03PS1) 10JanZerebecki: Fix api redirect that come in via https to target https [puppet] - 10https://gerrit.wikimedia.org/r/255150 [18:28:36] (03PS8) 10BBlack: misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) [18:28:55] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:29:34] (03CR) 10BBlack: [C: 032 V: 032] misc-cluster 2layer refactor, step 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/255108 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [18:30:24] !log depooling cp1056 for misc-cluster 2layer work... [18:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:55] Coren: ok to merge? [18:31:22] (03CR) 10Daniel Kinzler: Fix wikidata redirect that come in via https to target https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: 10JanZerebecki) [18:31:29] bblack: Yeah, sorry. [18:31:37] merged! [18:32:06] bblack: and here I was trying to figure out why my change seemed to have no effect. :-) [18:32:49] lol [18:33:23] Coren: bblack fun fact: puppet-merge has no effect on labs instances [18:33:29] labs puppetmaster pullts git repos every minute [18:33:36] and is unconnected to puppet merge completely [18:33:39] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:54] YuviPanda: Is that new? [18:33:59] Coren: nope has always been the case [18:34:06] YuviPanda: But that /also/ explains why my change wasn't yet in. 
:-) [18:34:19] the timing works out usually and you just think your puppet-merge helped [18:34:27] and if you don't puppet-merge it's too quick usually and doesn't work [18:34:37] which furthers the belief that you need to puppet-merge [18:35:05] Hah. [18:35:30] (03PS4) 10Mdann52: Rename two namespaces at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [18:36:26] Etherpad seems to be down. [18:36:31] PROBLEM - Varnishkafka log producer on cp1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [18:38:08] !log restarted apache2 on etherpad [18:38:10] niedzielski: fixed [18:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:22] YuviPanda: thank you :) [18:39:21] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [18:39:30] akosiaris: in this case etherpad was up and apache was down [18:39:32] booo [18:43:10] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:44:10] RECOVERY - Varnishkafka log producer on cp1056 is OK: PROCS OK: 1 process with command name varnishkafka [18:47:21] (03PS1) 10BBlack: fix storage sizes for varnish misc [puppet] - 10https://gerrit.wikimedia.org/r/255153 [18:48:30] (03CR) 10BBlack: [C: 032] fix storage sizes for varnish misc [puppet] - 10https://gerrit.wikimedia.org/r/255153 (owner: 10BBlack) [18:49:17] (03PS1) 10Yuvipanda: puppetmaster: Make sure base::puppet is present [puppet] - 10https://gerrit.wikimedia.org/r/255154 [18:49:28] (03PS1) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [18:49:28] milimetric: ^ can you cherry-pick this and test? 
[18:49:38] * milimetric does [18:49:48] <_joe_> YuviPanda: I got presents for you ^^ [18:49:48] (03PS1) 10BBlack: re-fix misc storage sizes [puppet] - 10https://gerrit.wikimedia.org/r/255156 [18:49:59] _joe_: <3 \o/ [18:50:23] <_joe_> it's not finished, and the puppet/ruby part is totally untested, but give it a look if you have time [18:50:32] (03CR) 10BBlack: [C: 032 V: 032] re-fix misc storage sizes [puppet] - 10https://gerrit.wikimedia.org/r/255156 (owner: 10BBlack) [18:50:44] _joe_: will do! [18:50:50] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:51:03] * _joe_ off for now [18:51:45] YuviPanda: "Error: Failed to apply catalog: Could not find dependency Package[git] for Exec[git_clone_operations/puppet] at /etc/puppet/modules/git/manifests/clone.pp:147" [18:51:53] hmmmm [18:52:14] !log repooled cp1056, depooled cp1057 (misc 2layer) [18:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:02] YuviPanda: you're welcome to login to limn1 and try whatever you want until you think you fixed it [18:53:25] I'll put my ::apache fix into my HEAD commit and reset the rest [18:54:16] (03PS2) 10Yuvipanda: puppetmaster: Make sure base::puppet is present [puppet] - 10https://gerrit.wikimedia.org/r/255154 [18:54:17] milimetric: can you try just this one? ^ [18:56:30] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [18:56:52] Is cp1065 having issues? [18:57:17] no [18:57:20] why? [18:57:20] hm [18:57:37] I noticed a 503 on phabricator (error page said it was served by cp1065), but when I reloaded the page I was not able to trigger the error anymore [18:57:45] sure it wasn't 1056? 
[18:57:48] Perhaps it was a one-time thing, idk [18:57:57] Eh [18:58:20] YuviPanda: Error: Failed to apply catalog: Could not find dependency File[/etc/ldap/ldap.conf] for Class[Puppet::Self::Config] at /etc/puppet/modules/puppet/manifests/self/master.pp:62 [18:58:21] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:58:51] I thought it was 1065.... but perhaps it was 1056 (but I'm 100% sure it is either 1065 or 1056) [18:59:32] milimetric: everything seems screwed in some form. I guess I'll file a bug and investigate at some point [18:59:42] Although I can't find 1065 in misc web caching in ganglia, so perhaps it's indeed 1056 like you say [19:00:05] (03PS3) 10Yuvipanda: puppetmaster: Make sure base::puppet is present [puppet] - 10https://gerrit.wikimedia.org/r/255154 [19:00:08] k, thx YuviPanda [19:00:08] hmm [19:00:19] _joe_: you *did* braek self hosted puppetmasters :) [19:00:42] _joe_: although not all of them and I'm not sure why some broke and some aren't [19:01:29] _joe_: one of mine is broken in case you wanna investigate I can help [19:05:58] bblack: meh. It looks like all my phabricator traffic is served by cp1056, but I'm not able to trigger it again [19:09:06] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1829090 (10yuvipanda) 3NEW [19:09:16] milimetric: ^ I've filed a bug [19:09:20] (03CR) 10Krinkle: [C: 031] Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 (owner: 10Aaron Schulz) [19:09:56] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1829098 (10yuvipanda) So my plan is to move everything that is to do with labs infrastructure in some form or way into labs/ and then rename all other things that run on top of labs to just b... 
[19:10:15] (03PS2) 10Aude: Enable data access for wikinews, meta-wiki, mediawiki.org and wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255063 [19:11:57] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [19:13:18] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1829111 (10RobH) [19:13:48] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [19:14:50] (03CR) 10Krinkle: [C: 032] Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [19:15:08] (03CR) 10jenkins-bot: [V: 04-1] Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [19:15:28] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1829123 (10RobH) [19:18:01] !log misc-cluster: cp1056+7 pooled (new config), cp1069+70 depooled (old config) [19:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:29] <_joe_> YuviPanda: how did I break them? [19:23:56] _joe_: so I think: Exec['compile puppet.conf'] -> Class['puppetmaster::ssl'] [19:24:10] somehow makes the puppet::config happen before ::base happens [19:24:15] and causes other things to fail [19:24:22] <_joe_> sigh [19:24:30] <_joe_> ofc that won't be catched by the compiler [19:24:41] yeah [19:24:44] <_joe_> but I didn't introduce that dependency [19:24:48] <_joe_> did I? 
[19:25:04] <_joe_> I might have included base::puppet into puppet::self::something [19:25:11] <_joe_> because it logically should be [19:25:19] <_joe_> well, bbiab [19:25:36] _joe_: heh, I didn't even git blame, I just assumed it was you :P let me actually find out [19:26:01] it's been around for 2y [19:26:04] not sure why it broke now [19:26:06] >_> [19:26:57] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [19:27:40] !log misc-cluster: all 4x nodes on new config and pooled [19:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:47] what is up with seaborgium? I'm seeing it flapping here for a while [19:29:20] 6operations, 10ops-eqiad: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1829155 (10RobH) 3NEW a:3Cmjohnson [19:29:35] 6operations, 10ops-eqiad: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1829155 (10RobH) [19:30:01] 6operations, 10ops-eqiad: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1829155 (10RobH) [19:30:03] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1829166 (10RobH) [19:30:11] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10RobH) [19:30:13] 6operations, 10ops-eqiad: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1829155 (10RobH) [19:30:36] !log restarted hadoop workers on analytics1028 to analytics1055 to pick up openjdk security update (plus updates for libpng, nss, nspr, pixbuf and libxml as used by openjdk) [19:30:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [19:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:07] 6operations, 10ops-eqiad: 
rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1829155 (10RobH) [19:31:09] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1829173 (10RobH) 5stalled>3Resolved These allocations have been approved and ordered on T117911. As such, arrival of those systems will be tracked on the ordering task, and the i... [19:31:14] it's only got standard and base::firewall on it which is [19:31:18] why I have been ignoring it [19:31:49] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [19:33:09] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:34:00] <_joe_> YuviPanda: I think it broke because of a recent patch of mine [19:34:17] <_joe_> which includes base::something somewhere in the puppet::self mess [19:34:26] _joe_: yay, my internal blame-r is more accurate than git :D [19:35:26] <_joe_> YuviPanda: actually, no [19:35:38] <_joe_> that doesn't make sense at all with what I did [19:36:38] the entire thing makes somewhat little sense to me [19:39:33] <_joe_> YuviPanda: try to find out when they broke and look at git log [19:39:51] _joe_: yup, prepping for another interview and I'll look at it after [19:40:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [19:42:18] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [19:47:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [19:53:23] (03PS1) 10Rush: check_legal_html compensate for non-impacting changes [puppet] - 10https://gerrit.wikimedia.org/r/255163 [19:53:39] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:55:19] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [19:55:47] (03PS2) 10Rush: check_legal_html compensate for non-impacting changes [puppet] - 10https://gerrit.wikimedia.org/r/255163 (https://phabricator.wikimedia.org/T119456) [19:56:08] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5000000.0] [19:57:45] (03CR) 10Rush: [C: 032] check_legal_html compensate for non-impacting changes [puppet] - 10https://gerrit.wikimedia.org/r/255163 (https://phabricator.wikimedia.org/T119456) (owner: 10Rush) [19:58:59] (03PS2) 10Aaron Schulz: Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 [20:00:53] (03PS8) 10BBlack: misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) [20:01:19] !log disabling puppet on neon to avoid monitoring race-condition spam [20:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:40] (03PS1) 10Rush: phab: add IPv6 VCS address [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) [20:02:10] (03PS2) 10Rush: phab: add IPv6 VCS address [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) [20:02:47] (03CR) 10BBlack: [C: 032] misc-cluster 2layer refactor, step 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/255109 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [20:02:56] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 18.18% of data above the critical threshold [100000000.0] [20:06:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000000.0] [20:06:45] PROBLEM - SSH on seaborgium is CRITICAL: Server answer 
[20:08:07] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [20:08:16] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: puppet fail [20:08:26] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [20:08:35] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [20:08:49] (03PS1) 10BBlack: post-merge fixups for 377a829d2 [puppet] - 10https://gerrit.wikimedia.org/r/255166 [20:08:55] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:17] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [20:09:22] (03CR) 10BBlack: [C: 032 V: 032] post-merge fixups for 377a829d2 [puppet] - 10https://gerrit.wikimedia.org/r/255166 (owner: 10BBlack) [20:10:16] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [20:10:32] (03PS3) 10Rush: phab: add IPv6 VCS real server IP [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) [20:10:46] the cpXXXX puppetfails are mine, not critical [20:11:27] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [20:11:43] bblack: kk btw you may catch a bit of my changes on puppet renable on neon no worries I"ll follow up later [20:11:54] (03PS1) 10BBlack: post-merge fixups for 377a829d2 [puppet] - 10https://gerrit.wikimedia.org/r/255167 [20:11:58] ok [20:12:05] (03PS4) 10Rush: phab: add IPv6 VCS real server IP [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) [20:12:19] (03CR) 10BBlack: [C: 032 V: 032] post-merge fixups for 377a829d2 [puppet] - 10https://gerrit.wikimedia.org/r/255167 (owner: 10BBlack) [20:12:25] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: puppet fail [20:12:37] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [20:13:23] (03PS8) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 
10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [20:14:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [20:14:27] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:58] (03PS5) 10Rush: phab: add IPv6 VCS real server IP [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) [20:15:36] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures [20:16:33] (03CR) 10Rush: [C: 032] phab: add IPv6 VCS real server IP [puppet] - 10https://gerrit.wikimedia.org/r/255164 (https://phabricator.wikimedia.org/T100519) (owner: 10Rush) [20:16:57] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 2 failures [20:18:38] (03PS9) 10BBlack: misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) [20:18:46] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:18:48] (03CR) 10BBlack: [C: 032 V: 032] misc-cluster 2layer refactor, step 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/255110 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [20:20:05] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [20:20:47] (03PS1) 10Rush: phab: VCS ssh listen on IPV6 [puppet] - 10https://gerrit.wikimedia.org/r/255168 (https://phabricator.wikimedia.org/T100519) [20:20:56] (03PS2) 10Rush: phab: VCS ssh listen on IPV6 [puppet] - 10https://gerrit.wikimedia.org/r/255168 (https://phabricator.wikimedia.org/T100519) [20:21:45] PROBLEM - SSH on seaborgium is CRITICAL: Server answer [20:21:58] (03PS1) 10BBlack: post-merge fixup for bf51de02 [puppet] - 10https://gerrit.wikimedia.org/r/255169 [20:22:11] (03CR) 10BBlack: [C: 032 V: 032] post-merge fixup for bf51de02 [puppet] - 10https://gerrit.wikimedia.org/r/255169 
(owner: 10BBlack) [20:22:16] (03CR) 10Rush: [C: 032] phab: VCS ssh listen on IPV6 [puppet] - 10https://gerrit.wikimedia.org/r/255168 (https://phabricator.wikimedia.org/T100519) (owner: 10Rush) [20:22:32] (03PS3) 10Rush: phab: VCS ssh listen on IPV6 [puppet] - 10https://gerrit.wikimedia.org/r/255168 (https://phabricator.wikimedia.org/T100519) [20:22:33] seaborgium is a new server recently added by alex afair [20:22:38] looks [20:23:21] oh, right. ldap .. [20:23:31] (03CR) 10Rush: [V: 032] phab: VCS ssh listen on IPV6 [puppet] - 10https://gerrit.wikimedia.org/r/255168 (https://phabricator.wikimedia.org/T100519) (owner: 10Rush) [20:23:44] 1.25 post-merge fixup commits per commit, not bad! [20:24:39] per the comment but as apergos points out, just has standard and base::firewall so far [20:25:06] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:28:05] (03PS1) 10BBlack: cp3019-22 are not actually old ssd config [puppet] - 10https://gerrit.wikimedia.org/r/255171 [20:28:28] (03CR) 10BBlack: [C: 032 V: 032] cp3019-22 are not actually old ssd config [puppet] - 10https://gerrit.wikimedia.org/r/255171 (owner: 10BBlack) [20:31:29] !log gnt-instance reboot seaborgium.wikimedia.org [20:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:05] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [20:33:56] RECOVERY - salt-minion processes on seaborgium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:34:06] RECOVERY - DPKG on seaborgium is OK: All packages OK [20:34:25] RECOVERY - Disk space on seaborgium is OK: DISK OK [20:34:56] RECOVERY - RAID on seaborgium is OK: OK: no RAID installed [20:34:57] RECOVERY - dhclient process on seaborgium is OK: PROCS OK: 0 processes with command name dhclient [20:34:57] RECOVERY - configured eth on seaborgium is OK: OK - interfaces up [20:36:35] RECOVERY - Check size 
of conntrack table on seaborgium is OK: OK: nf_conntrack is 0 % full [20:36:40] ebernhardson: did we ever merge access via the nginx proxy? [20:36:42] I think not [20:36:45] * YuviPanda does it [20:36:47] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:37:12] (03PS1) 10Rush: phab: add git-ssh IPv6 LVS [puppet] - 10https://gerrit.wikimedia.org/r/255173 (https://phabricator.wikimedia.org/T100519) [20:37:25] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:32] (03PS2) 10Rush: phab: add git-ssh IPv6 LVS [puppet] - 10https://gerrit.wikimedia.org/r/255173 (https://phabricator.wikimedia.org/T100519) [20:37:41] (03PS6) 10Yuvipanda: elasticsearch: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [20:37:55] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [20:38:20] bblack: is this all this takes? https://gerrit.wikimedia.org/r/#/c/255173/ seems right but not sure [20:38:46] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:38:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [20:38:55] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:40:25] nobelium failure is me [20:40:35] RECOVERY - Disk space on restbase1007 is OK: DISK OK [20:40:36] RECOVERY - Ensure legal html en.wp on en.wikipedia.org is OK: all html is present. 
[20:41:25] YuviPanda: not yet [20:41:29] yea :) [20:41:43] hmm [20:41:45] Error: /Stage[main]/Elasticsearch::Proxy/Nginx::Site[elasticsearch-proxy]/File[/etc/nginx/sites-available/elasticsearch-proxy]: Could not evaluate: Could not retrieve information from environment production source(s) file:///modules/elasticsearch/labs-es-proxy.nginx.conf [20:41:46] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:41:46] RECOVERY - Ensure legal html en.wb on en.wikibooks.org is OK: all html is present. [20:41:47] that's strange [20:41:56] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: all html is present. [20:42:16] hahah [20:42:16] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:42:18] + source => 'file:///modules/elasticsearch/labs-es-proxy.nginx.conf' [20:42:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5000000.0] [20:42:57] i see it in modiles/elastisearch/files/labs-es-proxy.nginx.conf ... 
but aparently thats not right [20:43:00] (03PS1) 10Yuvipanda: elasticsearch: Fixup I9c2e50ec2fd [puppet] - 10https://gerrit.wikimedia.org/r/255174 [20:43:03] ebernhardson: ^ for fix [20:43:17] oh lol :) [20:43:22] 6operations, 7Monitoring: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456#1829580 (10chasemp) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/255163/ [20:43:34] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Fixup I9c2e50ec2fd [puppet] - 10https://gerrit.wikimedia.org/r/255174 (owner: 10Yuvipanda) [20:44:15] 7Blocked-on-Operations, 6operations, 7Availability, 5Patch-For-Review, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1829583 (10ori) [20:44:17] 6operations, 7Availability, 5Patch-For-Review, 7Performance: Upstart support for redis::instance - https://phabricator.wikimedia.org/T118704#1829582 (10ori) 5Open>3Resolved [20:44:50] (03PS1) 10Ori.livneh: Migrate rdb1008 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255175 (https://phabricator.wikimedia.org/T100714) [20:45:35] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:46:20] ebernhardson: hmm [20:46:23] curl localhost/_cat/plugins [20:46:25] [20:46:27] is 404 [20:46:44] 7Blocked-on-Operations, 6operations, 7Availability, 5Patch-For-Review, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1829594 (10ori) 5Open>3Resolved [20:47:15] (03PS2) 10Ori.livneh: Migrate rdb1008 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255175 (https://phabricator.wikimedia.org/T100714) [20:47:34] (03CR) 10Ori.livneh: [C: 032 V: 032] Migrate rdb1008 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255175 (https://phabricator.wikimedia.org/T100714) (owner: 10Ori.livneh) [20:48:01] ebernhardson: 
something up with the nginx proxy config I guess [20:48:33] ebernhardson: lololol [20:48:37] Server: Apache [20:48:43] it's running apache because mediawiki! [20:48:51] YuviPanda: we can drop mediawiki now [20:48:56] ebernhardson: yeah [20:49:37] (03PS1) 10Yuvipanda: Drop mediawiki from nobelium [puppet] - 10https://gerrit.wikimedia.org/r/255177 [20:49:54] ebernhardson: curl nobelium.eqiad.wmnet/_cat/plugins [20:49:56] whee [20:50:05] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1829601 (10GWicke) Some notes: We should be able to get the data on 1007 down to ~950G by waiting for the right moment in the compaction cycle, a... [20:50:16] (03PS2) 10Yuvipanda: Drop mediawiki from nobelium [puppet] - 10https://gerrit.wikimedia.org/r/255177 [20:50:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [20:51:53] (03CR) 10Yuvipanda: [C: 032] Drop mediawiki from nobelium [puppet] - 10https://gerrit.wikimedia.org/r/255177 (owner: 10Yuvipanda) [20:52:06] YuviPanda: and post/put looks apropriatly filtered [20:52:23] ebernhardson: \o/ [20:52:35] ebernhardson: and I guess there's no remaining GET based security exploits :) [20:52:51] YuviPanda: i turned on nlwiki, frwiki and eswiki this morning [20:52:57] \o/ [20:52:59] still good? [20:53:35] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: puppet fail [20:53:42] YuviPanda: looks like it [20:54:08] although i do wonder...its permenantly merging things [20:54:23] ebernhardson: cool. let's turn on a bunch of wikitionary and wikisource ones too? 
[20:54:26] what do you mean by merging [20:54:58] YuviPanda: https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-merge.html [20:55:25] i will have to look into if it's getting behind, and if we can adjust some settings [20:55:26] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.001930952072 secs [20:55:42] ebernhardson: ok! is there a way for us to track how much 'lag' there is? [20:56:36] (03PS1) 10Ori.livneh: Fix-up for I86cbeff385 [puppet] - 10https://gerrit.wikimedia.org/r/255180 [20:57:18] YuviPanda: not sure about lag, i'm looking at servers.nobelium.elasticsearch.indices.merges.current_size_in_bytes in graphite to see how much it's doing [20:57:25] (03PS2) 10Ori.livneh: Fix-up for I86cbeff385 [puppet] - 10https://gerrit.wikimedia.org/r/255180 [20:57:34] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I86cbeff385 [puppet] - 10https://gerrit.wikimedia.org/r/255180 (owner: 10Ori.livneh) [20:59:06] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:59:11] YuviPanda: looks like we should monitor the logs for "Elasticsearch will log INFO-level messages stating now throttling indexing when it detects merging falling behind indexing." 
[20:59:39] ebernhardson: we can use logster to pump that to graphite [21:02:46] PROBLEM - Redis on rdb1008 is CRITICAL: Connection refused [21:03:00] YuviPanda: ebernhardson: we already have metrics sent from logstash to graphite [21:03:32] YuviPanda: ebernhardson: counts of some MediaWiki errors from logstash : https://grafana.wikimedia.org/dashboard/db/production-logging [21:04:38] don't push it to graphite; grafana is working on an elasticsearch backend [21:04:43] unless you need it urgently [21:04:43] 6operations: sort of SLA clarification grafana.wikimedia.org - https://phabricator.wikimedia.org/T119558#1829640 (10JanZerebecki) 3NEW [21:05:01] hashar: this is an elasticsearch instance using log4j [21:05:06] currently just logging to disk on the server [21:05:36] ebernhardson: ah our forget me so :-D [21:10:31] (03PS1) 10BBlack: add cache_misc monitor groups for other DCs [puppet] - 10https://gerrit.wikimedia.org/r/255181 [21:10:47] (03PS2) 10BBlack: add cache_misc monitor groups for other DCs [puppet] - 10https://gerrit.wikimedia.org/r/255181 [21:10:53] (03CR) 10BBlack: [C: 032 V: 032] add cache_misc monitor groups for other DCs [puppet] - 10https://gerrit.wikimedia.org/r/255181 (owner: 10BBlack) [21:12:35] (03CR) 10Krinkle: [C: 032] Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [21:12:54] (03Merged) 10jenkins-bot: Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [21:14:27] YuviPanda: and...it's throttling/unthrottling pretty regularly [21:15:17] there are a few settings to tweak though, will check the docs [21:15:45] Nemo_bis: true @ "no license". i mailed them if they had anything specific in mind about licensing [21:16:37] Nemo_bis: do you think it should be cc-by-sa ? 
[21:17:51] operations, Analytics-Kanban, Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1829687 (Milimetric) @jcrespo, thanks very much for the physical report. As far as team Analytics is concerned, we only need enough data on m4-master to facilitate backfilling. So, if w... [21:18:22] PROBLEM - IPsec on cp1056 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3019_v4, cp3019_v6, cp3020_v4, cp3020_v6, cp3021_v4, cp3021_v6, cp3022_v4, cp3022_v6 [21:19:51] PROBLEM - IPsec on cp1070 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3019_v4, cp3019_v6, cp3020_v4, cp3020_v6, cp3021_v4, cp3021_v6, cp3022_v4, cp3022_v6 [21:20:22] PROBLEM - IPsec on cp1069 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3019_v4, cp3019_v6, cp3020_v4, cp3020_v6, cp3021_v4, cp3021_v6, cp3022_v4, cp3022_v6 [21:22:12] PROBLEM - IPsec on cp1057 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3019_v4, cp3019_v6, cp3020_v4, cp3020_v6, cp3021_v4, cp3021_v6, cp3022_v4, cp3022_v6 [21:22:42] mutante: dunno, any free license would be fine i guess [21:24:45] operations, Gitblit: Update gitblit to 1.7.1 - https://phabricator.wikimedia.org/T119409#1829709 (hashar) gitblit got upgraded?... Anyway my point stand, we don't maintain it anymore. Hold your breath a bit more and Diffusion will come in. Meanwhile most people are using the GitHub mirrors. [21:25:21] Nemo_bis: right, we shall see, i wonder what they think about derivative works when it comes to the report [21:26:30] operations, Continuous-Integration-Scaling, Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1829711 (hashar) Service implementation is pretty much completed and has been running in prod for a few days now. What is left remaining are the ac... 
[21:28:39] (CR) Dzahn: Fix wikidata redirect that come in via https to target https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [21:29:58] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [21:31:09] Ops-Access-Requests, operations, Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1829716 (matmarex) Full name: Bartosz Dziewoński Wikitech page: https://wikitech.wikimedia.org/wiki/User:Bartosz_Dziewoński Labs username: Bart... [21:33:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [21:36:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:37:46] RECOVERY - IPsec on cp1070 is OK: Strongswan OK - 24 ESP OK [21:37:55] RECOVERY - IPsec on cp1069 is OK: Strongswan OK - 24 ESP OK [21:37:56] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:38:16] RECOVERY - IPsec on cp1057 is OK: Strongswan OK - 24 ESP OK [21:38:32] Ops-Access-Requests, operations, Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1829731 (hashar) The historical `zuul-merger` instance is on gallium a Precise system. There we have sudo access as user `zuul` via: ALL = (zuul)... 
[21:38:35] RECOVERY - IPsec on cp1056 is OK: Strongswan OK - 24 ESP OK [21:39:03] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: I80d38773057 (duration: 00m 28s) [21:39:06] (03CR) 10Hashar: "Being discussed on T119526" [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [21:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:39:14] (03CR) 10Hashar: [C: 04-1] contint: grant zuul-merger sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) (owner: 10Hashar) [21:40:05] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:42:38] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1829751 (10Yann) Bug again reported here: https://commons.wikimedia.org/wiki/Commons:Bistro#File:La_derni.C3.A8re_charrette_de_... [21:42:44] ebernhardson: so do you think we can have mwgrep work from labs now? [21:43:14] YuviPanda: We'd have to modify it to exclude private wikis etc [21:43:25] /prevent searching them [21:43:37] Reedy: no this will hit the labs ES replica that we're testing now [21:43:44] Reedy: so that only has very few wikis [21:45:18] (03PS3) 10Dzahn: releases: enable strict transport security [puppet] - 10https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T118787) [21:45:39] YuviPanda: in theory, yes. not sure what query performance will look like. but we can find out :) [21:45:47] what indexes does mwgrep hit? 
[21:45:56] ebernhardson: I've no idea :D let me look [21:46:07] https://github.com/wikimedia/operations-puppet/blob/84438b04c8b13d33e95c4717d86551f44149d695/modules/scap/files/mwgrep [21:46:12] ebernhardson: so kaldari wanted something like mwgrep, so I guess we can take it and put it on labs and see what happens [21:46:20] yea [21:46:27] YuviPanda: It'll need extracting from scap [21:46:34] BASE_URI = 'http://search.svc.eqiad.wmnet:9200/_all/page/_search' [21:46:36] And parameterising [21:46:52] kaldari: I do want to mention that this is on temporary hardware atm, and we need to produce enough use cases and enough people asking for it to find budget from somewhere :) [21:46:54] private_wikis = open('/srv/mediawiki/dblists/private.dblist').read().splitlines() [21:46:56] _all, thats a bit scary [21:46:56] And dealing with [21:47:17] ebernhardson: do we have a _all? [21:47:28] YuviPanda: its a fake thing inside elasticsearch that hits everything [21:47:30] Reedy: there's a stand alone mwgrep script somewhere [21:47:46] Doesn't that just hit the public search api? [21:48:07] ebernhardson: ah. is it scary from security perspective or from a 'oh shit this might bring everything down' perspective? [21:48:08] Reedy: Don't think so [21:48:09] I guess latter [21:48:25] mutante: image attribution looked ok, the rest was boilerplate text still i think; not sure about code [21:48:29] YuviPanda: oh, just from the 'it has to query 14k segements' perspective :) [21:48:41] i'm not sure how long querying 14k segements will take on a spinning disk thats already doing 400 iops of writes [21:48:45] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1829770 (10Krenair) The labs username stuff needed is just your shell name (uid, not cn which is used for wikitech usernames). Which you've given... 
[21:48:45] but you can try :) [21:49:02] ebernhardson: +1 I wanna try and see if it breaks :) [21:49:17] more like 'when' it breaks I guess :D [21:49:20] :P [21:49:30] we could maybe turn off some updates and see how it goes after it breaks [21:49:31] security matters too if private wikis are in the queried index [21:49:37] Nemo_bis: they aren't [21:49:41] ok [21:49:52] we've only got enwiki, wikidata, commons, dewiki, frwiki, nlwiki and eswiki [21:49:53] Reedy: You're right that it uses http://search.svc.eqiad.wmnet:9200/_all/page/_search [21:50:22] YuviPanda, ebernhardson, Reedy: https://git.wikimedia.org/blob/operations%2Fpuppet/5b7895dcd5b49b385f97e99438acf837f6a1a1d8/files%2Fmisc%2Fscripts%2Fmwgrep [21:50:35] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [21:50:44] Nemo_bis: gotcha, ok. thanks [21:50:57] Why do we have 2 scripts called the same thing that are completely different? [21:51:02] kaldari: ok so technically just taking it and replacing teh URL should work [21:51:13] Or, more than trivially different [21:51:24] to replace search.svc.eqiad.wmnet:9200 with nobelium.eqiad.wmnet [21:51:26] let me try it [21:52:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [21:53:10] YuviPanda: also remember we imported all the indices (but not the ones in private.dblist), so this will be querying everyhting [21:53:16] all the content indices i mean [21:53:27] ah ofcourse [21:53:29] right [21:53:30] just only the ones we turned on recently are getting writes [21:53:41] ebernhardson: I wonder if one of the things we can do is to not do writes but just do like daily imports [21:54:00] we could try importing the weekly dumps, but who knows how long that takes [21:54:05] just creating the weekly dumps takes 48 hours [21:54:21] eugh [21:54:23] so probably not [21:54:28] kaldari: Reedy how do I use mwgrep?! 
[21:54:28] https://github.com/wikimedia/operations-puppet/tree/production/files/misc/scripts [21:54:32] https://git.wikimedia.org/tree/operations%2Fpuppet/5b7895dcd5b49b385f97e99438acf837f6a1a1d8/files%2Fmisc%2Fscripts [21:54:36] YuviPanda: mwgrep foobar [21:54:36] (03CR) 10Dzahn: [C: 032] "http->https has been enforced for a bit now and there have been no complaints, so going ahead like we do for other services" [puppet] - 10https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T118787) (owner: 10Dzahn) [21:54:44] Reedy: hmm that doesn't seem to work [21:54:46] gives me a 403 [21:55:06] On production? [21:55:15] Does the _all thing exist on nobellium? [21:55:28] its an internal elasticsearch thing, always exists [21:55:38] wait what [21:55:40] this code is stupid [21:55:47] http://nobelium.eqiad.wmnet/_all/page/_search?timeout=30s [21:55:52] oh [21:55:54] no it isn't [21:55:56] nevermind me [21:56:08] http://nobelium.eqiad.wmnet/_all/page/_search?timeout=30s [21:56:17] err [21:56:19] 0 req = urllib2.urlopen(uri, json.dumps(search)) [21:56:21] I dunno what that does [21:56:23] if it's a POST [21:56:25] that won't work [21:56:26] Why is files/misc/scripts different on github/gitblit? [21:56:34] lol, it's probably a POST. hmm [21:56:43] Reedy: lol gitblit? [21:56:48] reedy@tin:~$ mwgrep A [21:56:49] ## Public wiki results [21:56:49] abwiki MediaWiki:Collapserefs.js [21:56:49] abwiki MediaWiki:Common.css [21:56:51] we could unblock POST to */_search [21:56:53] YuviPanda: kaldari linked it :P [21:56:59] heh [21:57:03] ebernhardson: hmm, is that safe? [21:57:13] (if so let's do it!0 [21:57:15] ) [21:57:16] Yeah, wtf [21:57:21] YuviPanda: probably, _search is a read only api [21:57:26] https://git.wikimedia.org/tree/operations%2Fpuppet/5b7895dcd5b49b385f97e99438acf837f6a1a1d8/files%2Fmisc%2Fscripts [21:57:32] That's really out of date [21:57:40] ebernhardson: ok, yeah let's do that then. [21:57:56] ebernhardson: can you make a patch or shall I? 
[21:57:59] (PS1) Rush: labtest: allow lookup by realm [puppet] - https://gerrit.wikimedia.org/r/255189 [21:58:14] YuviPanda: i'm in meeting for a bit, i can in a few though [21:58:21] (CR) jenkins-bot: [V: -1] labtest: allow lookup by realm [puppet] - https://gerrit.wikimedia.org/r/255189 (owner: Rush) [21:58:41] ebernhardson: kk I'll take a look for a bit [21:59:02] YuviPanda: can also just run from nobelium itself for a quick 'is it insane' check :) [21:59:05] point to 9200 [21:59:17] true [21:59:32] (PS2) Rush: labtest: allow lookup by realm [puppet] - https://gerrit.wikimedia.org/r/255189 [21:59:55] kaldari: Yeah, the file doesn't exist there at HEAD of production :) [22:00:04] operations, HTTPS, Patch-For-Review: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1829810 (Dzahn) Also merged that part and added the STS headers now that about a week went by without complaints. [22:00:21] https://git.wikimedia.org/tree/operations%2Fpuppet/0ac1d89/files%2Fmisc%2Fscripts#foo [22:00:25] operations, HTTPS: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1829811 (Dzahn) [22:00:30] operations, HTTPS: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1829812 (Dzahn) Open>Resolved [22:00:54] thanks mutante [22:01:02] YuviPanda: :) yw [22:01:29] https://git.wikimedia.org/tree/operations%2Fpuppet/production/files%2Fmisc%2Fscripts [22:02:58] Reedy: anything we can delete from there,, yes :) [22:03:12] in general things in /files/ and /misc/ [22:03:23] or if they can be moved into a module [22:03:43] mutante: I hope to have emptied out manifests/role this week :D [22:03:44] i remember looking at the "pcntl" one before [22:04:05] YuviPanda: ooh! [22:04:16] i didnt know we were even close [22:04:34] you mean moving it all into module/role/ ? [22:04:43] mutante: we weren't! 
:D but I think except for 2 files everything is autolayout compatible there now [22:04:52] or has patches I'll merge later that'll make it compatible [22:04:55] yes mutante [22:05:15] YuviPanda: oooh, interesting..nice [22:05:26] mutante: but when manifests/role moves into modules/role, a lot of thins in file/ and templates/ can also move [22:05:40] i like how that fixes all the "not in autoload layout" lint [22:05:51] I think we'll be left with just site.pp and monitoring_groups.pp [22:06:07] cool [22:06:16] https://gerrit.wikimedia.org/r/#/c/255080/ for the monitoring groups patch [22:06:25] mutante: misc/ is still there, but I intend to move limn out of it next week [22:07:37] YuviPanda: yes, i removed as much as possible. the monitoring.pp i tried to move before but i think abandoned again after gerrit discussion [22:07:56] mutante: oh, do you have a link? [22:07:57] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [22:08:03] it looks like it should be in role ganglia or something [22:08:12] hmmm [22:08:29] yeah, eventually probably. 
this is an intermediary step though - since you can't have them in modules/role/ - they have to be directly imported [22:08:40] moving it out of monitoring_groups is for someone else maybe :D [22:09:03] bd808: do you have objections to https://gerrit.wikimedia.org/r/#/c/255073/1 [22:09:14] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1829862 (10chasemp) 5Open>3stalled [22:09:18] ebernhardson: looks like I can't get to it, would appreciate if you can put up a patch after your meeting [22:09:55] * bd808 looks [22:09:58] YuviPanda: https://gerrit.wikimedia.org/r/#/c/249345/ [22:09:58] (03CR) 10Rush: [C: 032] labtest: allow lookup by realm [puppet] - 10https://gerrit.wikimedia.org/r/255189 (owner: 10Rush) [22:10:22] YuviPanda: _joe said "I think we should remove this instead of moving it around." hmm [22:10:43] so my attempt then was mediawiki module with the maintenance stuff [22:10:49] mutante: oh, the stuff I'm doing isn't related to monitoring.pp [22:11:05] this was just a response to misc/monitoring.pp still existing [22:11:07] it's https://gerrit.wikimedia.org/r/#/c/255080/ [22:11:10] yeah right [22:11:11] (03CR) 10BryanDavis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/255073 (owner: 10Yuvipanda) [22:11:13] since you asked about a link [22:11:14] I agree that needs some work [22:11:28] I'll read through comments on it before I touch it, mutante [22:11:31] yea, 2 separate things [22:11:38] thanks [22:13:45] (03PS2) 10Yuvipanda: archiva: Move nginx proxy template into module [puppet] - 10https://gerrit.wikimedia.org/r/255075 [22:13:48] (03PS2) 10Yuvipanda: role: Move quarry to use autolayout [puppet] - 10https://gerrit.wikimedia.org/r/255082 [22:13:49] (03PS2) 10Yuvipanda: k8s: Remove standalone role [puppet] - 10https://gerrit.wikimedia.org/r/255081 [22:13:51] (03PS2) 10Yuvipanda: Move all monitoring groups to one file [puppet] - 
https://gerrit.wikimedia.org/r/255080 [22:15:53] (PS3) Yuvipanda: role: Move quarry to use autolayout [puppet] - https://gerrit.wikimedia.org/r/255082 [22:15:55] (PS3) Yuvipanda: k8s: Remove standalone role [puppet] - https://gerrit.wikimedia.org/r/255081 [22:15:57] (PS3) Yuvipanda: Move all monitoring groups to one file [puppet] - https://gerrit.wikimedia.org/r/255080 [22:16:02] mutante: I guess the @monitoring_groups stuff will be all realized on neon? [22:16:25] (CR) Yuvipanda: [C: 2 V: 2] archiva: Move nginx proxy template into module [puppet] - https://gerrit.wikimedia.org/r/255075 (owner: Yuvipanda) [22:17:37] (CR) jenkins-bot: [V: -1] role: Move quarry to use autolayout [puppet] - https://gerrit.wikimedia.org/r/255082 (owner: Yuvipanda) [22:17:44] (CR) jenkins-bot: [V: -1] k8s: Remove standalone role [puppet] - https://gerrit.wikimedia.org/r/255081 (owner: Yuvipanda) [22:17:45] YuviPanda: @monitoring::group yes [22:17:50] yea, that's a neon thing [22:17:51] Blocked-on-Operations, Flow, Collaboration-Team-Current, Patch-For-Review, WorkType-Maintenance: Migrate Flow content to new separate logical External Store - https://phabricator.wikimedia.org/T106363#1829904 (Mattflaschen) The script will only be run in production after careful testing elsewhere... [22:21:54] thanks bd808 [22:22:11] (CR) Yuvipanda: [C: 2] Move all monitoring groups to one file [puppet] - https://gerrit.wikimedia.org/r/255080 (owner: Yuvipanda) [22:22:18] operations, Traffic, Patch-For-Review: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1829922 (BBlack) The misc cluster is now in a 2-layer multi-DC configuration (including ipsec), but with a few missing pieces to go for full functionality: 1) Defining dynamic_directors s... [22:22:37] YuviPanda: Don't accidentally forget the associated ldap cleanup. 
:) [22:22:48] bd808: yeah, have been doing those for a few days now [22:22:58] bd808: the old mediawiki singlenode is now role::deprecated::mediawiki::install [22:23:06] nice [22:23:13] bd808: most of those nodes already fail puppet though :( [22:23:16] so not much that did to help [22:23:19] that poor role never worked any time I tried it [22:23:42] it was before my time I guess [22:23:45] I've never had it work either [22:23:57] I think I might've tried it once and gotten frustrated and wrote labs-vagrant [22:24:00] not sure [22:24:04] I found an old vm of mine today that is still using labs-vagrant on 12.04 [22:24:24] heh wow [22:24:30] Niharika or I will be building a replacement very soon for it [22:24:40] wheee [22:24:46] I plan on doing a VM audit [22:24:49] like the NFS audit [22:25:03] I feel there will be a bunch of VMs that people forgot existed [22:25:15] * bd808 hides his over sized ad underused VMs [22:25:18] this isn't a 'give back your VMs!' but more of a 'hey do you still need this? do cleanup if you do not, thank you' [22:25:20] * ebernhardson deleted some several year old VM's when taking over the `search` project [22:25:30] they were used for lsearchd :) [22:25:35] bd808: I don't think that's a problem since if you don't use things it doesn't get allocated, etc. [22:25:43] but I bet tehre are several 'oh I did not know we had it!' [22:25:56] ebernhardson: yeah this is why I hate generic and team-name vms [22:26:06] the 'editor engagement' project is :( [22:26:07] yeah. I killed a bunch of older vms in the core-team project this summer [22:26:35] there's a lot of VMs in that project that no person is responsible for in a way they can say 'we use it' nor 'we do not' [22:26:56] i wonder if a VM should report something like "nobody ever logged in to me in the last X days" [22:27:07] mutante: it does now. I've been collecting that info for a week now [22:27:12] aha [22:27:16] it's in graphite. 
[22:27:24] although I can write a script that reports the output of 'last' too [22:27:26] there are some machines i use but don't log directly into unless they break [22:27:33] luckily they broke in the last week :) [22:27:39] ebernhardson: yeah, so that's why it's gotta be a manual audit I guess [22:27:45] all of these are just useful heuristics [22:28:03] yup [22:28:16] (03CR) 10Yuvipanda: [C: 032] k8s: Remove standalone role [puppet] - 10https://gerrit.wikimedia.org/r/255081 (owner: 10Yuvipanda) [22:28:17] if there are unattended upgrades that might change how much a login is needed [22:28:43] we do have unattended security upgrades for labs [22:29:09] the login is more like 'if someone logged in into this in X time period, we need not even count it in a manual audit as a thing to be looked at' [22:29:27] (03PS1) 10Rush: labtest: bump up lookup priority [puppet] - 10https://gerrit.wikimedia.org/r/255264 [22:29:38] (03PS2) 10Rush: labtest: bump up lookup priority [puppet] - 10https://gerrit.wikimedia.org/r/255264 [22:30:10] What's tokipona.cdb [22:30:17] sync-file was giving me an error for a seemingly unrelated file [22:31:00] Krinkle: if the error was from mira (first sync step) then it is probably our mismatched uid problem [22:31:43] 6operations, 10Deployment-Systems: sync-file reports "Permission denied: /srv/mediawiki-staging/php-1.27.0-wmf.7/cache/l10n/l10n_cache-tokipona.cdb.tmp" - https://phabricator.wikimedia.org/T119573#1829966 (10Krinkle) 3NEW [22:31:50] the master-master sync step syncs the full /src/mediawiki-staging directory and it known to be broken [22:32:08] I just did sync-file wmf-config/foo [22:32:23] Can we just disable mira if it's not working? It's not being used, right? 
[22:32:52] 6operations, 10Deployment-Systems: sync-file reports "Permission denied: /srv/mediawiki-staging/php-1.27.0-wmf.7/cache/l10n/l10n_cache-tokipona.cdb.tmp" - https://phabricator.wikimedia.org/T119573#1829973 (10bd808) [22:32:54] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1829974 (10bd808) [22:33:00] Getting all kinds of random confusing mira errors for the pat few weeks when deploying. It's confusing and unproductive [22:33:00] i would like to see it being used and understand what is blocking that [22:33:01] (03PS2) 10Yuvipanda: labsvagrant: Move role to deprecated [puppet] - 10https://gerrit.wikimedia.org/r/255073 [22:33:08] #not-my-problem [22:33:11] afaik its the scap uid issue [22:33:13] * YuviPanda relegates labsvagrant to the deprecated pit [22:33:24] (03CR) 10Yuvipanda: [C: 032 V: 032] labsvagrant: Move role to deprecated [puppet] - 10https://gerrit.wikimedia.org/r/255073 (owner: 10Yuvipanda) [22:33:26] mutante: jsut this part now -- https://phabricator.wikimedia.org/T119165 [22:33:36] notmyproblem is nice.. [22:33:43] bd808: thanks [22:34:04] What I mean is, it's *obviously* not working now. Why dont' we revert whatever patch added it to the scap config until it is resolved. 
[22:34:07] PROBLEM - salt-minion processes on cp3022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:34:13] This error (which keeps mutating) is just wasting everyone's time [22:34:56] * bd808 defers to the releng team on that decision [22:34:58] (03PS3) 10Rush: labtest: bump up lookup priority [puppet] - 10https://gerrit.wikimedia.org/r/255264 [22:35:03] revert/comment out the sync-co-master thing [22:35:31] it would be easy enough to stub out [22:35:52] it all flows through one shared function in scap.main [22:36:09] (03CR) 10Rush: [C: 032] labtest: bump up lookup priority [puppet] - 10https://gerrit.wikimedia.org/r/255264 (owner: 10Rush) [22:36:39] YuviPanda: caught your labsvagrant change [22:36:41] Krinkle: understood, what bd808 said [22:36:41] cool to merge? [22:36:56] let me check if i can check the uid on mira thing though [22:36:56] Is anywhere using mira as a sync host? [22:36:58] (03PS1) 10Yuvipanda: labsvagrant: Do not require secondary disk [puppet] - 10https://gerrit.wikimedia.org/r/255265 [22:36:59] chasemp: yeah go on [22:37:03] s/check/fix [22:37:17] chasemp: so puppet-merge isn't needed for labs instances (they just autopull every minute) so I forget to merge now and then [22:37:30] (03PS2) 10Yuvipanda: labsvagrant: Do not require secondary disk [puppet] - 10https://gerrit.wikimedia.org/r/255265 [22:38:46] so, would it help if i changed the uid of that user [22:38:53] and then fix all the files on mira [22:38:55] using find [22:39:06] find -uid .. - exec chown .. 
bla [22:39:22] (CR) Yuvipanda: [C: 2] labsvagrant: Do not require secondary disk [puppet] - https://gerrit.wikimedia.org/r/255265 (owner: Yuvipanda) [22:39:27] had to do it before for messed up UIDs [22:41:18] operations, Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1829980 (Dzahn) Let's pick the UID that we have on tin and make it "reserved" by adding it here: https://wikitech.wikimedia.org/wiki/UID then i'd change the UID on mira and use f... [22:42:26] bd808: btw, did the LDAP change. 45 hosts! [22:42:31] (and verified too) [22:42:39] thanks for the review [22:42:47] i'll try to fix the mira uid thing after i get some food, bbiaw [22:52:02] I wonder what this daily cycle in NOTICES in apache is. https://grafana.wikimedia.org/dashboard/db/production-logging?panelId=11&fullscreen&from=1447800581605&to=1448405201605 [23:04:16] (PS1) EBernhardson: Allow POST to _search endpoints in labs elasticsearch replica [puppet] - https://gerrit.wikimedia.org/r/255271 [23:04:56] ebernhardson: can you add a comment explaining why POST is safe in this case? [23:06:09] bah, and suddenly I can't get to any host in the US.... maybe time to sleep... [23:06:37] (PS2) EBernhardson: Allow POST to _search endpoints in labs elasticsearch replica [puppet] - https://gerrit.wikimedia.org/r/255271 [23:06:59] (PS3) EBernhardson: Allow POST to _search endpoints in labs elasticsearch replica [puppet] - https://gerrit.wikimedia.org/r/255271 [23:07:12] YuviPanda: done [23:07:37] oh, i put it in the commit message :P sec i can put it in the nginx file too [23:07:45] Why would you post to _search?
[23:07:53] ostriches: apparently that's what mwgrep does to send its payload [23:07:53] (CR) Yuvipanda: [C: 2 V: 2] Allow POST to _search endpoints in labs elasticsearch replica [puppet] - https://gerrit.wikimedia.org/r/255271 (owner: EBernhardson) [23:08:04] ostriches: via whatever library it uses [23:08:24] What a silly thing to do :p [23:08:31] yea... [23:08:51] ebernhardson: whoops :D wanna make another patch? [23:09:12] ebernhardson: ostriches it has no library, just directly constructs a URL and curls [23:09:24] With -XPOST? [23:09:25] :P [23:09:38] fun [23:09:40] urllib2.HTTPError: HTTP Error 400: Bad Request [23:09:42] err [23:09:44] not actually curl [23:09:46] but urlopen [23:09:49] still [23:10:53] ebernhardson: where to look to see why it's a 400? [23:11:03] hmm [23:11:09] nothing in nginx [23:11:14] it's passing it through [23:11:16] (PS1) EBernhardson: Document why POST to _search is allowed [puppet] - https://gerrit.wikimedia.org/r/255274 [23:11:41] data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() [23:11:42] function takes a mapping or sequence of 2-tuples and returns a string in this format. urllib2 module sends HTTP/1.1 requests with Connection:close header included.
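A minimal sketch of the behaviour described in the urllib2 documentation quoted above: supplying a data argument is what turns the request into a POST. It assumes Python 3's urllib.request rather than the urllib2 module mentioned in the chat, the host name and query body are hypothetical placeholders (not taken from mwgrep), and nothing is sent over the network.

```python
import json
import urllib.request

# Hypothetical endpoint and query body, for illustration only.
url = "http://search.example.invalid:9200/_search"
query = {"query": {"match_all": {}}}

# Without a data argument the request method defaults to GET.
get_req = urllib.request.Request(url)

# With a bytes body attached, the same kind of request becomes a POST.
post_req = urllib.request.Request(
    url,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Inspect the method each request would use; no connection is opened.
print(get_req.get_method())
print(post_req.get_method())
```

This would explain why a proxy that only allows GET in front of _search rejects a client that always attaches a request body.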
[23:11:44] (CR) Yuvipanda: [C: 2 V: 2] Document why POST to _search is allowed [puppet] - https://gerrit.wikimedia.org/r/255274 (owner: EBernhardson) [23:12:13] YuviPanda: nobelium:/var/log/elasticsearch/labsearch.log [23:12:24] failed to parse source [23:12:30] which generally means invalid json :S [23:12:44] lol pages and pages of stacktrace on a tail -f [23:13:05] but the json is fine [23:13:18] yes, each index (1800 of them) gave its own stack trace :P [23:14:18] hmm, the request looks sane... [23:14:45] ebernhardson: I switched to latest version of mwgrep [23:14:49] and it hasn't crashed yet [23:14:52] let's see if it just times out [23:15:15] oh duh, yea it failed because it was attempting to use inline groovy [23:15:30] * ebernhardson bets on timeout [23:15:56] ebernhardson: I see a lot of stacktraces still floating past [23:16:07] (PS1) Jhobs: Increase survey coverage [mediawiki-config] - https://gerrit.wikimedia.org/r/255276 [23:16:26] ebernhardson: woo it worked [23:16:30] kaldari: ^ [23:16:50] but note you only have content pages, not the general indices [23:16:53] https://dpaste.de/HXiX [23:17:06] what do you mean by 'general indices'? [23:17:14] I guess this found all the things in Mediawiki: namespace [23:17:24] legoktm: Reedy ^ too [23:17:51] hmm, apparently i did copy the general indices here [23:17:53] ooh! [23:17:54] neat :D [23:18:00] heh some php programmer has touched mwgrep [23:18:03] tabs >_> [23:18:07] now labs just needs to have all the wikis [23:18:11] ebernhardson: so Mediawiki: namespace isn't present? [23:18:16] legoktm: i just need 50k for servers :P [23:18:33] legoktm: this is a temporary thing, so we need to load test and see how useful it is and then also scrounge budget [23:18:34] YuviPanda: it should be, i had thought i skipped them due to size concerns but they are there [23:18:48] ebernhardson: so are the updates coming in through?
[23:18:52] YuviPanda: yes [23:18:57] ah cool [23:19:11] * Vito has a question for YuviPanda [23:19:19] hi Vito [23:20:07] operations, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1830123 (Smalyshev) NEW [23:20:12] hi YuviPanda, I have a simple question: did you use abusefilters to get (in some way) badwords and heuristics for ORES? [23:20:46] Vito: ah, you should ask in #wikimedia-ai :) I've no idea [23:20:55] Can you buy a 100GB spinning disk now? ;) [23:21:08] Vito: ToAruShiroiNeko or halfa.k can answer [23:21:25] I was about to drop a line to DarTar [23:21:32] Vito: he wouldn't know probably [23:21:42] but well, I'll copypaste line above there then ;) [23:23:10] operations: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1830152 (Dzahn) NEW a:Dzahn [23:23:42] operations: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1830161 (Dzahn) p:Triage>Low [23:24:39] (PS1) Yuvipanda: scap: Fix mwgrep pep8 warnings [puppet] - https://gerrit.wikimedia.org/r/255279 [23:26:17] Ops-Access-Requests, operations, Patch-For-Review: Give contint-admins sudo rights to start/stop zuul-merger - https://phabricator.wikimedia.org/T119526#1830165 (Dzahn) >>! In T119526#1829731, @hashar wrote: > Note journalctl is problematic, it has a bunch of super useful options but accepting a wildca... [23:26:21] YuviPanda: Cool! Can I try it? [23:26:44] kaldari: sure! it's in /tmp/mwgrep on tools-login [23:28:22] YuviPanda: heh, i wonder if this lines up with your search :) http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1448407688.689&from=-3hours&target=servers.nobelium.loadavg.01 [23:28:40] ebernhardson: hahah [23:28:42] :D [23:28:44] we should run it in a loop :D [23:28:52] ebernhardson: let's see if kaldari's search also causes a spike [23:28:58] re: l10nupdate user.. so the uid on tin.. <1000 ...
on mira >12000 ..duh [23:29:10] since when are the UIDs at over 10k [23:29:35] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:29:51] fine fine icinga-wm [23:30:04] ebernhardson, YuviPanda: works great! [23:30:22] ebernhardson: lol, http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1448407688.689&from=-3hours&target=servers.nobelium.loadavg.01 has no more info?! [23:30:25] seems pretty fast too [23:30:26] did we kill diamond? [23:30:50] it has cutouts all the time [23:30:50] crap, the UID is already in use by trebuchet [23:30:51] kaldari: nice. I just want to re-stress that this is still a test host on hardware that'll go away in 6 weeks [23:31:07] YuviPanda: understood [23:31:25] kaldari: but if we can get enough use cases and support I guess we can scrounge around for budget :) [23:31:36] kaldari: so feel free to use it in the meantime [23:31:36] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [23:32:19] kaldari: and let us know what you end up using it for too, etc. [23:33:39] ok [23:34:01] operations, Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1830173 (Dzahn) ok, so we have defined that it is supposed to be: 10002/10002 the actual situation is: tin: 997/10002 mira: 12162/10002 so we have to fix both servers [23:34:43] (PS1) Yuvipanda: scap: Allow customizing search host in mwgrep [puppet] - https://gerrit.wikimedia.org/r/255282 [23:35:04] YuviPanda: Should that live under scap?
[23:35:09] Reedy: probably not [23:35:17] Reedy: but let's merge these now and then bikeshed :D [23:35:18] jouncebot: next [23:35:18] In 0 hour(s) and 24 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151125T0000) [23:35:20] "Thanks Reedy for creating more work for me" [23:35:29] ah, heh [23:35:58] resists changing stuff on tin then [23:36:26] starts with mira [23:36:53] (CR) BryanDavis: [C: -1] scap: Allow customizing search host in mwgrep (1 comment) [puppet] - https://gerrit.wikimedia.org/r/255282 (owner: Yuvipanda) [23:36:54] !log phabricator: restarted daemons, were complaining about out of sync config [23:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:25] YuviPanda: It's gotta be OVER 9000 [23:37:30] (PS2) Yuvipanda: scap: Allow customizing search host in mwgrep [puppet] - https://gerrit.wikimedia.org/r/255282 [23:37:32] fine fine :D [23:37:34] updated [23:38:44] (CR) BryanDavis: [C: 1] scap: Allow customizing search host in mwgrep [puppet] - https://gerrit.wikimedia.org/r/255282 (owner: Yuvipanda) [23:38:44] Technically it was to be [1-65535] :p [23:38:48] *has [23:39:38] well it would be inconvenient for the new defaults to not actually work in either environment the tool is headed for [23:40:00] bd808: indeed... [23:40:11] * YuviPanda spins up new cluster and sets port to 9000 [23:40:32] Actually, the way puppet is set up right now it can only be 9200, I think [23:40:48] (at least it can't run 9200 and a second process on 9201. We fixed that :p) [23:40:48] !log deployed patch for T119309 [23:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:55] on nobelium we proxy it through nginx [23:41:40] bd808: I also did a pep8 cleanup btw https://gerrit.wikimedia.org/r/#/c/255279/1 [23:42:36] darn MW spacey syntax creeping into python ;) [23:42:51] bd808: heh :D it also had tabs!
[23:43:05] !log fixing l10nupdate UID / file ownership on mira [23:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:55] (PS1) Chad: pep8 fixes for elasticsearch_monitoring.py [puppet] - https://gerrit.wikimedia.org/r/255288 [23:46:05] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [23:53:23] YuviPanda: looks like load did spike again, but not nearly as much on the second query around 23:30. probably fine. i'm not seeing though, what's the dns to talk to this from inside labs? [23:53:44] (PS1) Chad: pep8: fix up webperf python files [puppet] - https://gerrit.wikimedia.org/r/255295 [23:53:44] ebernhardson: just nobelium.eqiad.wmnet [23:53:55] Hehe, pep8 fixes are easy! [23:54:04] ebernhardson: I can add an alias if we want but maybe not to insist on the ephemeral nature of this [23:54:30] YuviPanda: ahh, ok
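A rough sketch of the "find -uid .. -exec chown .." approach mentioned earlier for the l10nupdate ownership fix, written in Python so it can be tried as a dry run first. The function names are hypothetical and this is not the procedure that was actually run on mira; the UID values from the task (for example 12162 to 10002) would be passed in by the caller, and changing ownership to a different UID normally requires root.

```python
import os
import tempfile


def paths_owned_by(root, uid):
    """Return root and every path beneath it whose owner is uid (symlinks not followed)."""
    matches = []
    if os.lstat(root).st_uid == uid:
        matches.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                matches.append(path)
    return matches


def rechown(root, old_uid, new_uid, dry_run=True):
    """Reassign paths owned by old_uid to new_uid and return the affected paths."""
    affected = paths_owned_by(root, old_uid)
    if not dry_run:
        for path in affected:
            # -1 leaves the group untouched; lchown acts on a symlink itself.
            os.lchown(path, new_uid, -1)
    return affected


def demo():
    """Dry run against a throwaway directory owned by the current user."""
    with tempfile.TemporaryDirectory() as scratch:
        os.mkdir(os.path.join(scratch, "l10n"))
        open(os.path.join(scratch, "l10n", "example.cdb"), "w").close()
        me = os.getuid()
        return len(rechown(scratch, me, me, dry_run=True))


print(demo(), "paths would be changed")
```

Running with dry_run=True first gives a list to review before anything is modified, which seems prudent when the target UID is already in use by another account.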