[00:00:05] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151119T0000).
[00:00:05] MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:12] Oh crap
[00:00:16] I meant to schedule a patch
[00:00:17] * RoanKattouw adds
[00:00:24] * MaxSem is here
[00:01:43] hey... I'm trying to create accounts on voteWiki and just got a very weird error... "Exception encountered, of type "LogicException""
[00:01:51] anyone have thoughts?
[00:01:56] I'll do the SWAT today
[00:02:12] Jamesofur: Was there a hexadecimal exception number in [brackets] ?
[00:02:16] oh I added mine to 1am instead of 20 instead of 19
[00:02:26] Dereckson: OK, move it up
[00:02:31] I'm still adding mine anyways :D
[00:02:54] RoanKattouw: nope sadly, goes to blank page with that error http://prntscr.com/94ckfj
[00:02:57] it's tricky the new presentation of the array with the 00:00 on the next day :/
[00:03:13] Right
[00:03:29] I am in SF time so I just to Ctrl+F '18 16' to find the one on the 18th ad 16:00
[00:03:32] *at
[00:04:08] In CET/CEST time I found that also unnatural as days tend to start at morning, after a sleep period, not at midnight.
[00:04:13] Jamesofur: OK, well if it was votewiki it should stand out in the logs anway
[00:04:19] (or 1am here)
[00:04:19] Lemme look
[00:05:23] Jamesofur: Looks like you tried twice, once at 16:00:30 PST and once at 16:02:07 PST
[00:05:30] yup
[00:05:31] correct
[00:05:41] (wanted to make sure it wasn't just a one time thing)
[00:05:42] I'll PM you with details
[00:05:53] (03CR) 10Aaron Schulz: [C: 031] redis: prohibit commands CONFIG, SLAVEOF and DEBUG by default [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh)
[00:09:20] (03CR) 10Catrope: [C: 032] Whitelist domains for server-side upload - Coding Da Vinci Hackathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253963 (https://phabricator.wikimedia.org/T118844) (owner: 10Dereckson)
[00:09:42] (03Merged) 10jenkins-bot: Whitelist domains for server-side upload - Coding Da Vinci Hackathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253963 (https://phabricator.wikimedia.org/T118844) (owner: 10Dereckson)
[00:09:58] (03CR) 10Catrope: [C: 032] Namespace configuration on es.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254072 (https://phabricator.wikimedia.org/T119006) (owner: 10Dereckson)
[00:10:21] (03Merged) 10jenkins-bot: Namespace configuration on es.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254072 (https://phabricator.wikimedia.org/T119006) (owner: 10Dereckson)
[00:10:57] (03CR) 10Catrope: [C: 032] Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254065 (owner: 10MaxSem)
[00:11:36] (03Merged) 10jenkins-bot: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254065 (owner: 10MaxSem)
[00:11:36] Testing 253963. 254072
[00:11:42] Not deployed yet, hold on
[00:11:59] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815640 (10Jdrewniak) ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDpY8H5wj5hwRngoX6xScBN9RTPvnpKoxq7lqXj7QjTXNy...
[00:12:00] (sorry, I've pressed enter too soon)
[00:12:23] !log catrope@tin Synchronized portals: SWAT (duration: 00m 19s)
[00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:31] MaxSem: ---^^
[00:12:51] thanks RoanKattouw, we'll test
[00:13:17] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T118844, T119006 (duration: 00m 19s)
[00:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:45] Dereckson: ---^^
[00:13:46] Testing 253963. Meanwhile, 254072 requires to run `mwscript namespaceDupes.php --wiki=enwiktionary --fix`.
[00:13:53] Yup, on it
[00:13:54] hold on
[00:13:56] eswiktionary
[00:14:23] Yeah, I got it right :)
[00:15:17] Grmbl, the script crashed with a DB error
[00:15:21] A database query error has occurred.
[00:15:22] Query: UPDATE `pagelinks` SET pl_namespace = '4',pl_title = 'Estructura' WHERE pl_namespace = '0' AND pl_title = 'WN:Estructura' AND pl_from = '909266'
[00:15:24] Function: NamespaceConflictChecker::checkLinkTable
[00:15:25] Error: 1062 Duplicate entry '909266-4-Estructura' for key 'pl_from' (10.64.16.27)
[00:15:31] Thanks MediaWiki :S
[00:15:50] Lemme try editing the page to fix the problem
[00:16:39] lol, it's the proposal itself on [[Wikcionario:Café/2015 09]] that's causing this
[00:17:02] (03CR) 10Alex Monk: [C: 04-1] "Isn't this domain set up for mail?" [dns] - 10https://gerrit.wikimedia.org/r/254057 (owner: 10Dzahn)
[00:18:00] T118844 works.
[00:19:14] OK, null editing that page worked
[00:19:17] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 29.17% of data above the critical threshold [100000000.0]
[00:19:20] And rerunning it fixed the rest
[00:19:27] !log Ran namespaceDupes.php on eswiktionary
[00:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[00:19:39] That script needs ON DUPLICATE KEY IGNORE
[00:20:55] Seems to have been reported at https://phabricator.wikimedia.org/T115824
[00:22:58] I have a patch
[00:27:28] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1815740 (10JAufrecht) I would like to join Project-Creators as a Team Practices Group Agile Coach who creates projects for teams doing work tr...
[00:29:12] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/Flow/: SWAT: better permissions errors for opt-in (duration: 00m 21s)
[00:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:07] RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on port 9042
[00:40:07] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[00:41:17] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 14241 bytes in 0.153 second response time
[00:41:54] hrmm...., and thats likely not planned
[00:41:59] uhm
[00:42:05] it seems to be working for me
[00:42:12] me as well
[00:42:30] that check seems possibly borked
[00:42:44] bad url there....
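[Editor's note] The namespaceDupes crash above can be reproduced in miniature. The following is a hedged sketch using sqlite3 as a stand-in for MySQL, with the table layout and values borrowed from the error quoted in the log; sqlite's `UPDATE OR IGNORE` plays the role of MySQL's `UPDATE IGNORE` (the chat's "ON DUPLICATE KEY IGNORE" is shorthand, not literal MySQL syntax):

```python
# Minimal reproduction of the namespaceDupes duplicate-key crash, using
# sqlite3 as a stand-in for MySQL. A plain UPDATE that moves a pagelinks
# row onto an already-existing (pl_from, pl_namespace, pl_title) tuple
# violates the unique key; the IGNORE variant skips the conflicting row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pagelinks (
    pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT,
    PRIMARY KEY (pl_from, pl_namespace, pl_title))""")
# Page 909266 links to both 'WN:Estructura' (ns 0) and 'Estructura' (ns 4),
# as on [[Wikcionario:Café/2015 09]] in the incident above.
conn.execute("INSERT INTO pagelinks VALUES (909266, 0, 'WN:Estructura')")
conn.execute("INSERT INTO pagelinks VALUES (909266, 4, 'Estructura')")

try:
    conn.execute("""UPDATE pagelinks SET pl_namespace = 4, pl_title = 'Estructura'
                    WHERE pl_namespace = 0 AND pl_title = 'WN:Estructura'""")
except sqlite3.IntegrityError as e:
    print("plain UPDATE fails:", e)  # the 1062 'Duplicate entry' analogue

# The tolerant variant (MySQL: UPDATE IGNORE) skips the conflicting row,
# leaving the stale ns-0 link behind for a later cleanup pass.
conn.execute("""UPDATE OR IGNORE pagelinks SET pl_namespace = 4, pl_title = 'Estructura'
                WHERE pl_namespace = 0 AND pl_title = 'WN:Estructura'""")
print(conn.execute("SELECT COUNT(*) FROM pagelinks").fetchone()[0])  # → 2
```

This matches what happened in the channel: the null edit deleted the stale link row, after which rerunning the script succeeded.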
[00:44:29] robh: it actually did detect a problem, that is, I accidentally made the 'homepage' not public, so it changed the html on the page that the check was looking for
[00:44:49] ahh
[00:44:50] but I've just fixed that
[00:45:07] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 23225 bytes in 0.165 second response time
[00:45:09] i was looking in git logs on who touched possibly the firewall rules
[00:45:10] heh
[00:45:11] yep there it is
[00:45:17] sorry about that
[00:45:29] btw we do have a real scheduled downtime in about 15 minutes ;)
[00:45:37] no worries, its a critical service so it pages the awake opsen ;]
[00:45:46] you have icinga ack permissions right?
[00:45:51] if not, i can put it into maint for ya
[00:45:53] yep, I got paged too
[00:46:07] I can put it in maintenance, I remember how to do it
[00:46:09] thanks
[00:46:25] no worries, best pages are the kind that someone else fixed (like this one ;)
[01:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151119T0100).
[01:01:09] ok gonna take phabricator offline shortly...
[01:30:25] !log finished phabricator upgrade
[01:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:43:46] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: puppet fail
[01:43:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0]
[01:57:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[01:58:41] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1816065 (10Cmjohnson) Customer Tracking #: The RMA R405131-1, from Case Number 2015-1118-0268, has been shipped. Your replacement part model number PWR-MX480-1200-AC-S is being shipped via UPS. The ca...
[01:58:47] twentyafterfour: any interesting goodies in this update?
[02:02:11] ori: https://secure.phabricator.com/w/changelog/
[02:02:21] * ori bookmarks
[02:02:22] twentyafterfour: thanks :)
[02:02:24] mostly drydock and harbormaster stuff which will be cool for CI experimentation
[02:02:50] are we on the latest release?
[02:04:38] I just pulled in the newest stable as of today
[02:05:34] so the last 3 weeks or so of updates are relevant
[02:07:48] cool
[02:07:56] The nicest improvements as of late are the new `arc land` in arcanist. Landing a patch works much better than before, they addressed every complaint we had from using arcanist for a couple of weeks.
[02:12:01] ahahahaha https://secure.phabricator.com/w/changelog/2015.46/
[02:12:37] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[02:15:23] (03CR) 10Eevans: [WIP] hooks-based event production (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans)
[02:18:36] (03CR) 10Eevans: "An extension makes sense, so (with Ori's help) I've put one up at: https://gerrit.wikimedia.org/r/#/c/254086/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans)
[02:18:56] lol
[02:19:20] (03Abandoned) 10Eevans: [WIP] hooks-based event production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans)
[02:23:00] !log l10nupdate@tin Synchronized php-1.27.0-wmf.6/cache/l10n: l10nupdate for 1.27.0-wmf.6 (duration: 06m 58s)
[02:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:07] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:34:06] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[03:07:06] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: puppet fail
[03:17:14] jhobs: email is best right now
[03:35:57] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0]
[04:08:37] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 20.83% of data above the critical threshold [100000000.0]
[04:18:42] (03CR) 10MZMcBride: "This changeset strikes me as hasty and consequently sloppy, which is coincidentally similar to how I felt about the initial incident in th" [puppet] - 10https://gerrit.wikimedia.org/r/253951 (owner: 10Chad)
[04:19:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[04:24:53] (03Abandoned) 10Yuvipanda: Revoke my shell access [puppet] - 10https://gerrit.wikimedia.org/r/253951 (owner: 10Chad)
[04:31:56] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1816325 (10yuvipanda)
[04:32:27] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1813172 (10yuvipanda) Happened again to @niharika with commtech-1, which was also on labvirt1010. Restarting nova-compute fixed it again.
[04:32:52] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1816327 (10yuvipanda) p:5Triage>3High
[04:33:27] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail
[04:35:44] ^ is ok
[04:35:50] I accidentally typed sudo puppet agent -tv
[04:36:05] when what I really meant was source ~/novaenv.sh
[04:38:37] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:38:37] (03PS1) 10Yuvipanda: labs: Kill the old labslamp role [puppet] - 10https://gerrit.wikimedia.org/r/254101 (https://phabricator.wikimedia.org/T118784)
[04:39:17] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0]
[04:41:05] (03CR) 10Yuvipanda: [C: 032] labs: Kill the old labslamp role [puppet] - 10https://gerrit.wikimedia.org/r/254101 (https://phabricator.wikimedia.org/T118784) (owner: 10Yuvipanda)
[04:45:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0]
[04:46:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0]
[04:46:47] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:46:47] 7Puppet, 6operations, 5Patch-For-Review: Remove the webserver module - https://phabricator.wikimedia.org/T118786#1816333 (10yuvipanda)
[04:49:39] (03PS1) 10Yuvipanda: Remove old and terrible labs-mysql-server role [puppet] - 10https://gerrit.wikimedia.org/r/254102
[04:51:31] (03CR) 10Yuvipanda: [C: 032] "BEGONE YOU ANCIENT SCOURGE" [puppet] - 10https://gerrit.wikimedia.org/r/254102 (owner: 10Yuvipanda)
[04:53:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000000.0]
[04:53:58] (03PS1) 10Yuvipanda: labs: Use mysql directly in mediawiki singlenode [puppet] - 10https://gerrit.wikimedia.org/r/254103
[04:54:27] (03CR) 10Yuvipanda: [C: 032] "I MIGHT NEVER GET RID OF YOU, DIFFERENT ANCIENT SCOURGE" [puppet] - 10https://gerrit.wikimedia.org/r/254103 (owner: 10Yuvipanda)
[05:00:58] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[05:02:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[05:05:36] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:12:11] (03PS1) 10Yuvipanda: chromium: Remove module [puppet] - 10https://gerrit.wikimedia.org/r/254104
[05:15:07] (03Abandoned) 10Yuvipanda: chromium: Remove module [puppet] - 10https://gerrit.wikimedia.org/r/254104 (owner: 10Yuvipanda)
[05:17:46] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:33:36] 6operations, 6Commons: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1816393 (10ori)
[05:43:56] (03PS1) 10Yuvipanda: Move most of 'role' module to manifests/role [puppet] - 10https://gerrit.wikimedia.org/r/254106
[05:46:27] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:48:49] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1816397 (10chasemp) Do we think this affects labvirt1010 specifically then?
[05:50:38] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: puppet fail
[05:50:52] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1816398 (10yuvipanda) Not sure - it just so happened that the new instances I ran into were on 1010, but that could just be because that's where new instances are going now.
[05:53:58] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:55:35] (03PS1) 10Yuvipanda: Move puppetmaster roles to manifests/role [puppet] - 10https://gerrit.wikimedia.org/r/254107
[05:56:48] 6operations, 6Commons: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1816402 (10Stemoc)
[05:57:54] (03PS2) 10Yuvipanda: Remove manifests/stages.pp [puppet] - 10https://gerrit.wikimedia.org/r/231143 (owner: 10Faidon Liambotis)
[05:58:08] (03CR) 10Yuvipanda: [C: 031] Remove manifests/stages.pp [puppet] - 10https://gerrit.wikimedia.org/r/231143 (owner: 10Faidon Liambotis)
[05:58:23] (03CR) 10Yuvipanda: [C: 032] Move puppetmaster roles to manifests/role [puppet] - 10https://gerrit.wikimedia.org/r/254107 (owner: 10Yuvipanda)
[06:00:42] mmmmmmmmmmm
[06:00:45] that's really straaaang
[06:02:04] ok it's a bug in our role function
[06:03:47] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[06:04:22] yes that's me
[06:06:40] (03PS1) 10Yuvipanda: Revert "Move puppetmaster roles to manifests/role" [puppet] - 10https://gerrit.wikimedia.org/r/254109
[06:06:53] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Move puppetmaster roles to manifests/role" [puppet] - 10https://gerrit.wikimedia.org/r/254109 (owner: 10Yuvipanda)
[06:08:06] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: puppet fail
[06:09:27] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:13:42] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#1816403 (10yuvipanda) 3NEW
[06:14:06] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#1816410 (10yuvipanda) a:3Joe
[06:14:17] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail
[06:17:18] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:19:41] !log remove role::labs::instance puppetClass from all labs instances
[06:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:19:57] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:20:28] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:21:06] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:21:08] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:21:26] PROBLEM - Disk space on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:21:28] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:21:38] PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:17] PROBLEM - puppet last run on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:18] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:18] PROBLEM - SSH on mw1131 is CRITICAL: Server answer
[06:22:26] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:26] PROBLEM - RAID on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:28] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:46] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:26:27] RECOVERY - DPKG on mw1131 is OK: All packages OK
[06:26:37] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up
[06:26:46] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm
[06:26:58] RECOVERY - Disk space on mw1131 is OK: DISK OK
[06:26:59] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient
[06:27:07] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212
[06:27:56] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:27:58] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full
[06:27:58] RECOVERY - RAID on mw1131 is OK: OK: no RAID installed
[06:28:06] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:28:07] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:30:47] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:57] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:38] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:40:46] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[06:42:47] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:44:27] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[06:45:28] puppet always chokes at this time
[06:56:08] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:38] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:56] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:38] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:37] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:16:16] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:16] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:26] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:37] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:57] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:16] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:17] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:17] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:17:17] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:17:18] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:46] PROBLEM - HHVM processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:56] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:56] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:38] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:19:38] RECOVERY - DPKG on mw1147 is OK: All packages OK
[07:19:57] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed
[07:20:19] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient
[07:21:17] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm
[07:21:47] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures
[07:21:57] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up
[07:22:27] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212
[07:22:47] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full
[07:22:47] RECOVERY - Disk space on mw1147 is OK: DISK OK
[07:22:48] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[07:22:48] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:23:23] (03PS1) 10Yuvipanda: diamond: Add an sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254111 (https://phabricator.wikimedia.org/T118827)
[07:24:32] (03CR) 10jenkins-bot: [V: 04-1] diamond: Add an sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254111 (https://phabricator.wikimedia.org/T118827) (owner: 10Yuvipanda)
[07:25:47] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:26:58] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[07:27:08] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:17] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:26] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:27:26] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:27:37] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[07:27:46] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[07:29:56] (03PS2) 10Yuvipanda: diamond: Add an sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254111 (https://phabricator.wikimedia.org/T118827)
[07:33:08] (03CR) 10Yuvipanda: [C: 032] diamond: Add an sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254111 (https://phabricator.wikimedia.org/T118827) (owner: 10Yuvipanda)
[07:33:55] 6operations, 6Labs, 10Tool-Labs, 5Patch-For-Review: Write a diamond collector to collect active ssh sessions - https://phabricator.wikimedia.org/T118827#1816444 (10yuvipanda) We should be collecting active list of open ssh (and mosh) sessions now. This should enable us to see how active individual instance...
[07:48:02] (03PS1) 10Yuvipanda: diamond: Handle no sessions in sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254113 (https://phabricator.wikimedia.org/T118827)
[08:09:22] (03CR) 10Yuvipanda: [C: 032] diamond: Handle no sessions in sshsessions collector [puppet] - 10https://gerrit.wikimedia.org/r/254113 (https://phabricator.wikimedia.org/T118827) (owner: 10Yuvipanda)
[08:12:48] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[08:23:06] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[08:24:48] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[08:33:27] PROBLEM - RAID on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:33:36] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:33:56] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:08] PROBLEM - salt-minion processes on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:27] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:48] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:57] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:09] (03PS1) 10Yuvipanda: diamond: Restart service when the .py files change [puppet] - 10https://gerrit.wikimedia.org/r/254116
[08:38:40] (03PS1) 10KartikMistry: CX: Enable article-recommender-1 campaign as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254117 (https://phabricator.wikimedia.org/T118033)
[08:40:36] (03CR) 10Yuvipanda: [C: 032] diamond: Restart service when the .py files change [puppet] - 10https://gerrit.wikimedia.org/r/254116 (owner: 10Yuvipanda)
[08:47:33] 6operations, 6Labs, 10Tool-Labs, 5Patch-For-Review: Write a diamond collector to collect active ssh sessions - https://phabricator.wikimedia.org/T118827#1816505 (10yuvipanda) 5Open>3Resolved
[08:52:40] (03PS4) 10Yuvipanda: quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531
[08:52:57] (03PS5) 10Yuvipanda: quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531
[08:54:57] (03PS1) 10Yuvipanda: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119
[08:55:06] ori: ^ fixed the quarry redis stuff and also moved ores over.
[08:55:24] YuviPanda: diamond collector for ssh sessions seems a bit bizarre
[08:55:26] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[08:55:26] ori: we should fix the multiple-rename-commands problem before I can move tools
[08:55:32] ori: why so
[08:55:51] what is the range of values that you are expecting to see?
[08:56:17] 0 to (right now) about 25 (on tools-bastion-01)
[08:56:28] I suspect a lot of instances will be 0 for a long time
[08:56:41] right now we have very little aggregate way to see which instances have any people using them at all...
[08:56:53] it sounds like what you want is `last`
[08:57:14] I can't make a dashboard out of last no
[08:57:25] if you say salt I *will* laugh.
[08:57:32] dashboard doesn't seem like the right solution either
[08:57:37] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[08:57:40] hmm, fair enough.
[08:57:53] you don't really care how many ssh sessions a VM had on nov 17 vs nov 16
[08:57:59] you just want to know if it's getting used or not
[08:58:05] basically, yeah.
[08:58:14] so you need some kind of job to iterate through VMs and flag the ones that have had no logins for a while
[08:59:01] that's one way to do it
[08:59:28] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:59:41] and I could do it with salt if it worked (it's a lot worse in labs than in prod)
[09:00:09] I can write something that ssh's to all instances with my root key
[09:00:14] but then only someone with root key can know that info
[09:00:23] this way it's a lot more public, etc.
[09:00:28] the script can publish its output somewhere
[09:00:52] a dashboard is made up of graphs, but you don't need/want graphs, and it live-updates, but you don't need live-updates
[09:01:05] and in the end you'll need to visually scan it for hosts that have not had recent logins anyway
[09:01:16] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[09:01:29] I'll submit that a 10line(?) diamond collector is a better solution than: 1. writing a script that, 2. needs a root ssh key to run and many minutes /time periods, 3. that I'll have to run it manually and 3. publish results periodically
[09:01:56] s/better/easier/ I can definitely argue for.
[09:02:22] I know we're abusing graphite - we also use it to track puppet failures, for example
[09:02:24] it's your show, but i bet you anything that you end up spending more time on it this way
[09:03:06] we'll see! I'll probably come back to this in a month or something only now that I know it works
[09:03:16] and it did take me less than an hour to set this up...
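[Editor's note] The collector being debated above boils down to publishing one gauge per host: the number of active login sessions. The sketch below is an assumption, not the code from gerrit change 254111; the function names and the `who`-based parsing are illustrative only, and a real diamond collector would subclass `diamond.collector.Collector` and call `self.publish()`:

```python
# Hypothetical sketch of an "ssh sessions" metric: count active login
# sessions from `who` output (one non-empty line per session) and expose
# the count as a single gauge. Names here are made up for illustration.
import subprocess

def count_sessions(who_output):
    """One non-empty line of `who` output per active login session."""
    return sum(1 for line in who_output.splitlines() if line.strip())

def collect():
    # A real diamond Collector subclass would call self.publish('sessions', n)
    # from its collect() method; here we simply return the value.
    out = subprocess.run(["who"], capture_output=True, text=True).stdout
    return count_sessions(out)

sample = ("yuvi     pts/0        Nov 19 07:23 (203.0.113.5)\n"
          "ori      pts/1        Nov 19 08:55 (203.0.113.6)\n")
print(count_sessions(sample))  # → 2
```

Graphing that gauge per instance is what lets an idle VM (flat at 0) be spotted without a root key or a fan-out script, which is the trade-off argued over above.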
[09:03:19] i don't know what to say about salt, to me it sounds like: "how should i eat my rice? don't say 'with a fork' because my fork broke a long time ago" [09:03:31] the answer is still "with a fork", imo [09:03:40] sure, but I do not want to become a farmer [09:03:53] or a miner, to be more accurate... [09:04:37] salt and command execution is toxic technically (and maybe) politically [09:04:44] so I'm going to stay as far away from it as possioble [09:04:47] *possible [09:04:53] heh, fair enough [09:04:57] I've pssh scripts that let me target per-project / all-of labs [09:05:03] and that's good enough for me [09:05:06] I have! :D [09:05:14] gaaah [09:05:17] dammit :P [09:05:49] it's interesting - I've definitely become complacent about my english [09:05:56] I *think* that's a correct 've? [09:06:33] ori: unrelated, but eating rice with hands >> eating it with a fork :) [09:06:46] haha, okay, you win there [09:07:51] ori: thanks for the EventBus extension CI configuration :-} [09:08:11] np! [09:08:25] ori: I'll remember to follow up in a month or something about how this turns up :) [09:08:27] *out [09:08:55] (03CR) 10Ori.livneh: [C: 04-1] ores: Move to using redis::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254119 (owner: 10Yuvipanda) [09:09:26] ori: ah nice [09:10:04] the rename_command is just a matter of finding a bit of time to squint at the erb template and get it right [09:10:35] (03PS2) 10Yuvipanda: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 [09:10:41] i'll try to get to it before the weekend [09:10:44] ori: thanks [09:11:03] I'm not fully sure how to design the 'input' data structure there. currently it's a hash but you can't have multiple keys... [09:11:12] so one possibility is to have the value be a list and if so 'repeat' the key [09:11:17] but that seems a bit... strange.
idk [09:11:45] not that I've an alternative suggestion [09:11:46] no, i think lists should be lists [09:11:54] because take for example: client-output-buffer-limit slave 512mb 200mb 60 [09:12:05] that is an actual line from the config [09:12:12] that is clearly a list of space-separated values [09:12:16] right [09:12:22] so how'd you represent [09:12:28] rename command SLAVEOF '' [09:12:30] err [09:12:32] rename-command SLAVEOF '' [09:12:35] rename-command CONFIG '' [09:12:41] so if you keep rename-command as key [09:12:47] you can't since keys gotta be unique [09:12:54] and there can be N of these [09:12:55] one possibility, since this is probably an all-or-nothing config blob [09:13:22] is to just have rename_commands.conf in modules/redis/files, provision that in /etc/redis, and let individual config files import it [09:13:48] hmmm [09:14:20] different things want to rename different commands (tools probably a lot more restrictive than prod, for example). I guess we can provision those manually, sure... [09:14:37] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:14:40] is a bit icky, and we should find out if rename-command is an outlier or not [09:14:42] or we could just make it be a hash [09:14:45] yeah [09:14:55] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2002-b instance [puppet] - 10https://gerrit.wikimedia.org/r/254120 [09:15:04] ori: actually a hash sounds nice and is the proper abstraction [09:15:06] if it's a weird outlier then we can just special-case it and not feel too bad [09:15:20] yeah, since it is mapping an original name to a new name [09:15:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2002-b instance [puppet] - 10https://gerrit.wikimedia.org/r/254120 (owner: 10Filippo Giunchedi) [09:15:58] ori: you can have multiple 'bind' instances (although...
whyyy) [09:16:18] ori: oh, and multiple 'save' config entries, which actually makes sense [09:16:21] (https://raw.githubusercontent.com/antirez/redis/3.0/redis.conf [09:16:23] ) [09:16:27] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:16:31] i'm pro-rez [09:17:11] yeah. i think special casing these two is a bit yucky but if we get it right it's a nice API and no one has to know our secret [09:17:14] took me a while :) [09:17:22] I'm reading through everything lese [09:17:25] *else [09:18:13] ori: and client-output-buffer-limit [09:19:24] ori: I think only these three have behavior where you can have multiple number of these and it is not just 'last one wins' [09:23:18] so I've saved the foundation around $10,000 dollars in disk space already [09:23:58] * YuviPanda starts torrenting to fill that up [09:24:01] \o/ jynus [09:24:57] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [09:25:04] also less db pages == more will fit on memory == more performance [09:25:19] please avoid keeping your mp3 collection on the cluster in the future [09:25:28] i hope you have learned your lesson [09:25:37] ha [09:25:48] (but seriously, that's pretty cool.) [09:26:00] 'mp3' [09:26:03] mostly, it was used for deleted rows [09:26:20] plus the old table cleanup is also ongoing [09:26:24] * YuviPanda slides off to sleep with his panda pillow [09:26:30] good night everyone! \o/ [09:26:43] I believe that will extend the life of some servers a little longer [09:26:44] ori: I update the ores patch too :) (and the quarry patch too) [09:27:12] YuviPanda: how much would you hate me if i asked to replace the 'yes' with true? 
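The hash representation the discussion converges on (mapping each original command name to its replacement, with an empty string disabling the command) can be sketched like this. This is an illustrative Python renderer, not the actual ERB template in the puppet module:

```python
# Sketch of the hash-based abstraction: ordinary redis settings are
# single-valued key/value pairs, while rename_commands maps original
# command names to replacements and is emitted as repeated
# rename-command lines -- which is why it can't live in the same
# unique-key hash as everything else.
def render_redis_conf(settings, rename_commands):
    lines = ["%s %s" % (k, v) for k, v in sorted(settings.items())]
    for orig, new in sorted(rename_commands.items()):
        # An empty replacement name disables the command entirely.
        lines.append('rename-command %s "%s"' % (orig, new))
    return "\n".join(lines)

conf = render_redis_conf(
    {"appendonly": "yes", "maxmemory": "1gb"},
    {"CONFIG": "", "SLAVEOF": "", "DEBUG": ""},
)
print(conf)
```

`save` and `client-output-buffer-limit` are the other repeatable directives mentioned above and would need similar special-casing.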
[09:27:22] the template will render it to a yes [09:27:39] i think it's cleaner that way but i may be on the eccentric side here [09:27:43] ori: not very much, but I'll point out that *small* amounts of magic are sometimes more annoying than large amounts or no amount of magic :P [09:27:57] yes you are right [09:28:04] this sort of stuff is my kryptonite [09:28:17] so IMO true to yes auto conversion is kindof confusing [09:28:30] ori, true is cleaner, but in practice strings are transparently converted to true/false [09:28:30] (03PS3) 10Ori.livneh: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 (owner: 10Yuvipanda) [09:28:33] since if you put an actual 'true' there it'll blow up the config so you gotta read 'yes' in the redis config [09:28:40] and that has already caused issues [09:28:41] (03CR) 10Ori.livneh: [C: 031] ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 (owner: 10Yuvipanda) [09:28:45] and mentally convert it to 'true' [09:28:47] and vice versa [09:28:47] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [09:29:07] yes [09:29:11] i mean, true. [09:29:12] so I think we should just use 'yes' and not even translate true / false to it. 
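The "just use 'yes' and don't translate true/false" position above could be enforced by rejecting booleans at the value level. A hypothetical validation helper (the names here are invented for illustration, not from the actual module):

```python
# Sketch: refuse boolean values so what appears in puppet source is the
# literal 'yes'/'no' string that ends up in the rendered redis.conf,
# with no silent true -> yes coercion to mentally translate.
def check_value(key, value):
    if isinstance(value, bool):
        raise TypeError(
            "%s: use the strings 'yes'/'no', not a boolean, so the "
            "puppet source matches the rendered redis.conf" % key)
    return value

check_value("appendonly", "yes")      # accepted as-is
try:
    check_value("appendonly", True)   # rejected
except TypeError as exc:
    print(exc)
```

This trades a little convenience for having the config read identically on both sides.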
[09:29:14] haha :P [09:30:28] it could also cause issues in the other way, so I am undecided [09:30:43] you can make it barf on booleans if you want :) [09:30:49] that'll enforce 'yes' vs true [09:30:56] "if (string)" [09:31:16] nah, it's the value side of a hash [09:31:24] so you allow whatever, just not bools [09:31:38] just imagine if james joyce cast 'yes' to true [09:31:44] then the ending of ulysses would be like this: [09:31:56] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [09:32:00] "I asked him with my eyes to ask again true and then he asked me would I true to say true my mountain flower and first I put my arms around him true and drew him down to me so he could feel my breasts all perfume true and his heart was going like mad and true I said true I will True. " [09:32:02] nice ending there, icinga-wm [09:32:25] doesn't have the same panache [09:32:37] true :P [09:33:04] (03PS1) 10Jcrespo: Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254121 [09:35:06] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:35:33] I should sleep now. thanks for the reviews, ori [09:35:44] good night! [09:35:51] actually... [09:36:04] moritzm: are you considering fixing up our ldap module along with the move to openldap? [09:36:09] (03CR) 10Jcrespo: [C: 04-1] "Not repool yet, forgot to upgrade." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/254121 (owner: 10Jcrespo) [09:36:09] it's a complete mess [09:36:35] no, it will be replaced by a role based on the existing openldap module [09:36:37] PROBLEM - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: Connection refused [09:36:47] WIP is here: https://gerrit.wikimedia.org/r/#/c/253347/ [09:37:06] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [09:37:26] moritzm: the ldap module is unfortunately not solely opendj (although that's partially there too) but all the ldap client stuff [09:38:46] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [09:38:56] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:39:44] YuviPanda: yeah, but I'm currently not planning to touch the client bits, that can be done at a later point [09:39:52] ok [09:40:05] it's probably some of the messiest code (both python and puppet) left in our repo [09:41:47] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [09:44:45] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test web site in alternative language returned the unexpected status 520 (expecting: 200) [09:45:45] (03CR) 10Alexandros Kosiaris: "puppet compiler would have told you that (works fine these days). Not sure why move them back into manifests/role though ?
we are trying t" [puppet] - 10https://gerrit.wikimedia.org/r/254109 (owner: 10Yuvipanda) [09:47:07] (03CR) 10Yuvipanda: "It did tell me for https://gerrit.wikimedia.org/r/#/c/254106/ and I was like 'wat that makes no sense' and tried it on this smaller change" [puppet] - 10https://gerrit.wikimedia.org/r/254109 (owner: 10Yuvipanda) [09:48:47] (03CR) 10Alexandros Kosiaris: [C: 04-2] "actually, no. This layout is deprecated and will not work with puppet 4 (import is deprecated). Moving them into the role module with auto" [puppet] - 10https://gerrit.wikimedia.org/r/254106 (owner: 10Yuvipanda) [09:48:56] PROBLEM - NTP on pybal-test2001 is CRITICAL: NTP CRITICAL: No response from NTP server [09:49:13] (03Abandoned) 10Yuvipanda: Move most of 'role' module to manifests/role [puppet] - 10https://gerrit.wikimedia.org/r/254106 (owner: 10Yuvipanda) [09:53:15] !log delete old mobileapps CF stats from graphite1001 / graphite2001 [10:02:34] (03PS7) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) [10:03:51] (03PS1) 10Muehlenhoff: Bump to kernel ABI 3.19.0-2 / version 3.19.3-9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/254123 [10:05:25] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1816638 (10Aklapper) [10:08:24] !log going to restart Jenkins a few times [10:08:29] deadlock deadlock [10:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:26] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [10:11:41] (03PS4) 10Yuvipanda: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 [10:11:43] (03PS6) 10Yuvipanda: quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531 [10:11:45] (03PS1) 10Yuvipanda: labs: Cleanup role classes (part 
#1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 [10:11:47] (03PS1) 10Yuvipanda: labs: Removed unused 'swiftpartition' role [puppet] - 10https://gerrit.wikimedia.org/r/254125 [10:12:00] akosiaris: ^ this cleans up all the things that IMO should have role::labs prefix into autolayout [10:12:20] akosiaris: I'll do part #2 later on that removes roles::labs from things that shouldn't [10:12:38] akosiaris: FYI I'm going ahead with https://gerrit.wikimedia.org/r/#/c/244647/ [10:13:03] * YuviPanda goes to bed before he's nerdsniped again [10:13:10] ciao YuviPanda [10:13:32] !log disable puppet on deployment_target:restbase/deploy [10:13:36] (03CR) 10jenkins-bot: [V: 04-1] labs: Cleanup role classes (part #1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 (owner: 10Yuvipanda) [10:13:49] godog: cool! [10:13:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [10:13:55] YuviPanda: ciao, have a nice night [10:14:43] (03PS5) 10Yuvipanda: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 [10:14:44] bah jenkins [10:14:45] (03PS2) 10Yuvipanda: labs: Cleanup role classes (part #1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 [10:14:47] (03PS2) 10Yuvipanda: labs: Removed unused 'swiftpartition' role [puppet] - 10https://gerrit.wikimedia.org/r/254125 [10:14:48] * YuviPanda fixes [10:14:49] (03PS7) 10Yuvipanda: quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531 [10:15:02] (03PS3) 10Yuvipanda: labs: Removed unused 'swiftpartition' role [puppet] - 10https://gerrit.wikimedia.org/r/254125 [10:15:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump to kernel ABI 3.19.0-2 / version 3.19.3-9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/254123 (owner: 10Muehlenhoff) [10:16:36] (03CR) 10Yuvipanda: [C: 032 V: 032] "DIE DIE DIE DIE DIE DIE" [puppet] - 
10https://gerrit.wikimedia.org/r/254125 (owner: 10Yuvipanda) [10:17:05] now I go away for reaaaaal [10:17:15] (03PS1) 10Muehlenhoff: Add a .gitreview file [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/254126 [10:17:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a .gitreview file [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/254126 (owner: 10Muehlenhoff) [10:19:45] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:19:46] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [10:27:36] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [10:33:53] mobrovac: I've moved restbase to systemd in the test cluster and restarted, LGTM but I'd like another set of eyes too [10:34:47] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [10:37:16] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [10:38:26] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:38:36] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [10:40:20] !log hhvm-dump-debug --full and bounce hhvm on mw1131 [10:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:56] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 66289 bytes in 2.806 second response time [10:42:05] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [10:42:56] !log hhvm-dump-debug --full and bounce hhvm on mw1147 [10:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:16] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [10:44:35] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 440 bytes in 9.093 second response time [10:44:45] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [10:45:26] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 66288 bytes in 0.208 second response time [10:48:26] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:49:57] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [10:55:18] (03CR) 10Filippo Giunchedi: RESTBase configuration for scap3 deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [10:58:06] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [10:59:35] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [11:00:43] !log modifying filters on all labs db replicas (sanitarium), small amounts of lag could be created [11:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:57] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:02:15] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:04:07] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:05:26] (03PS1) 10Filippo Giunchedi: deployment: add redis socket_connect_timeout [puppet] - 10https://gerrit.wikimedia.org/r/254128 (https://phabricator.wikimedia.org/T118380) [11:07:16] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:07:55] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url 
http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:09:36] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:20:16] RECOVERY - zuul_merger_service_running on scandium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [11:20:32] scandium / gallium alarms are mine [11:20:35] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [11:21:09] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [11:21:24] !log stopped zuul-merger on gallium, started the one on scandium [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:08] !log restarted zuul-merger on gallium. We now have two instances running (gallium and scandium) [11:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:37] (03CR) 10Hashar: "I copy pasted the key from gallium to scandium. Manually established a ssh connection to gerrit for the known key:" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [11:25:29] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1816724 (10hashar) I copy pasted the key from gallium to scandium. Manually established a ssh connection to gerrit for the known key: zuul@scandi... 
[11:26:15] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:27:13] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1816727 (10hashar) [11:30:01] 6operations, 7Database: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1816735 (10jcrespo) 3NEW [11:34:05] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [11:35:56] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:36:24] (03PS1) 10Hashar: contint: grant zuul-merger sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/254129 (https://phabricator.wikimedia.org/T116921) [11:36:26] (03PS1) 10Hashar: admin: remove CI root access from scandium [puppet] - 10https://gerrit.wikimedia.org/r/254130 (https://phabricator.wikimedia.org/T116921) [11:39:13] (03PS1) 10Hashar: Remove lanthanum.eqiad.wmnet hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/254132 (https://phabricator.wikimedia.org/T86658) [11:39:40] lunch [11:40:35] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:42:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [5000000.0] [11:44:17] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:49:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [11:51:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5000000.0] [11:57:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: 
OK: Less than 1.00% above the threshold [1000000.0] [11:57:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [11:59:20] (03PS4) 10Muehlenhoff: WIP: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [12:00:37] 6operations, 7Database: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1816782 (10jcrespo) p:5Normal>3Low With the new title, where I do not block anybody, I will set it to low (good to have, but not so important). [12:04:36] (03PS5) 10Muehlenhoff: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [12:05:35] (03CR) 10Muehlenhoff: [C: 031] redis: prohibit commands CONFIG, SLAVEOF and DEBUG by default [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [12:08:45] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [12:11:55] db1060 is experiencing some issues [12:12:43] godog: looking [12:14:26] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:17:30] godog: can't find evidence of rb being on systemd on cerium [12:17:40] godog: where's the service file? [12:18:12] but then again, there's no /etc/init.d/ file either [12:18:12] hm [12:19:27] mobrovac: try systemctl status restbase [12:19:36] Loaded: loaded (/lib/systemd/system/restbase.service; static) [12:19:43] ha damn [12:19:50] did that but missed the file loc [12:19:51] haha [12:19:56] pages? [12:19:59] thnx godog [12:19:59] but not here? [12:20:02] jynus? [12:20:03] pages [12:20:09] I see [12:20:11] db1054 [12:20:13] why aren't they here? 
[12:21:07] think we 've seen that before [12:21:14] its been broken for awhile, unless someone fixed it [12:21:29] (page class alerts not appearing in irc) [12:21:30] I think it's that the icinga alerts go to a database irc channel [12:21:55] it seems mismatched that the pages are general and the irc is subteam/channel [12:23:16] it is https://phabricator.wikimedia.org/T118186 [12:25:15] oh hmm I don't see a db channel in icinga-wm config, maybe I'm wrong [12:26:43] there have been some unusually-large spikes of PURGE traffic the past few hours, probably unrelated to the above, but just noting [12:28:56] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:29:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline. Approach seems sane." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff) [12:29:59] godog: you have applied the patch in staging I assume? [12:30:46] LGTM overall [12:30:54] mobrovac: yeah only staging ATM, I'll reenable puppet in production and do a rolling restart if it looks good to you [12:31:16] can we block a user, that would be easier for now? [12:31:26] godog: i'll try service restart after killing rb to see if it'll bring it up on cerium [12:31:32] anyone with zhwiki rights? [12:33:43] mobrovac: ok! 
[12:34:55] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [12:36:47] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:37:33] (03CR) 10Muehlenhoff: labs openldap role (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff) [12:37:47] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [12:38:45] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [12:39:08] !log failing over ulsfo traffic to lvs400[34] (for 1.13.1 testing) [12:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:35] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:44:16] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [12:45:25] jynus: the staff global group [12:45:26] PROBLEM - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [12:45:43] ACKNOWLEDGEMENT - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black intentional [12:45:43] ACKNOWLEDGEMENT - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black intentional [12:45:52] godog: also, re switch to systemd, related is https://phabricator.wikimedia.org/T118401 [12:47:02] 6operations, 10RESTBase, 6Services: Switch RESTBase to use service::node - https://phabricator.wikimedia.org/T118401#1816875 (10mobrovac) [12:47:55] (03CR) 10Mobrovac: RESTBase configuration for scap3 deployment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [12:51:56] RECOVERY - pybal on lvs4001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:53:07] RECOVERY - 
pybal on lvs4002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [12:55:32] p858snake, thanks, I am going to not ask to block as a first measure :-), I will try contacting first [13:00:56] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [13:06:47] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:22:06] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [13:23:12] Hello, I got the message that JCrespo is looking for me [13:25:58] @jynus [13:26:25] Hi, there. I see your post about api and all on zhwp. [13:26:25] yes, I am talking in private with him [13:27:18] SzMithrandir, can you join #wikipedia-zh ? [13:27:34] I'm just wondering whether gadgets (TW, HotCat, etc.) are affected [13:27:42] OK [13:29:56] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:35:02] mobrovac: I'll start reenabling puppet in codfw prod and roll restart restbase btw [13:35:46] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [13:36:41] (03CR) 10Hashar: [C: 031] "Fine to me. We will want to train ops about how to use bundler which is really just about:" [puppet] - 10https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: 10Zfilipin) [13:42:52] kk godog [13:48:37] (03CR) 10JanZerebecki: [C: 031] deactivate wikidata.pt [dns] - 10https://gerrit.wikimedia.org/r/254042 (owner: 10Dzahn) [13:52:09] what does "Issues with some bots/tools hitting recentchanges" on zh.wp mean?
[13:53:48] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [13:55:16] (03PS6) 10Muehlenhoff: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [13:55:37] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [13:56:37] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:00:17] 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1816979 (10Addshore) [14:00:31] 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1807973 (10Addshore) Changed the title and remove access requests per the discussion here [14:02:35] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [14:14:11] !log roll-restart restbase in codfw [14:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:46] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:20:08] PROBLEM - Restbase root url on restbase2003 is CRITICAL: Connection refused [14:20:27] 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1817044 (10Addshore) [14:20:45] PROBLEM - Restbase endpoints health on restbase2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:20:55] mobrovac: I've restarted restbase on 2003 and all workers came up but no port listening yet, known? 
[14:21:21] hm no [14:21:27] * mobrovac taking a look [14:23:26] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [14:23:37] godog: a lot of failed start-up msgs in logstash [14:23:43] godog: will re-deploy there [14:25:09] mobrovac: ah yeah also Error: unknown module type spec (for module mobileapps-public). type of error [14:25:46] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.117 second response time [14:26:06] godog: ^^ [14:26:18] \o/ [14:26:25] RECOVERY - Restbase endpoints health on restbase2003 is OK: All endpoints are healthy [14:26:42] will proceed with the rest, thanks [14:27:34] np [14:29:42] mobrovac: same thing on 2004 btw, startup messages on restart, Error: unknown module type spec (for module mobileapps-public). [14:30:11] mobrovac: though yesterday's deploy should have finished successfully (?) [14:30:34] yeah, this is bizarre [14:30:45] PROBLEM - Restbase root url on restbase2004 is CRITICAL: Connection refused [14:30:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [14:30:56] godog: but this seems to be happening now [14:31:05] i.e. worked until now [14:31:08] wth? [14:31:16] mobrovac: yeah I restarted on 2004 and there were startup errors [14:31:28] PROBLEM - Restbase endpoints health on restbase2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:31:44] note 2001 and 2002 restarted just fine [14:31:47] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:32:42] Could I get a review of https://gerrit.wikimedia.org/r/#/c/253467 from a fellow op? Reluctant to self-merge an access patch. 
[14:32:57] godog: rb2004 doesn't have the latest code, very strange [14:33:10] godog: will check 200[56] as well now [14:34:27] godog: yup, 5 & 6 suffer from the same problem, don't understand how ... [14:34:34] godog: will redeploy to codfw [14:34:56] mobrovac: kk, is it just the last deploy missing or more? [14:35:10] yup, just yesterday's [14:36:26] RECOVERY - Restbase root url on restbase2004 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.113 second response time [14:37:16] RECOVERY - Restbase endpoints health on restbase2004 is OK: All endpoints are healthy [14:38:47] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:39:01] godog: done, all are up now [14:39:24] mobrovac: sweet, so all running systemd too now, will move to eqiad shortly [14:39:41] !log uploaded linux 3.19.3-9 to carbon [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:31] !log uploaded linux-meta 1.3 to carbon [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:07] !log roll-restart restbase on aqs100* [14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:27] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [14:47:56] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:49:03] (03PS3) 10coren: toollabs: make sure /tmp and swap are large for all exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/252506 (https://phabricator.wikimedia.org/T118419) (owner: 10Merlijn van Deen) [14:52:06] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:55:48] !log roll restart restbase in eqiad [14:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:27] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [14:59:55] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [15:02:33] 6operations, 10ops-esams: Replace cr2-knams MX80 MIC slot with a 2x10G MIC - https://phabricator.wikimedia.org/T111765#1817187 (10faidon) [15:05:45] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:11:25] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [15:13:48] PROBLEM - Hadoop NodeManager on analytics1050 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:15:45] RECOVERY - Hadoop NodeManager on analytics1050 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:18:56] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:19:43] 6operations, 10netops: Notify transits of new esams prefixes - https://phabricator.wikimedia.org/T81989#1817261 (10faidon) [15:21:09] I just got "MediaWiki internal error. Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information." on wikitechwiki when I tried to login [15:21:19] Someone who can search for a stacktrace? [15:21:38] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1817275 (10Andrew) I have new backports ready for testing. Can I get a volunteer to install them on a beta-cluster box and verify that things work ok? [15:24:04] tgr|away or kaldari — I have new rsvg packages but would like to see them tested on the beta cluster. Can one of you help with that? 
[15:26:16] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1817285 (10Andrew) andrew@copper:~$ ls /var/cache/pbuilder/result/trusty-amd64/*rsvg* /var/cache/pbuilder/result/trusty-amd64/gir1.2-rsvg-2.0_2.40.11-1_amd6... [15:30:47] (03CR) 10Muehlenhoff: "I ran tcpdump on it and confirmed that all accesses were from the internal 10.x network." [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [15:32:25] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1817294 (10Ottomata) FYI, the repo is here, waiting for some schemas! :) https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/event-schemas Fo... [15:33:35] (03CR) 10Jcrespo: [C: 031] Add ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [15:37:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [15:38:58] andrewbogott: yes, I can help test. [15:39:27] (03PS1) 10Filippo Giunchedi: swiftrepl: add main() [software] - 10https://gerrit.wikimedia.org/r/254144 [15:39:29] (03PS1) 10Filippo Giunchedi: swiftrepl: add setup.py [software] - 10https://gerrit.wikimedia.org/r/254145 [15:39:31] (03PS1) 10Filippo Giunchedi: swiftrepl: add 'container-set' selection [software] - 10https://gerrit.wikimedia.org/r/254146 [15:39:34] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1817302 (10MoritzMuehlenhoff) >>! In T112421#1814031, @Andrew wrote: > Giuseppe says: > > - The current package was a clean backport with the security patc... 
[15:39:38] kaldari: I assume that beta replicates the rendering bits, but not sure where to begin [15:39:46] Do you just need me to copy the debs someplace for you? [15:39:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [15:40:36] (03PS2) 10Rush: hiera-ize labs openstack nova configuration [puppet] - 10https://gerrit.wikimedia.org/r/254056 [15:40:54] (03PS3) 10Muehlenhoff: Add ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) [15:40:55] andrewbogott: debs? [15:41:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [15:42:47] (03Abandoned) 10Muehlenhoff: Enable ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/235429 (owner: 10Muehlenhoff) [15:43:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [15:43:19] kaldari: What I mean is: I’ve backported 2.40.11 to trusty, and have the new debian packages. I can install them on beta (if someone tells me where they go) or I can just give the packages to you and leave you to it. [15:44:14] I wouldn't know where to install them either :( [15:44:27] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817325 (10Andrew) I see the same behavior on labvirt1005. And, indeed, restarting nova-compute helps, whereas restarting nova-network does not. [15:44:38] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817326 (10Andrew) a:3Andrew [15:47:46] kaldari: I’ll ask in -releng [15:48:02] thanks! 
[15:49:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [15:49:28] hashar, have you written me from a non-wikimedia address? [15:49:34] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1817339 (10Cmjohnson) 5Open>3Resolved Added to spares page -- resolving this task [15:49:56] PROBLEM - SSH on pybal-test2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:50:19] jynus: yeah [15:50:21] (I do not care, but I do not trust random email addresses) [15:50:27] jynus: sorry my work related email hashar at free dot fr [15:50:37] (03CR) 10Eevans: "Should we remove the threshold, or just set it to something higher? A moving average with a large enough window size, combined with a mor" [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [15:50:48] jynus: I refrained from writing a wall of text to a couple tasks and went to direct email :-/ No urgency [15:51:27] it is ok, just double checking, one can never be too cautious these days [15:51:30] Message-ID: <564DED17.6080900@free.fr> [15:51:34] I should really use pgp [15:51:46] but I can't find my private key anymore. It is somewhere on one of my old HDD :-( [15:54:20] although to be fair, I would prefer phab or IRC :-) [15:55:01] I will forget your email almost immediately otherwise [15:56:26] jynus: IRC will do [15:56:48] jynus: I will let you pick up a time somewhere next week. 
My agenda on amusso@wikimedia.org is up to date :-} [15:57:15] merely want to clarify what needs to be done, then we can get the tasks updated and sub tasks opened as needed [15:59:51] sure [16:00:23] basically, reviewing where things are now and distributing work [16:00:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [16:01:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [16:08:51] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817406 (10Andrew) 2015-11-19 16:07:59.145 56644 ERROR oslo_messaging.rpc.dispatcher [req-83c2d563-4f10-4f78-927d-f3e8482edade andrew testlabs - - -] Exception during message handling: 'metadata' 2015-11-19... [16:12:53] (03PS1) 10KartikMistry: CX should default to using rest.wm.o, not parsoid-lb [puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) [16:13:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/2: down - Core: cr1-ulsfo:xe-0/0/3 GTT/TiNet (02773-004-32) [2Gbps MPLS]BR [16:14:01] (03CR) 10jenkins-bot: [V: 04-1] CX should default to using rest.wm.o, not parsoid-lb [puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [16:17:16] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:17:18] (03CR) 10Filippo Giunchedi: "yep we can keep the alarm and raise the thresholds and see where that gets us too" [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [16:17:20] (03CR) 10Eevans: restbase: higher sstable per read threshold (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [16:18:00] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817431 (10Andrew) This looks to be fixed by https://review.openstack.org/#/c/222023/ [16:19:12] (03PS2) 10KartikMistry: CX should default to using rest.wm.o, not parsoid-lb [puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) [16:19:14] !log replacing pem 0 cr1-eqiad [16:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:41] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817456 (10Andrew) and... fixed in 2015.1.2. I will upgrade and we'll see what we get. https://bugs.launchpad.net/nova/+bug/1484738 [16:24:02] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1817457 (10Cmjohnson) PEM 0 Replaced, the error cleared. Leaving open until I have shipping information to return the part. [16:24:38] !log repool restbase1002 [16:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:26] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1817477 (10GWicke) @robh, any updates on the timeline? Are we still on track for having this hardware r... [16:31:09] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1817486 (10RobH) I have the first vendors initial quotes in (they require a single correction, being su... [16:31:56] gwicke: i reviewed the misc quote yesterday, was out sick on tuesday. 
so its pushed back a day but otherwise i should have in both vendors for escalation early next week. (The second vendor can take 48 hours for a quote at times, so I plan to have it to them in the next hour but it may be late Friday, early Monday when they reply) [16:32:05] regards to ^ task updates [16:35:54] 6operations, 6Labs: Security setting changes are not applied - https://phabricator.wikimedia.org/T118936#1817492 (10Andrew) 5Open>3Resolved I've updated nova-compute to 2015.1.2 on all labvirt nodes. Seems fixed. [16:38:25] RECOVERY - Hadoop NodeManager on analytics1046 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:40:40] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1817498 (10RobH) [16:42:27] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:44:25] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:45:58] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1817505 (10Mattflaschen) [16:47:29] (03PS1) 10Muehlenhoff: Exclude apport from toollabs genpp python list [puppet] - 10https://gerrit.wikimedia.org/r/254156 [16:54:04] (03PS3) 10Rush: hiera-ize labs openstack nova configuration [puppet] - 10https://gerrit.wikimedia.org/r/254056 [16:56:33] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1817549 (10jcrespo) @Superyetkin @IKhitron, thanks to @Krenair progress, I think the issue has improved, at least... 
[16:59:57] (03CR) 10Hashar: [C: 04-1] "Andrew uploaded the ssh key to the private git repo \O/" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:00:05] _joe_ moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151119T1700). [17:00:42] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1817562 (10IKhitron) Thank you, @jcrespo, but you should know that all three-days queries (you mentioned many of t... [17:02:06] robh: thanks! [17:02:27] I've also added the misc system refresh as a blocker and now there is a google sheet [17:02:34] so progress is more evident and you can follow =] [17:02:53] after doing the sheet comparison for restbase, its now standard (its useful) [17:02:56] (03PS2) 10Merlijn van Deen: Exclude apport from toollabs genpp python list [puppet] - 10https://gerrit.wikimedia.org/r/254156 (owner: 10Muehlenhoff) [17:03:46] (03CR) 10Merlijn van Deen: "I think removing this is probably OK, but there should probably be an e-mail to labs-l noting it is being removed." [puppet] - 10https://gerrit.wikimedia.org/r/254156 (owner: 10Muehlenhoff) [17:04:15] RECOVERY - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is OK: TCP OK - 0.037 second response time on port 9042 [17:08:51] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1817596 (10kevinator) [17:35:28] (03PS1) 10Filippo Giunchedi: restbase: add restbase2002-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/254159 [17:37:14] (03CR) 10Dzahn: "@Alex Monk, yes, it looks like it is. 
thanks" [dns] - 10https://gerrit.wikimedia.org/r/254057 (owner: 10Dzahn) [17:37:18] (03Abandoned) 10Dzahn: deactivate wikimedia.community [dns] - 10https://gerrit.wikimedia.org/r/254057 (owner: 10Dzahn) [17:37:50] (03CR) 10Dzahn: "you reminded me why i had this pending patch to add a template for "mailonly" domains.. which i abandoned.. hrmm" [dns] - 10https://gerrit.wikimedia.org/r/254057 (owner: 10Dzahn) [17:39:57] (03PS2) 10Filippo Giunchedi: restbase: add restbase2002-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/254159 [17:40:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: add restbase2002-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/254159 (owner: 10Filippo Giunchedi) [17:52:59] (03CR) 10Alexandros Kosiaris: "Looks fine to me, but it probably needs to be coordinated with a cxserver update ? I can merged when you feel like it" [puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [17:59:35] (03CR) 10Filippo Giunchedi: RESTBase configuration for scap3 deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [18:02:17] (03PS2) 10Filippo Giunchedi: restbase: raise threshold for pending compactions [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) [18:02:18] (03PS2) 10Filippo Giunchedi: restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) [18:03:47] (03CR) 10Filippo Giunchedi: restbase: higher sstable per read threshold (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:06:27] ejegg: your access request https://phabricator.wikimedia.org/T118320 is blocked pending some things you need to do. No rush, just want to be clear that this is in your court. 
[18:07:32] PROBLEM - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: Connection refused [18:12:54] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1817856 (10Andrew) @JGirault and @Jdrewniak, please visit and sign the L3 document regarding responsibilities... [18:13:53] andrewbogott: sorry, i timed that request poorly - was out of town for a week. i'll get that form signed today! thanks for the reminder [18:14:07] np, thanks [18:17:43] (03CR) 10Dzahn: [C: 031] Add new group aqs-users for shell and cqlsh access only. [puppet] - 10https://gerrit.wikimedia.org/r/253467 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [18:18:22] (03PS2) 10Andrew Bogott: Add new group aqs-users for shell and cqlsh access only. [puppet] - 10https://gerrit.wikimedia.org/r/253467 (https://phabricator.wikimedia.org/T117473) [18:19:50] (03CR) 10Andrew Bogott: [C: 032] Add new group aqs-users for shell and cqlsh access only. [puppet] - 10https://gerrit.wikimedia.org/r/253467 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [18:21:15] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1817895 (10Andrew) 5Open>3Resolved a:3Andrew The change will roll out over the next 20 minutes or so. Please re-open if you encounter any difficulties. [18:22:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 [18:25:09] (03CR) 10KartikMistry: "Yes, along with cxserver and CX changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [18:27:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1817926 (10RobH) This patchset included adding a sudo line for the cql shell, which I thought I was clear we didn't need to merge right away? Also, since addi... [18:29:41] (03PS1) 10Andrew Bogott: Revert "Add new group aqs-users for shell and cqlsh access only." [puppet] - 10https://gerrit.wikimedia.org/r/254172 [18:29:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1817946 (10Andrew) 5Resolved>3Open My mistake, this needs further Ops review [18:30:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/2: down - Core: cr1-ulsfo:xe-0/0/3 GTT/TiNet (02773-004-32) [2Gbps MPLS]BR [18:30:36] (03CR) 10Andrew Bogott: [C: 032] Revert "Add new group aqs-users for shell and cqlsh access only." 
[puppet] - 10https://gerrit.wikimedia.org/r/254172 (owner: 10Andrew Bogott) [18:31:23] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:43] PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:32:52] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [18:33:22] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:34:56] (03PS1) 10Andrew Bogott: Add new group aqs-users for shell only. [puppet] - 10https://gerrit.wikimedia.org/r/254173 (https://phabricator.wikimedia.org/T117473) [18:35:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1817988 (10RobH) If my tone there seemed harsh it wasn't meant to be! This access request took a bit of figuring out, so it was easy to lose some points in th... [18:36:12] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:36:15] (03CR) 10Dzahn: [C: 031] Add new group aqs-users for shell only. [puppet] - 10https://gerrit.wikimedia.org/r/254173 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [18:36:31] (03CR) 10RobH: [C: 031] "Looks good to me, non-sudo request that can be merged (as its been over 3 days of review on the task)." 
[puppet] - 10https://gerrit.wikimedia.org/r/254173 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [18:36:42] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:37:19] (03CR) 10Andrew Bogott: [C: 032] Add new group aqs-users for shell only. [puppet] - 10https://gerrit.wikimedia.org/r/254173 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [18:40:01] RECOVERY - Hadoop NodeManager on analytics1049 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:44:34] greg-g: i would like to deploy a bug fix (aka fix compatibility of wikibase needed for core change) [18:44:37] https://gerrit.wikimedia.org/r/#/c/254175/ [18:44:48] ideally before we deploy wmf7 to all wikipedias [18:45:07] aude: kk doit [18:45:17] ok, thanks [18:45:18] twentyafterfour: ^^ fyi, aude's going before the train update [18:45:30] should be quick (as jenkins allows) [18:45:39] not a problem, thanks for the heads up [18:49:12] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1818065 (10Andrew) ok, sorry for confusion. Everyone mentioned in the ticket now has the shell access granted, but not the right to launch cqlsh. Please foll... 
[18:51:38] (03CR) 10GWicke: [C: 031] restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:51:49] (03PS3) 10GWicke: restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:52:11] (03CR) 10GWicke: [C: 031] restbase: raise threshold for pending compactions [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:52:17] (03PS3) 10GWicke: restbase: raise threshold for pending compactions [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:53:27] !log aude@tin Synchronized php-1.27.0-wmf.7/extensions/Wikidata: Adjust watchlist filter for core change (duration: 00m 30s) [18:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:38] done [18:53:42] * aude checks [18:54:40] looks good [18:54:50] coolio [18:55:13] 6operations, 6Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1818102 (10RobH) 5Open>3Resolved I've created both #ops-eqord and #ops-eqdfw. I then logged into the admin account and added both of those to https://phabricator.wikimedia.org/herald/rule/15... [18:57:05] (03PS1) 10coren: Labs: switch labstore NFS server to explicit LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254176 [18:57:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 [18:57:45] (03CR) 10coren: [C: 04-2] "This can only be merged during the maintenance window." 
[puppet] - 10https://gerrit.wikimedia.org/r/254176 (owner: 10coren) [18:58:22] greg-g: twentyafterfour though i see there was 1 master sync error [18:58:30] idk if this is a problem [18:58:59] aude: master sync error is not anything you need to worry about [18:59:12] thanks for the heads up though! [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151119T1900). Please do the needful. [19:00:05] ok [19:02:48] 6operations, 7Graphite: diamond should send statsd metrics in batches - https://phabricator.wikimedia.org/T116033#1818141 (10fgiunchedi) upstream pull request https://github.com/python-diamond/Diamond/pull/327 [19:11:22] (03CR) 10Merlijn van Deen: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/252506 (https://phabricator.wikimedia.org/T118419) (owner: 10Merlijn van Deen) [19:12:21] I need to deploy 254121, I cannot leave s3 with reduced redundancy for 3 days [19:12:54] do I wait for the train? [19:15:47] (03CR) 10Jcrespo: [C: 031] Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254121 (owner: 10Jcrespo) [19:16:37] (03PS1) 1020after4: wikipedia wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254178 [19:17:02] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254178 (owner: 1020after4) [19:17:30] (03Merged) 10jenkins-bot: wikipedia wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254178 (owner: 1020after4) [19:19:03] (03PS4) 10Rush: hiera-ize labs openstack nova configuration [puppet] - 10https://gerrit.wikimedia.org/r/254056 [19:19:20] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1818221 (10Ejegg) OK @Andrew, I have signed L3. Sorry to take so long! 
[19:19:22] RECOVERY - Hadoop NodeManager on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:24:06] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.7 [19:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:26] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [19:32:56] (03PS5) 10Rush: hiera-ize labs openstack nova configuration [puppet] - 10https://gerrit.wikimedia.org/r/254056 [19:33:42] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1818287 (10Krenair) > IMPORTANT: all project creators are responsible of maintaining the [[ https://www.mediawiki.org/wiki/Phabricator/Creatin... [19:35:27] (03PS2) 10CSteipp: Set password policy for enwiki sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251678 (https://phabricator.wikimedia.org/T119100) [19:37:29] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1818300 (10Krenair) Actually, I just noticed 4 others from the same user. I suggest their permissions be removed. [19:48:03] twentyafterfour can I deploy something really quick? [19:48:14] jynus: yep, I'm all done [19:48:21] thanks! 
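The train step above ("rebuilt wikiversions.php and synchronized wikiversions files: wikipedia wikis to 1.27.0-wmf.7") boils down to rewriting a dbname-to-branch mapping for one wiki family. A simplified sketch of that idea (the dbnames and the helper are invented for illustration; this is not the actual scap code):

```python
def promote(versions: dict, suffix: str, branch: str) -> dict:
    """Point every wiki whose dbname ends with `suffix` at `branch`,
    leaving all other wikis on their current branch."""
    return {db: branch if db.endswith(suffix) else old
            for db, old in versions.items()}

# Hypothetical snapshot before the train: wikipedias still on wmf.6.
wikiversions = {
    "enwiki": "php-1.27.0-wmf.6",
    "eswiktionary": "php-1.27.0-wmf.7",
}
```

Promoting with `suffix="wiki"` moves only the `*wiki` (Wikipedia) dbnames to the new branch, matching what the sync above did.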
[19:48:41] (03PS2) 10Jcrespo: Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254121 [19:49:37] (03CR) 10Jcrespo: [C: 032] Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254121 (owner: 10Jcrespo) [19:51:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1044 after maintenance (duration: 00m 22s) [19:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:53] done, now to check the logs [19:52:19] (03CR) 10Rush: [C: 032] "here goes nothing" [puppet] - 10https://gerrit.wikimedia.org/r/254056 (owner: 10Rush) [19:55:22] (03PS2) 10Andrew Bogott: designate: Stop populating default classes / variables [puppet] - 10https://gerrit.wikimedia.org/r/253807 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:56:23] things look ok, no query pileups, maybe a bit longer than average mean time, but way better availability [20:06:17] twentyafterfour: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Blank_spaces_left_when_second-level_headings_follow_floating_boxes_redux [20:07:12] legoktm: hmm [20:08:48] (03PS1) 10Merlijn van Deen: [DO NOT SUBMIT] test for tool labs puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/254183 [20:09:07] twentyafterfour: last time it was QuickSurveys iirc [20:09:37] is there a phabricator task? [20:09:40] * twentyafterfour searches [20:09:47] https://phabricator.wikimedia.org/T118475 [20:09:50] https://gerrit.wikimedia.org/r/#/c/252736/ [20:09:53] it's not in master WTF [20:10:19] twentyafterfour: can you deploy https://gerrit.wikimedia.org/r/#/c/254184/ ? 
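The db1044 repool above is essentially a one-entry weight change in wmf-config/db-eqiad.php. As a hedged illustration of the concept (the real file is PHP and the hostnames/weights here are made up; this only shows the shape of the change):

```python
# Replica load weights for one DB section; weight 0 means depooled.
section_loads = {"db1038": 100, "db1044": 0}  # db1044 out for maintenance

def repool(loads: dict, host: str, weight: int) -> dict:
    """Return a copy of the load map with `host` set to `weight`."""
    updated = dict(loads)
    updated[host] = weight
    return updated
```

After syncing such a change, the host starts taking read traffic again, which is why jynus then watches for query pileups in the logs.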
[20:10:36] legoktm: ok [20:11:13] (03PS1) 10Rush: updates for nova hiera [puppet] - 10https://gerrit.wikimedia.org/r/254187 [20:13:03] !log upgrading pybal to 1.13.1 on lvs400[12] and failing back traffic [20:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:31] (03CR) 10Andrew Bogott: [C: 031] updates for nova hiera [puppet] - 10https://gerrit.wikimedia.org/r/254187 (owner: 10Rush) [20:13:32] twentyafterfour: ugh, and the tests are broken. Just bypass jenkins for now? [20:13:36] (03CR) 10Rush: [C: 032] updates for nova hiera [puppet] - 10https://gerrit.wikimedia.org/r/254187 (owner: 10Rush) [20:14:07] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1818415 (10Dzahn) [20:14:56] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1446410 (10Dzahn) [20:15:20] legoktm: weird, it doesn't even seem like that test should be applied to this? [20:15:24] yeah I'll force submit [20:15:39] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/254193/ also wasn't merged into master [20:15:43] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1446410 (10Dzahn) [20:15:59] (that doesn't need to be deployed though, just tests >.>) [20:16:24] ah so that would fix the test? [20:17:12] 6operations: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#1818428 (10Dzahn) 3NEW [20:17:31] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [20:17:57] damn I thought I could verified+2 to override jenkins-bot [20:18:01] but it's not overriding [20:18:29] is palladium the prod puppetmaster? 
[20:18:31] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [20:18:36] twentyafterfour: you have to remove the V-1 first. It's merged now [20:18:45] I'm trying to figure out the toollabs/labs equivalent of https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/RefreshPuppetCompiler [20:18:54] valhallasw`cloud: hi [20:18:57] valhallasw`cloud: it is [20:19:01] ok! [20:19:02] legoktm: how did you force that to merge? [20:19:18] twentyafterfour: I deleted the jenkins-bot vote, refreshed, and hit "submit" [20:20:10] (03PS1) 10Dzahn: racktables: switch to new backend and version [puppet] - 10https://gerrit.wikimedia.org/r/254194 (https://phabricator.wikimedia.org/T105555) [20:22:43] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.7/extensions/QuickSurveys/: Deploy https://gerrit.wikimedia.org/r/#/c/254184/ (duration: 00m 19s) [20:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:37] (03PS2) 10Dzahn: racktables: switch to new backend and version [puppet] - 10https://gerrit.wikimedia.org/r/254194 (https://phabricator.wikimedia.org/T105555) [20:24:02] twentyafterfour: looks fixed now, thanks [20:25:13] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1818461 (10GWicke) [20:26:32] (03CR) 10Dzahn: [C: 032] racktables: switch to new backend and version [puppet] - 10https://gerrit.wikimedia.org/r/254194 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [20:26:32] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 52 failures [20:28:50] !log racktables - in migration - brb [20:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:42] (03PS4) 10Filippo Giunchedi: restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 
(https://phabricator.wikimedia.org/T118976) [20:38:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [20:39:22] (03PS4) 10Filippo Giunchedi: restbase: raise threshold for pending compactions [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) [20:39:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: raise threshold for pending compactions [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [20:41:23] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [20:41:52] ^I know [20:44:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [20:48:00] (03PS1) 10Merlijn van Deen: diamond: fix up dependencies [puppet] - 10https://gerrit.wikimedia.org/r/254287 [20:48:04] YuviPanda: ^ that :-p [20:48:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [20:48:22] (03PS1) 10Jcrespo: Racktables user placeholder [puppet] - 10https://gerrit.wikimedia.org/r/254288 [20:48:31] valhallasw`cloud: looking at diamond now [20:48:43] it's not that important, just something that fails on the first run [20:48:55] (03CR) 10jenkins-bot: [V: 04-1] diamond: fix up dependencies [puppet] - 10https://gerrit.wikimedia.org/r/254287 (owner: 10Merlijn van Deen) [20:49:23] ok [20:49:26] valhallasw`cloud: let's do fastapt then [20:49:31] ok! 
[20:49:43] let's switch to -labs [20:50:08] (03CR) 10Dzahn: "thank you, i would like the TRIGGER privilege because release notes for 0.20.7 say "Database triggers are used for some data consistency m" [puppet] - 10https://gerrit.wikimedia.org/r/254288 (owner: 10Jcrespo) [20:53:25] (03PS1) 10Andrew Bogott: Define labs dns/designate servers for codfw. [puppet] - 10https://gerrit.wikimedia.org/r/254289 [20:53:50] (03PS2) 10Andrew Bogott: Define labs dns/designate servers for codfw. [puppet] - 10https://gerrit.wikimedia.org/r/254289 [20:54:08] (03PS2) 10Merlijn van Deen: diamond: fix up dependencies [puppet] - 10https://gerrit.wikimedia.org/r/254287 [20:54:32] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: puppet fail [20:54:38] (03CR) 10Rush: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/254289 (owner: 10Andrew Bogott) [20:55:02] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:55:56] (03PS2) 10Jcrespo: Racktables user placeholder [puppet] - 10https://gerrit.wikimedia.org/r/254288 [20:56:53] (03CR) 10Andrew Bogott: [C: 032] Define labs dns/designate servers for codfw. [puppet] - 10https://gerrit.wikimedia.org/r/254289 (owner: 10Andrew Bogott) [21:02:02] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1818565 (10greg) >>! In T706#1818300, @Krenair wrote: > Actually, I just noticed 4 others from the same user. I suggest their permissions be r... [21:03:51] (03PS1) 10Andrew Bogott: labs salt-master: Just pull designate_hostname right from a hiera global. 
[puppet] - 10https://gerrit.wikimedia.org/r/254290 [21:04:28] chasemp: ^ that’s the other bit to fix that bug [21:04:49] ah ok [21:06:29] (03PS1) 10Rush: openstackclient package declaration dupe [puppet] - 10https://gerrit.wikimedia.org/r/254292 [21:06:35] (03CR) 10Andrew Bogott: [C: 032] labs salt-master: Just pull designate_hostname right from a hiera global. [puppet] - 10https://gerrit.wikimedia.org/r/254290 (owner: 10Andrew Bogott) [21:09:41] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:10:19] (03PS2) 10Rush: openstackclient package declaration dupe [puppet] - 10https://gerrit.wikimedia.org/r/254292 [21:10:34] (03CR) 10Rush: [C: 032 V: 032] openstackclient package declaration dupe [puppet] - 10https://gerrit.wikimedia.org/r/254292 (owner: 10Rush) [21:13:54] !log racktables moved to krypton and upgraded to 0.20.6 [21:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:48] (03CR) 10Dzahn: "i just upgraded to 0.20.6 for now. that does not need the TRIGGER priv. for reasons i dont know yet the upgrade did not work for versions " [puppet] - 10https://gerrit.wikimedia.org/r/254288 (owner: 10Jcrespo) [21:17:21] (03PS1) 10Odder: Add wikiquote.pl, link to wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/254294 [21:18:03] (03CR) 10Odder: [C: 04-1] "Needs to wait till Monday when Doni from MarkMonitor returns from holiday." [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:18:35] (03PS2) 10Odder: Add wikiquote.pl, link to wikiquote.org [dns] - 10https://gerrit.wikimedia.org/r/254294 [21:18:56] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1818621 (10Dzahn) For some reason the upgrade to 0.20.10 did not work, just up to 0.20.6. maybe related to the new mysql privs mentioned in the change log of 0.20.7 but jcrespo had these added too and it still w... 
[21:21:55] (03PS1) 10Merlijn van Deen: puppet/apt: automatically update packages (hiera-configurable) [puppet] - 10https://gerrit.wikimedia.org/r/254295 [21:22:22] YuviPanda: ^. I'll also fire up puppet-compiler [21:22:29] to make sure this doesn't break prod =p [21:23:18] valhallasw`cloud: heh :D [21:23:22] RECOVERY - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is OK: TCP OK - 0.038 second response time on port 9042 [21:24:22] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:24:41] (03PS1) 10Rush: remove labs::openstack::nova::common from nodepool [puppet] - 10https://gerrit.wikimedia.org/r/254297 [21:25:32] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:25:47] valhallasw`cloud: am typing in a nice long comment now [21:25:53] <3 [21:26:02] PROBLEM - SSH on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:02] PROBLEM - RAID on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:07] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1818637 (10Dzahn) 5Open>3Resolved racktables now lives on krypton, a ganeti VM. removed files on magnesium. a tarball and the old Apache config are in /root just in case. [21:26:29] (03CR) 10Yuvipanda: "I'm vaguely ambivalent about this change. Other options to consider and possible drawbacks:" [puppet] - 10https://gerrit.wikimedia.org/r/254295 (owner: 10Merlijn van Deen) [21:26:31] PROBLEM - configured eth on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:32] PROBLEM - DPKG on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:32] PROBLEM - Check size of conntrack table on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:37] valhallasw`cloud: ^ I offered another option! [21:26:51] PROBLEM - dhclient process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:27:02] YuviPanda: fair point [21:27:02] PROBLEM - puppet last run on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:27:20] valhallasw`cloud: so on prod we have a 'dpkg is broken?' icinga check [21:27:25] valhallasw`cloud: we can do that for labs too maybe [21:27:41] uuuh [21:27:46] what does that check? [21:27:46] valhallasw`cloud: that way we'll get a different alert on whether it is dpkg that's broken or if it is puppet [21:27:50] let me find out [21:28:01] because just 'packages are broken' doesn't tell us much [21:28:16] also, the reason for the 20 min schedule is because I don't want to pssh to all of tool labs [21:28:36] 6operations, 7Database: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1818642 (10Dzahn) the migration is done. access can now be restricted to coming from krypton.eqiad.wmnet (10.64.32.182). we upgraded to 0.20.6 but only 0.20.7 and above need the TRIGGER pri... [21:29:15] valhallasw`cloud: we can probably customize unattended upgrades schedule to be quicker [21:29:24] yeah [21:29:53] baaah puppet compiler counts this as a change [21:30:21] (03CR) 10Rush: [C: 032] remove labs::openstack::nova::common from nodepool [puppet] - 10https://gerrit.wikimedia.org/r/254297 (owner: 10Rush) [21:30:23] 6operations, 7Database: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1818643 (10Dzahn) a:5Dzahn>3jcrespo fyi.. low priority. see comment above. feel free to assign "up for grabs" or how you prefer it.
[21:30:55] (03CR) 10Merlijn van Deen: "puppet-compiler result: https://puppet-compiler.wmflabs.org/1329/labmon1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/254295 (owner: 10Merlijn van Deen) [21:31:28] !log magnesium - apt-get upgrade [21:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:01] PROBLEM - HHVM processes on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:32:10] PROBLEM - nutcracker port on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:32:12] valhallasw`cloud: dpkg -l|grep '^[uirph]'|egrep -v '^(ii|rc)' [21:32:14] is what it checks [21:32:23] so that's output of dpkg -l where the answer isn't ii or rc [21:32:23] valhallasw`cloud: the "dpkg" check is basically running dpkg -l [21:32:31] PROBLEM - nutcracker process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:32:31] PROBLEM - salt-minion processes on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:32:31] and checking for.. you beat me to it [21:32:34] anything that isnt ii [21:32:52] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:38] mutante: do you know what is 'rc' [21:33:45] right, so that checks for packages that did not install correctly [21:34:01] PROBLEM - HTTPS on magnesium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [21:34:18] remove / conf-files [21:34:19] YuviPanda: removed but has config files [21:34:24] apt-get remove but without --purge [21:34:29] ah [21:34:30] PROBLEM - Disk space on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:34:31] ok [21:34:34] Desired=Unknown/Install/Remove/Purge/Hold [21:34:35] | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend [21:34:35] |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) [21:34:53] unknown is when it's broken, I presume? [21:34:59] yea, you can see those at the top of the dpkg -l output [21:35:09] yes, or installed/unpacked, or installed/half-conf, etc [21:35:28] yep, any of the uncommon combos [21:35:40] yeah [21:35:42] RECOVERY - nutcracker process on mw1132 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:35:42] RECOVERY - salt-minion processes on mw1132 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:35:53] valhallasw`cloud: so maybe we can use unattended upgrades and just have dpkg checks [21:36:09] YuviPanda: mmmm. [21:36:14] the first letter is what is "desired" and the second what is actually the case [21:36:17] I'm not sure if that catches everything [21:36:24] because apt-get will not upgrade when there's a conflict [21:36:26] mutante: aaah, that explains the ii [21:36:29] but that will not show up in dpkg -l [21:36:29] yea [21:36:31] * YuviPanda TILs [21:36:47] valhallasw`cloud: what kind of conflict? and I think that'll break the package that conflicts, no? 
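[editor's sketch] The 'dpkg is broken?' check quoted above can be demonstrated end to end. The package names and state codes below are hypothetical sample lines in `dpkg -l` format (first letter = desired state, second = actual state, as per the header pasted at 21:34:34); the filter itself is the one mutante quoted from the Icinga check:

```shell
#!/bin/sh
# Hypothetical sample of `dpkg -l` package lines. "ii" = installed OK,
# "rc" = removed but config files remain; anything else is an
# uncommon desired/actual combination worth alerting on.
sample='ii  bash
rc  old-daemon
iU  broken-unpacked
iF  broken-halfconf'

# Same filter as the Icinga DPKG check: keep package lines, then flag
# any state other than the two normal ones.
broken=$(printf '%s\n' "$sample" | grep '^[uirph]' | grep -Ev '^(ii|rc)')

if [ -n "$broken" ]; then
    printf 'CRITICAL: packages in unexpected state:\n%s\n' "$broken"
else
    echo 'All packages OK'
fi
```

On the sample above this flags the `iU` (unpacked, not configured) and `iF` (half-configured) lines while letting `ii` and `rc` pass, which matches the "All packages OK" wording the bot reports when nothing is flagged.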
[21:36:59] YuviPanda: no, because it would be conflicted *after* installation [21:37:42] typical example is package A requires B < 1, new version of package C requires B > 1, so apt keeps the old version of package C [21:38:07] I don't think that should ever happen with the official repos, but I'm not 100% sure [21:39:15] yeah, I think if that happens maybe dpkg -l might change since it is 'wanted' to be in a different state than it is in [21:40:39] (03CR) 10Yuvipanda: "Another potential drawback with long running unattended upgrades is that puppet will fail if it runs concurrently because of the apt lock " [puppet] - 10https://gerrit.wikimedia.org/r/254295 (owner: 10Merlijn van Deen) [21:40:50] valhallasw`cloud: ^ another issue with unattended upgrades [21:41:30] PROBLEM - nutcracker process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:30] PROBLEM - salt-minion processes on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:48] csteipp, andrewbogott, et al, terbium host ssh key changed. Known or evil?
[21:41:59] * YuviPanda MITMs matt_flaschen [21:42:11] RECOVERY - dhclient process on mw1132 is OK: PROCS OK: 0 processes with command name dhclient [21:42:13] matt_flaschen: it has been reinstalled recently [21:42:30] RECOVERY - HHVM processes on mw1132 is OK: PROCS OK: 6 processes with command name hhvm [21:42:31] RECOVERY - SSH on mw1132 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [21:42:40] RECOVERY - nutcracker port on mw1132 is OK: TCP OK - 0.000 second response time on port 11212 [21:42:41] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 49 minutes ago with 0 failures [21:42:52] RECOVERY - configured eth on mw1132 is OK: OK - interfaces up [21:43:05] YuviPanda: mmm, right [21:43:11] RECOVERY - nutcracker process on mw1132 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:43:11] RECOVERY - salt-minion processes on mw1132 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:43:14] YuviPanda: we should check what unattended-upgrades uses to apt-get though [21:43:16] and copy that [21:43:20] matt_flaschen: Oh fun. I get that too. 9b:b0:ca:aa:67:81:7e:7e:81:f8:dd:af:7e:f6:73:c9. Maybe someone in ops can verify? 
[21:43:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:43:30] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.646 second response time [21:43:31] RECOVERY - Check size of conntrack table on mw1132 is OK: OK: nf_conntrack is 0 % full [21:43:31] RECOVERY - DPKG on mw1132 is OK: All packages OK [21:43:33] Thanks, mutante [21:43:41] RECOVERY - Disk space on mw1132 is OK: DISK OK [21:43:51] RECOVERY - RAID on mw1132 is OK: OK: no RAID installed [21:44:01] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 66389 bytes in 0.537 second response time [21:44:11] (03PS2) 10Rush: puppet/apt: automatically update packages (hiera-configurable) [puppet] - 10https://gerrit.wikimedia.org/r/254295 (owner: 10Merlijn van Deen) [21:44:50] YuviPanda: ah, but! we /can/ call unattended-upgrades as part of the puppet run [21:45:09] matt_flaschen: 2048 9b:b0:ca:aa:67:81:7e:7e:81:f8:dd:af:7e:f6:73:c9 [21:45:16] csteipp, I get: [21:45:16] The fingerprint for the ECDSA key sent by the remote host is [21:45:16] 3e:1f:ea:cf:b5:7f:4f:76:76:2d:e2:ca:f1:8e:69:1f. [21:45:16] But maybe it depends how you bastion. [21:45:17] I use bast1001.wikimedia.org [21:45:55] I'm wondering if jenkins is having issues [21:46:04] YuviPanda: unattended-upgrades is creepy. It calls apt internals directly. [21:46:20] (03PS1) 10Odder: Redirect wikiquote.pl to pl.wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/254305 [21:46:20] heh [21:46:23] mutante, hmm, doesn't match what I'm seeing. [21:46:28] valhallasw`cloud: well it is part of apt no? [21:46:34] I suppose [21:46:45] bast1001 is fine. [21:46:59] (03CR) 10Odder: [C: 04-1] "Needs to wait till Monday when Doni from MarkMonitor returns from holiday."
[puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [21:47:21] matt_flaschen: csteipp: [21:47:22] (03PS1) 10Rush: nodepool add password include [puppet] - 10https://gerrit.wikimedia.org/r/254306 [21:47:23] 256 3e:1f:ea:cf:b5:7f:4f:76:76:2d:e2:ca:f1:8e:69:1f root@terbium (ECDSA) [21:47:33] that one is correct for terbium [21:47:53] the difference is that we have RSA vs. ECDSA keys [21:47:57] when i pasted above [21:48:04] (03CR) 10Rush: [C: 032 V: 032] nodepool add password include [puppet] - 10https://gerrit.wikimedia.org/r/254306 (owner: 10Rush) [21:48:11] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 12 failures [21:48:30] YuviPanda: I actually sort of like calling unattended-upgrade directly [21:48:43] csteipp: the one that matt gets is correct but yours is different [21:48:44] haha [21:48:46] why valhallasw`cloud [21:48:48] [terbium:~] $ ssh-keygen -l -f /etc/ssh/ssh_host_ecdsa_key.pub [21:48:51] btw [21:49:17] (03CR) 10jenkins-bot: [V: 04-1] Redirect wikiquote.pl to pl.wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [21:49:19] mutante, isn't his the RSA one you gave? [21:49:28] YuviPanda: because then we do have the ubuntu logic of what is safe to upgrade, but within the timing, locking and reporting framework of the puppet run [21:49:37] mutante, thanks for confirming the ECDSA, BTW. [21:50:07] matt_flaschen: yes, true. that's it [21:50:12] csteipp: confirmed. that's the RSA key [21:50:21] valhallasw`cloud: mmmm [21:50:32] YuviPanda: and e.g. allows to configure an upgrade blacklist, error emails, ... [21:50:40] not sure if the error emails are useful [21:50:44] but 'plz reboot' ones could be [21:50:55] mmm [21:51:11] so I'm totally sold on unattended upgrades :D [21:51:25] question is if we should call it from puppet-run or not [21:51:34] if it's ok to call it from puppet-run, that kills most of my objections [21:51:37] wanna update patch?
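[editor's sketch] The apt-lock problem being debated above (puppet dying if unattended-upgrades holds the dpkg lock, or vice versa) can be avoided by serialising every apt caller on one shared lock, which is effectively what running the upgrade from puppet-run achieves. The lock path and wrapped commands below are illustrative, not the actual puppet-run implementation:

```shell
#!/bin/sh
# Serialise apt work on a single flock(1): whichever caller arrives
# second blocks until the lock is free instead of failing on
# /var/lib/dpkg/lock. LOCKFILE here is a placeholder path.
LOCKFILE="${LOCKFILE:-/tmp/apt-run.lock}"

run_with_apt_lock() {
    # flock creates the lock file if needed, waits for it, then runs
    # the given command while holding it.
    flock "$LOCKFILE" "$@"
}

# A puppet-run wrapper and an upgrade cron job would each wrap their
# apt work the same way, e.g.:
#   run_with_apt_lock unattended-upgrade
#   run_with_apt_lock puppet agent --onetime --no-daemonize
result=$(run_with_apt_lock echo 'apt work done')
echo "$result"
```

Calling unattended-upgrade sequentially inside puppet-run (as valhallasw`cloud's patch ends up doing) is the degenerate case of this: one process, so the lock can never be contended.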
[21:51:43] ya [21:51:48] how about calling it from a cronjob which puppet creates [21:51:52] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:52:12] mutante: the problem with that is locking again - if it is running while puppet is also running puppet will fail [21:52:23] ah, right [21:52:36] so locking is a problem [21:52:37] well the cron job could be "disable puppet, do stuff, enable puppet" i guess :p [21:52:51] mutante: won't work if puppet has started but hasn't done any apt calls :) [21:53:02] hrmm, *nod* [21:53:09] so puppet starts doing things, middle of it the cronjob starts, locks apt, puppet tries apt and dies [21:53:20] we can forcibly kill puppet but that'll count as a fail too :D [21:53:25] maybe it has to check if puppet is running first [21:53:29] by looking for the lock file [21:53:51] and if it finds one.. wait and try again X minutes later [21:53:55] a lot more complicated than putting it in puppet-run :) [21:54:04] this is also why apt-get update is in puppet-run [21:54:18] ok, i thought you were trying to find a way around running it each and every time [21:54:25] yea [21:55:56] also, let me try what unattended-upgrades wants to do on tool-labs currently [21:56:02] bastion-01* [21:56:04] (03PS3) 10Merlijn van Deen: puppet/apt: run unattended-upgrades before run (hiera-configurable) [puppet] - 10https://gerrit.wikimedia.org/r/254295 [21:56:05] probably nothing [21:56:54] (03PS4) 10Merlijn van Deen: puppet/apt: run unattended-upgrades before puppet (hiera-configurable) [puppet] - 10https://gerrit.wikimedia.org/r/254295 [21:57:21] mutante, tin reinstalled too? "ECDSA key fingerprint is 1d:ed:e4:af:ea:a1:c5:cd:9e:eb:89:29:a5:20:32:d6." [21:57:35] Would be good to have these somewhere for reference. [21:57:43] matt_flaschen: hmm. no. 
uptime 296 days [21:57:44] ^ +1 [21:57:59] the bastion hosts at least are on wiki already [21:58:02] Hmm [21:58:25] 256 1d:ed:e4:af:ea:a1:c5:cd:9e:eb:89:29:a5:20:32:d6 root@tin (ECDSA) [21:58:30] confirmed [21:59:16] I'm on a new machine, and maybe the old one that I copied the known hosts from used RSA. [21:59:21] It depends on the client, right? [21:59:54] Thanks [22:00:04] yes, so the server has all these: [22:00:11] dsa, ecdsa, ed25519 and rsa [22:00:33] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1818735 (10GWicke) @bblack: If there was a reasonably clean way to differentiate the limits between the action API & the rest_v1 API, would you be open to me creating a patch to do so? [22:01:27] YuviPanda: although unattended-upgrades is slow, even with --dry-run O_o [22:02:11] hm, no, that's my fault [22:03:10] (03Abandoned) 10Andrew Bogott: Add hiera config for labtest cluster. [puppet] - 10https://gerrit.wikimedia.org/r/252633 (owner: 10Andrew Bogott) [22:07:02] YuviPanda: apparently '--dry-run' means 'download the package but do not install it' [22:07:13] that explains why it took so long :P [22:07:31] anyway, should be fine after an initial run, I think [22:07:33] but for now: bed! [22:07:39] let's leave this patch to simmer for a bit [22:07:55] valhallasw`cloud: hahaha :) ok [22:09:25] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1322/ the changes are the source for this file i'm moving.
the fail on one host is unrelated due to se" [puppet] - 10https://gerrit.wikimedia.org/r/253457 (owner: 10Dzahn) [22:11:07] !log deployed patch for T97897 [22:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:42] (03PS1) 10Ori.livneh: Add jq and tmux to standard packages; remove all other references [puppet] - 10https://gerrit.wikimedia.org/r/254310 [22:13:25] !log deployed patch for T118032 [22:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:47] valhallasw`cloud: <3 thanks [22:14:26] (03CR) 10Dzahn: [C: 032] deactivate wikidata.pt [dns] - 10https://gerrit.wikimedia.org/r/254042 (owner: 10Dzahn) [22:15:47] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1818822 (10GWicke) > This varnish-level limiting is really just an outer layer of protection against request-rates of unreasonable scale, to protect the inner layers of our architecture from da... 
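[editor's sketch] The host-key confusion resolved above comes down to the fact that a server keeps one host key per algorithm (dsa, ecdsa, ed25519, rsa), each with its own fingerprint, so two clients can honestly report different fingerprints for the same host. The verification is the `ssh-keygen -l` command mutante ran on terbium; this sketch generates a throwaway ECDSA key purely for demonstration, whereas on a real host you would point `-f` at each `/etc/ssh/ssh_host_*_key.pub`:

```shell
#!/bin/sh
# Create a throwaway ECDSA host key in a temp dir (demo only; real
# host keys live in /etc/ssh/ and are generated at install time).
tmpdir=$(mktemp -d)
ssh-keygen -q -t ecdsa -N '' -f "$tmpdir/ssh_host_ecdsa_key"

# -l prints "<bits> <fingerprint> <comment> (<algorithm>)" for a key
# file, which is what you compare against the client's warning.
fp=$(ssh-keygen -l -f "$tmpdir/ssh_host_ecdsa_key.pub")
echo "$fp"

rm -rf "$tmpdir"
```

Whether the client shows you the RSA or the ECDSA fingerprint depends on which host-key algorithm it negotiates, which is why matt_flaschen and csteipp each saw a valid but different fingerprint.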
[22:19:22] PROBLEM - puppet last run on es2010 is CRITICAL: CRITICAL: puppet fail [22:20:41] (03CR) 10Yuvipanda: [C: 031] Add jq and tmux to standard packages; remove all other references [puppet] - 10https://gerrit.wikimedia.org/r/254310 (owner: 10Ori.livneh) [22:21:43] !log deployed patch for T118682 [22:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:26:52] (03PS3) 10Yuvipanda: Remove manifests/stages.pp [puppet] - 10https://gerrit.wikimedia.org/r/231143 (owner: 10Faidon Liambotis) [22:27:41] (03CR) 10Muehlenhoff: [C: 031] Add jq and tmux to standard packages; remove all other references [puppet] - 10https://gerrit.wikimedia.org/r/254310 (owner: 10Ori.livneh) [22:28:15] (03CR) 10Ori.livneh: [C: 032] Add jq and tmux to standard packages; remove all other references [puppet] - 10https://gerrit.wikimedia.org/r/254310 (owner: 10Ori.livneh) [22:28:21] (03PS3) 10Andrew Bogott: designate: Stop populating default classes / variables [puppet] - 10https://gerrit.wikimedia.org/r/253807 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [22:29:53] (03CR) 10Andrew Bogott: [C: 032] designate: Stop populating default classes / variables [puppet] - 10https://gerrit.wikimedia.org/r/253807 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [22:30:25] (03PS4) 10Yuvipanda: Remove manifests/stages.pp [puppet] - 10https://gerrit.wikimedia.org/r/231143 (owner: 10Faidon Liambotis) [22:31:04] (03CR) 10Yuvipanda: [C: 032 V: 032] "KILL KILL KILL KILL KILL KILL" [puppet] - 10https://gerrit.wikimedia.org/r/231143 (owner: 10Faidon Liambotis) [22:31:36] !log deployed followup for T109724 [22:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:51] RECOVERY - puppet last run on es2010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:47:46] <_joe_> what's up with pages I am getting and not coming out here [22:48:35] <_joe_> it's just some hours 
late [22:49:19] <_joe_> oh no it was the slave lag, which didn't show here [22:49:24] <_joe_> in fact [22:49:43] _joe_: there's been a bug that has been making things not show up in IRC [22:49:46] if they page [22:49:52] <_joe_> ... [22:50:04] I should file it now [22:50:06] <_joe_> that's luckily not true for most things [22:51:13] _joe_: https://phabricator.wikimedia.org/T118072 [22:51:17] but go back to your vacation [22:51:35] 6operations, 7Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819078 (10yuvipanda) Happened again. [22:51:37] <_joe_> a vacation is not a real vacation without a page [22:52:29] the slave lag stuff, standalone I've been ignoring unless there isn't a recovery in a few mins [22:52:36] and so far there has been, 100% of the time [23:17:54] (03PS2) 10Dzahn: apache: remove wikimemory.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/254059 [23:18:17] (03CR) 10Dzahn: [C: 032] apache: remove wikimemory.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/254059 (owner: 10Dzahn) [23:20:05] (03PS2) 10EBernhardson: Turn on language detection user test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254070 (https://phabricator.wikimedia.org/T118290) [23:20:32] (03PS2) 10Dzahn: apache: remove wikipaedia.net redirect [puppet] - 10https://gerrit.wikimedia.org/r/254060 [23:24:43] (03CR) 10Dzahn: [C: 032] apache: remove wikipaedia.net redirect [puppet] - 10https://gerrit.wikimedia.org/r/254060 (owner: 10Dzahn) [23:27:42] (03PS1) 10Jhobs: Enable new QuickSurveys survey on mobile enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254322 (https://phabricator.wikimedia.org/T118881) [23:28:46] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1819236 (10Krenair) After a brief discussion with the user involved, #Gadgets-2.0 has been reopened as a release 
project. I am going to propos... [23:34:19] (03PS1) 10JanZerebecki: grafana: disable gravatar integration [puppet] - 10https://gerrit.wikimedia.org/r/254324 [23:35:25] (03CR) 10JanZerebecki: "Docs: http://docs.grafana.org/installation/configuration/#disable_gravatar" [puppet] - 10https://gerrit.wikimedia.org/r/254324 (owner: 10JanZerebecki) [23:36:25] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1819264 (10jcrespo) @GWicke Please inform me how I can get [[ https://phabricator.wikimedia.org/T118186#1797079 | recentchanges's literally 2 million filter options ]] in restbase and I will pe... [23:38:19] 6operations, 7Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819304 (10Dzahn) here's the status turning CRIT on db1051 https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=db1051 and here are the notifications sent out for it: https:/... [23:49:42] (03PS1) 10Ori.livneh: redis::instance: support hash configuration values [puppet] - 10https://gerrit.wikimedia.org/r/254327 [23:55:53] (03CR) 10Ori.livneh: "Confirmed no-op on existing redis::instance user (xenon role): https://puppet-compiler.wmflabs.org/1333/" [puppet] - 10https://gerrit.wikimedia.org/r/254327 (owner: 10Ori.livneh) [23:58:25] 6operations, 7Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819423 (10Dzahn) - There is a contact group called "admins". It has only a single member, the contact called "irc". - The contact "irc" has notification commands to write into the logf... [23:58:35] 6operations, 7Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819424 (10Dzahn) a:5akosiaris>3Dzahn