[00:31:26] PROBLEM - puppet last run on db2055 is CRITICAL puppet fail
[00:38:26] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[00:51:27] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp4015_v6
[00:53:26] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 44 ESP OK
[00:56:07] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[00:58:33] (PS1) Yurik: Fixed RE for maps referrer [puppet] - https://gerrit.wikimedia.org/r/231978
[01:37:27] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 15 connecting: cp1046_v6
[01:39:18] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 16 ESP OK
[02:05:37] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:25:26] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:42] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 11m 08s)
[02:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:19] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-17 02:29:19+00:00
[02:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:16] RECOVERY - RAID on snapshot1002 is OK no RAID installed
[02:33:16] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 15 not-conn: cp3017_v6
[02:35:07] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 16 ESP OK
[02:42:35] operations, Wikimedia-General-or-Unknown: Server at http://upload.wikimedia.org isn't HTTP/1.1 compliant: doesn't allow absolute URI in request - https://phabricator.wikimedia.org/T51467#1543865 (Krenair)
[02:51:06] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[02:52:56] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 496 bytes in 0.015 second response time
[02:55:28] operations, Traffic, Wikimedia-General-or-Unknown: Server at http://upload.wikimedia.org isn't HTTP/1.1 compliant: doesn't allow absolute URI in request - https://phabricator.wikimedia.org/T51467#1543874 (Krenair) This seems to have been fixed now, tested example request via `openssl s_client -connect...
[02:57:28] operations, Traffic, Wikimedia-General-or-Unknown: Server at http://upload.wikimedia.org isn't HTTP/1.1 compliant: doesn't allow absolute URI in request - https://phabricator.wikimedia.org/T51467#1543876 (Krenair) Open>Resolved a:Krenair
[02:57:43] operations, Traffic, Wikimedia-General-or-Unknown: Server at http://upload.wikimedia.org isn't HTTP/1.1 compliant: doesn't allow absolute URI in request - https://phabricator.wikimedia.org/T51467#534698 (Krenair) a:Krenair>None
[03:06:38] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[03:08:27] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18602 bytes in 1.094 second response time
[04:03:58] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 20.69% of data above the critical threshold [100000000.0]
[04:11:38] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 1764 bytes in 0.922 second response time
[04:12:41] looking into the phab issue...
[04:13:15] twentyafterfour: the disk is full
[04:14:14] I see that
[04:14:35] but I don't know why ...
not that much /var usage
[04:15:04] i freed a little bit with apt-get clean and gzipping a log
[04:15:29] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21128 bytes in 0.208 second response time
[04:15:50] twentyafterfour: it's just pretty small partition
[04:15:59] \/var/log is 5.4G
[04:16:19] yea, 3.5 of which is apache
[04:16:31] yeah that partition size is insane
[04:16:34] and the entire / is just like 10
[04:16:42] but we stepped up the log rotation schedule that should have fixed it
[04:17:11] ohh I forgot to sudo my du ...it was saying /var was just 2.8 gig
[04:17:18] sudo du ...says different ;)
[04:17:44] !log free some disk space on iridium. apt-get clean; gzip /var/log/account/pacct.0; some apache logs .;.
[04:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:17:52] twentyafterfour: it's getting crawled by googlebot
[04:17:56] RECOVERY - Disk space on iridium is OK: DISK OK
[04:18:07] also var/log/account
[04:18:09] sudo tail -f /var/log/apache2/phabricator_access.log and you'll see
[04:19:25] gzip'ed access.log.1 too
[04:19:59] zcat
[04:20:01] let's not do 9 GB root partitions, that's just silly
[04:20:06] 2G are back
[04:20:08] ori: no shit
[04:20:34] especially since this machine has plenty of storage
[04:20:53] there's 400G on /srv
[04:20:54] when filling /var can kill a service it's insane to have /var on a small partition
[04:21:38] I don't know what to do about it but this is going to be a problem until we move var off of /
[04:21:44] or resize /
[04:22:36] phabricator usage is going to keep increasing, iridium is over-provisioned so it can handle a lot more load, but that partition can't handle current load
[04:23:02] * ori recommends http://i.imgur.com/JHVXx8o.png
[04:23:03] do we need those apache logs ?
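
The du mix-up above (2.8 gig without sudo, more with it) is what happens when a scan silently skips directories it cannot read. A minimal Python sketch of a per-directory size scan that also counts what it could not read; `usage_by_subdir` and the throwaway demo tree are made up for illustration and say nothing about iridium's real layout. It sums apparent file sizes, so the numbers will not match du's block counts exactly.

```python
import os
import tempfile


def usage_by_subdir(root):
    """Return ({subdir: bytes}, unreadable_count) for the directories directly under root."""
    unreadable = 0

    def count_error(err):
        # os.walk calls this for directories it cannot list (e.g. permission denied)
        nonlocal unreadable
        unreadable += 1

    totals = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        size = 0
        for dirpath, _dirs, files in os.walk(entry.path, onerror=count_error):
            for name in files:
                try:
                    # apparent size in bytes, not allocated disk blocks
                    size += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    unreadable += 1
        totals[entry.name] = size
    return totals, unreadable


# Demo on a throwaway tree (hypothetical names and sizes, not the real /var/log)
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "apache2"))
os.makedirs(os.path.join(demo_root, "account"))
with open(os.path.join(demo_root, "apache2", "access.log"), "wb") as fh:
    fh.write(b"x" * 3500)
with open(os.path.join(demo_root, "account", "pacct"), "wb") as fh:
    fh.write(b"x" * 500)

demo_totals, demo_unreadable = usage_by_subdir(demo_root)
for name, size in sorted(demo_totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:>8} {name}")
print("unreadable paths:", demo_unreadable)
```

A non-zero unreadable count is the hint that the totals are an undercount and the scan should be repeated with more privileges.
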
[04:23:32] mutante: no need for keeping much history but they are helpful in debugging some things
[04:23:47] and retaining standard auditing capabilities
[04:23:53] stuff from 2014 should probably be gone though
[04:24:00] (PS1) Alex Monk: Clear up usage of --wiki parameter to mwscript [mediawiki-config] - https://gerrit.wikimedia.org/r/231982 (https://phabricator.wikimedia.org/T90343)
[04:24:20] mutante: yes, I don't know what kind of retention we need. For my purposes 1 day or so is plenty
[04:24:45] no LVM here unfortunately
[04:24:52] ori: nice
[04:26:03] I suppose we should exclude /multimeter via robots.txt
[04:26:11] I don't think google should be indexing that
[04:26:15] is it ok for now? or we can move more logs to /srv
[04:26:35] plenty of space there
[04:27:02] cp -R /var/log/apache2 /srv/logs ; service apache2 stop ; rm -rf /var/log/apache2 ; ln -sf /srv/logs /var/log/apache2 ; service apache2 start
[04:27:04] there even is /srv/logs already
[04:27:49] and maybe a note about it in a comment in iridium's entry in site.pp so we don't make the same mistake when we provision phab elsewhere
[04:28:17] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:49] arguably, log spam is silly too. raise it from 9G to 90G and someone will still eventually fill the thing with log spam. the problem isn't the size so much as the spam.
[04:29:20] if we really need giant logs for auditing whatever, then yeah, explicitly put them on a larger FS that handles the expected size + rotation schedule I guess, as with apache2->srv above.
[04:29:22] (CR) Ori.livneh: Clear up usage of --wiki parameter to mwscript (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/231982 (https://phabricator.wikimedia.org/T90343) (owner: Alex Monk)
[04:29:36] 90 days max, i suppose
[04:29:50] but I don't think supersizing rootfs in the general case is an answer
[04:30:48] no, i wouldn't suggest that.
but it doesn't hurt to have some breathing room to accommodate unforeseeable changes in log verbosity
[04:30:57] 9 GB is just living on a razor's edge for no good reason IMO
[04:31:30] why would the change in log verbosity be unforeseeable?
[04:31:34] a few gigabytes of additional space and google's crawl would have been accomodated, gzipped, and rotated without us caring or having to care
[04:31:36] /var/log/apache2# mv phabricator_access.log*.gz /srv/logs/
[04:31:41] done. for now
[04:31:46] the rest can be on ticket
[04:32:00] don't forget to SIGHUP apache (or whatever signal makes it reopen its log file handles)
[04:32:15] no need, i only moved rotated logs
[04:32:16] RECOVERY - RAID on snapshot1002 is OK no RAID installed
[04:32:18] well, arguably apache logs for a public service don't belong on rootfs anyways
[04:32:35] mostly I think of this whole thing in the context of over-noisy daemon traffic to syslog (which has hit us a couple times)
[04:33:11] in contrast to "keeping more apache logs makes debugging easier", that problem is the opposite: more pointless noisy spam in syslog makes it hard to see real things.
[04:34:29] bblack: are you suggesting we shouldn't even store apache access and error logs? I'm confused
[04:34:50] do we want to let googlebot do what its' doing or not
[04:35:00] mutante: I don't think so
[04:35:19] I'm excluding /multimeter/ via phabricator robots.txt
[04:35:20] so robots.txt edit?
[04:35:23] *nod*
[04:35:35] yeah already submitted the patch to gerrit
[04:35:38] twentyafterfour: no, I'm saying if it's an http access log that might get crazy-huge in the course of serving legit traffic, maybe it shouldn't be on the root fs
[04:36:27] https://gerrit.wikimedia.org/r/#/c/231983/
[04:36:54] in any case, this still seems pathological. has anyone looked at what google scanned?
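
The robots.txt exclusion of /multimeter/ mentioned above can be sanity-checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the rule text below is an assumption, since the content of the actual patch (gerrit 231983) is not quoted in the log, and `is_allowed` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule; the real robots.txt change is not shown in the log.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /multimeter/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


base = "https://phabricator.wikimedia.org"
multimeter_allowed = is_allowed(ASSUMED_ROBOTS_TXT, "Googlebot", base + "/multimeter/")
task_allowed = is_allowed(ASSUMED_ROBOTS_TXT, "Googlebot", base + "/T51467")
print("multimeter allowed:", multimeter_allowed)
print("task page allowed:", task_allowed)
```

This only checks that the rule says what was intended. A crawler re-reads robots.txt on its own schedule, so requests that are already queued can keep arriving for a while.
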
[04:37:06] bblack: definitely I don't think it should be on root
[04:37:07] I guess that's right above heh
[04:37:25] yeah google is indexing stuff it shouldn't be touching
[04:37:39] dynamic pages that probably just confuse the spider
[04:38:06] bblack: /multimeter/*
[04:38:12] +1 to that patch
[04:42:08] I wonder what the sample rate on that is set to
[04:42:20] google's own accesses probably generate additional sampled logs for it to crawl heh
[04:44:42] !log deployed https://gerrit.wikimedia.org/r/#/c/231983/ to iridium and restarted apache
[04:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:44:49] bblack: exactly
[04:45:12] it doesnt listen to the new rules yet
[04:45:18] I'm not sure why upstream didn't include that in the default robots.txt settings
[04:45:38] mutante: I wonder how often they reread robots.txt
[04:45:56] yea, that's what i was wondering. how much is already in a queue
[04:45:59] I could change the security policy on the app
[04:46:54] either way it's not so critical anymore, it can go on a while longer until it'd be full again
[04:46:55] changed from 'public' to 'all users'
[04:47:23] !log changed phabricator policy for the multimeter application from 'public' to 'all users'
[04:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:49:00] for some of its requests it gets 200 and for some 500 as an answer, fwiw
[04:55:10] i still see the requests from Googlebot but we can continue during normal work hours
[04:55:30] tested it - the urls just return a login form now
[04:55:58] ok, good
[04:55:58] so the robot should figure out that it's hit a dead end eventually
[04:56:29] thanks, i'll be out again for now
[04:57:49] :) thanks mutante
[04:58:08] and everyone for pitching in
[05:09:18] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[05:11:39] !log restarted phd to pick up new configuration, (and
to silence the phabricator 'setup issue' warning
[05:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:18:26] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:24:07] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:31:43] (PS1) KartikMistry: Enable article-recommender-1 campaign in ca, en, es, fa, fr, it, sw [mediawiki-config] - https://gerrit.wikimedia.org/r/231984 (https://phabricator.wikimedia.org/T109245)
[05:54:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 17 05:54:24 UTC 2015 (duration 54m 23s)
[05:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:16:32] (CR) Santhosh: [C: 1] Enable article-recommender-1 campaign in ca, en, es, fa, fr, it, sw [mediawiki-config] - https://gerrit.wikimedia.org/r/231984 (https://phabricator.wikimedia.org/T109245) (owner: KartikMistry)
[06:23:14] (PS2) Ori.livneh: Introduce ConfigurationObserver class [debs/pybal] - https://gerrit.wikimedia.org/r/230931
[06:30:57] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:28] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 2 failures
[06:31:36] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 1 failures
[06:31:46] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on db1018 is CRITICAL Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:32:56] RECOVERY - RAID on snapshot1002 is OK no RAID installed
[06:33:37] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:56:08] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:56:16] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on db1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mc2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:08:23] operations, Wikimedia-General-or-Unknown, Availability: Consider using Cassandra/restbase
in place of external store - https://phabricator.wikimedia.org/T100705#1544069 (Joe)
[08:08:55] operations, Wikimedia-General-or-Unknown, Availability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705#1544071 (Joe) Added operations as a tag, as not involving ops in such a debate seemed a bit peculiar, tbh.
[08:23:58] operations, Labs, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1544092 (fgiunchedi)
[08:41:12] (PS1) ArielGlenn: dumps: do abstract dumps stream in smaller pieces [dumps] (ariel) - https://gerrit.wikimedia.org/r/231993
[08:44:43] (CR) ArielGlenn: [C: 2] dumps: do abstract dumps stream in smaller pieces [dumps] (ariel) - https://gerrit.wikimedia.org/r/231993 (owner: ArielGlenn)
[08:55:05] operations, RESTBase-Cassandra, Blocked-on-Services: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1544164 (MoritzMuehlenhoff) a:fgiunchedi Filippo, you'd been doing that, so I'm assigning you the ticket?
[08:57:50] will be out for about an hour or so, beach time
[08:59:22] kaldari: oi?
[09:10:52] operations, RESTBase-Cassandra, Blocked-on-Services: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1544226 (fgiunchedi) @moritzmuehlenhoff sure! I'll be rolling upgrade cassandra machines this week to latest openjdk
[09:17:42] operations, MediaWiki-General-or-Unknown: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1544250 (Joe) So, we think the issue here is that a timeout of 0.5 seconds is way too high and, if a server is completely down (...
[09:36:55] !log upgrade openjdk on restbase100[127] and restart cassandra
[09:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:42:01] operations, Patch-For-Review: Add Ferm rules for snapshot hosts - https://phabricator.wikimedia.org/T104991#1544300 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[09:43:15] !log about to perform schema change on centralauth
[09:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:48:30] operations: Jessie imaging installs nfs-common needlessly - https://phabricator.wikimedia.org/T107412#1544337 (MoritzMuehlenhoff) p:Triage>Normal
[09:51:54] no metadata locking contention that I can see
[09:52:22] but please tell me if you see something wrong with account creation/login/etc. while the process is ongoing
[10:02:37] the recent changes of s7 wikis are getting some lag, but I I think it is work continuing
[10:07:36] (PS1) Jcrespo: Changes in weight on s7 while the schema change is ongoing [mediawiki-config] - https://gerrit.wikimedia.org/r/231998
[10:08:17] (CR) Jcrespo: [C: 2] Changes in weight on s7 while the schema change is ongoing [mediawiki-config] - https://gerrit.wikimedia.org/r/231998 (owner: Jcrespo)
[10:09:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Solving lag issues while schema change is ongoing (duration: 00m 12s)
[10:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:12:13] ^that helped with the watchlist
[10:12:25] and we are halfway though already
[10:13:28] and lag is going down
[10:33:21] if is funny, because labs can keep up with the extra load, but some small production servers cannot
[10:33:49] what're we changing on ca?
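
The weight change on s7 above shifts read traffic away from replicas that lag during the schema change. A rough Python sketch of the idea, assuming the weights act as relative shares of read queries; the host names, the numbers and `pick_replica` are all hypothetical and are not taken from db-eqiad.php.

```python
import random


def pick_replica(weights, rng=random):
    """Pick a host with probability proportional to its weight; weight 0 is never picked."""
    hosts = [host for host, weight in weights.items() if weight > 0]
    if not hosts:
        raise ValueError("no replica with a positive weight")
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


# Hypothetical weights: one lagging replica drained, one throttled, the new server favoured
during_change = {"replica-a": 0, "replica-b": 10, "replica-new": 300}

rng = random.Random(0)  # fixed seed so the demo is repeatable
picks = [pick_replica(during_change, rng) for _ in range(1000)]
print({host: picks.count(host) for host in during_change})
```

Reverting the weights afterwards, as the later revert patch does, restores the normal spread.
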
[10:34:20] let me seach the commit
[10:35:01] Reedy, https://gerrit.wikimedia.org/r/#/c/202344/
[10:35:18] nothing too exciting then :)
[10:36:21] oh, it is a new colum, if was anything else I would put myself in "scary mode"
[10:36:37] but I got 3 or 4 things from this:
[10:36:48] we need to switch to ROW based replication
[10:37:00] it's already got a primary key and everything
[10:37:03] what is this world coming to
[10:37:09] no lag and It would have recontstructed the whole table
[10:37:19] assuring consistency
[10:37:29] 2) we need new servers for production
[10:37:48] we always need new servers :)
[10:37:50] the only server that could keep up, even with almost 100% of the load is the new one
[10:37:51] jynus or akosiaris , could you +2 a very minor regex bug ('+' is missing) - https://gerrit.wikimedia.org/r/#/c/231978/
[10:38:09] let me put it like this, it is better to have all bad servers, than ony 1 good one
[10:38:33] 3) we need to separate centralauth on x1 or a new x2
[10:38:45] for both security and performance
[10:38:53] :)
[10:38:59] Make it so.
[10:39:16] well, it is easier to say that to do it
[10:39:43] haha
[10:40:07] but I have already started to project it (and Sean already wanted to do it, I just reached to the same conculsions independently)
[10:40:11] <_joe_> we could convert it to use cassandra!!!1!
[10:40:19] cassandra is nosql
[10:40:22] * Reedy sends _joe_ to the naughty step
[10:40:24] so no schema changes
[10:40:32] <_joe_> jynus: see?
[10:40:37] you cracked it!
[10:42:48] <_joe_> Reedy: I know my behaviour is not productive and sarcasm will get me nowhere, I consider myself schooled :)
[10:43:01] lolol
[10:43:06] So, Cassandra it is?
[10:43:11] <_joe_> I'll take you as an example to follow
[10:43:13] <_joe_> :D
[10:45:34] 4) let's not discard TokuDB because for some loads we are getting 3x the performance with 7x the load
[10:48:34] about to finish- lag going down
[10:52:15] also the failover process may need some refinements, but those are larger words
[10:55:23] it's done, all servers back to normal but db1034
[11:09:57] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 2 down 1
[11:11:16] (PS1) Mobrovac: RESTBase: Add MobileApps endpoints and back-end config [puppet] - https://gerrit.wikimedia.org/r/232003
[11:12:08] (CR) Mobrovac: [C: -1] "Need to test it first in deployment-prep." [puppet] - https://gerrit.wikimedia.org/r/232003 (owner: Mobrovac)
[11:13:39] ^is phabricator ok?, that is the proxy for phabricator db
[11:14:34] jynus: [12:05:31] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.
[11:14:38] (from another channel)
[11:14:51] jynus: w4m
[11:18:43] so the issue is, the proxy detects the issue and failovers
[11:18:57] but phabricator doesn't use the proxy
[11:19:04] anybody working on modules/mediawiki/manifests/hhvm.pp on deployment-puppetmaster in beta ?
[11:19:12] _joe_: perhaps? ^^
[11:19:22] <_joe_> mobrovac: me sorry yes
[11:19:39] _joe_: kk, lemme know when i can cherry-pick my stuff
[11:19:39] <_joe_> mobrovac: freed now
[11:19:44] k thnx
[11:20:27] oh come on, who put nano for the git editor?
[11:20:42] <_joe_> don't look at me!
[11:21:44] tsc tsc
[11:22:44] <_joe_> I was using vim there :P
[11:23:23] nano was installed at some point recently iirc, does it get selected as default as soon as it's installed?
:o
[11:23:54] Or maybe I'm mixing things with https://gerrit.wikimedia.org/r/228398
[11:24:42] Nemo_bis: nah, that's mw-vagrant
[11:27:07] Nemo_bis: that can be changed by setting the EDITOR= env var
[11:29:49] !log reloading dbproxy1003 haproxy config- it was a temporal max_connections issue; db1043 should be the canonical server again
[11:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:31:36] RECOVERY - haproxy failover on dbproxy1003 is OK check_failover servers up 2 down 0
[11:31:39] (PS1) Jcrespo: Revert "Changes in weight on s7 while the schema change is ongoing" [mediawiki-config] - https://gerrit.wikimedia.org/r/232005
[11:32:10] damn, deployment-restbase0x aren't running master any more
[11:32:11] :/
[11:33:22] (CR) Jcrespo: [C: 2] Revert "Changes in weight on s7 while the schema change is ongoing" [mediawiki-config] - https://gerrit.wikimedia.org/r/232005 (owner: Jcrespo)
[11:42:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reverting change to settings before schema change (no more lag) (duration: 00m 12s)
[11:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:14:09] (CR) Mobrovac: "Verified to work in deployment-prep" [puppet] - https://gerrit.wikimedia.org/r/232003 (owner: Mobrovac)
[12:16:36] (PS2) Mobrovac: RESTBase: Add MobileApps endpoints and back-end config [puppet] - https://gerrit.wikimedia.org/r/232003 (https://phabricator.wikimedia.org/T102130)
[12:27:11] (PS1) KartikMistry: cxserver: Use registry from cxserver repository [puppet] - https://gerrit.wikimedia.org/r/232018 (https://phabricator.wikimedia.org/T103856)
[12:31:18] (CR) BBlack: [C: 2] Fixed RE for maps referrer [puppet] - https://gerrit.wikimedia.org/r/231978 (owner: Yurik)
[12:33:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds
[12:53:48] (PS1)
Alex Monk: Revert dawiki logo back to normal [mediawiki-config] - https://gerrit.wikimedia.org/r/232021 (https://phabricator.wikimedia.org/T102237)
[12:56:19] (PS2) Giuseppe Lavagetto: service: add deployment_script define [puppet] - https://gerrit.wikimedia.org/r/231790
[12:56:21] (PS1) Giuseppe Lavagetto: restbase: add deployment script [puppet] - https://gerrit.wikimedia.org/r/232024
[12:56:55] <_joe_> mobrovac: still completely untested and needs verification ^^ but it should make our life much less painful
[12:56:56] <_joe_> ;)
[12:57:46] (PS3) Giuseppe Lavagetto: puppet_compiler: Create the workdir as well [puppet] - https://gerrit.wikimedia.org/r/231755
[12:59:41] (CR) Giuseppe Lavagetto: [C: 2] puppet_compiler: Create the workdir as well [puppet] - https://gerrit.wikimedia.org/r/231755 (owner: Giuseppe Lavagetto)
[13:01:05] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Remove the working directory at the end of a successful run [software/puppet-compiler] - https://gerrit.wikimedia.org/r/231573 (owner: Giuseppe Lavagetto)
[13:01:54] operations, Database: db2034 crashed - https://phabricator.wikimedia.org/T109282#1544788 (jcrespo) NEW
[13:02:07] (PS2) BBlack: bits-legacy: remove special https://bits redirects for secure wikis [puppet] - https://gerrit.wikimedia.org/r/231778
[13:02:09] (PS2) BBlack: bits-legacy: remove beacon/statsv support [puppet] - https://gerrit.wikimedia.org/r/231777
[13:02:11] (PS1) BBlack: re-order range / purge to align text+mobile [puppet] - https://gerrit.wikimedia.org/r/232027
[13:02:13] (PS1) BBlack: align text+mobile on filter_(headers|noise) in shared code [puppet] - https://gerrit.wikimedia.org/r/232028
[13:03:58] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, Wikimedia-Video: Backport libtheora 1.2.0alpha package to Trusty - https://phabricator.wikimedia.org/T109207#1544801 (fgiunchedi) @brion looks good, I've tweaked the source a bit and
uploaded to carbon (removed non dfsg docs like the d...
[13:07:33] operations, Patch-For-Review: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1544813 (Joe) Open>Resolved
[13:07:34] operations, Patch-For-Review: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1544814 (Joe)
[13:11:11] operations, Labs, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1544831 (Krenair)
[13:12:45] operations, Patch-For-Review: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1544838 (Joe) The new puppet compiler is running on jessie, so it should help!
[13:16:33] (PS1) Ottomata: Add ~/bin to otto's $PATH [puppet] - https://gerrit.wikimedia.org/r/232031
[13:18:05] no phabricator updates on IRC, BTW
[13:19:25] <_joe_> jynus: what do you mean?
[13:19:32] <_joe_> I see wikibugs
[13:20:26] not all, apparently
[13:20:47] or not in the last seconds
[13:21:35] operations, Wikimedia-Mailing-lists: Rename Advocacy_Advisors@ to publicpolicy@ - https://phabricator.wikimedia.org/T109142#1544883 (JohnLewis) p:Triage>Normal
[13:23:36] jynus: which task? The last thing you did before you said that was to a task that shouldn't be reported in here
[13:23:52] As AFAIK, the database tag doesn't trigger things here :)
[13:24:12] but the operations should
[13:24:22] oh, my fault
[13:24:26] it does not have one
[13:24:28] sorry
[13:24:52] JohnFLewis, does phab work for you now continously?
[13:24:52] operations, Phabricator, Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1544898 (JohnLewis)
[13:25:16] jynus: yeah, no issues at the minute and just added operations to it for you :)
[13:25:45] ok ,then, I will take a closer look when I come back
[13:26:07] I let the rest of ops know the workaround, just in ase
[13:27:22] <_joe_> I think something funny is going on iridium
[13:29:14] (CR) Mobrovac: [C: -1] "This looks like a promising move. I wonder, however, if tagging would actually skip a normal Puppet run for configs? AFAIK, no, which mean" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231790 (owner: Giuseppe Lavagetto)
[13:30:09] operations, Phabricator, Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1544912 (Joe) syslog is pretty dense in messages like ``` Aug 17 13:15:53 iridium kernel: [34147494.926567] do_IRQ: 12.188 No irq handler for vector (irq -1) ``` which seem relate...
[13:37:48] operations, Traffic, Zero: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1544965 (BBlack) NEW a:BBlack
[13:46:41] operations, Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1545116 (BBlack) This is highly-interrelated with T109286 as well, but I'm not sure that either really blocks the other. They'll probably happen in concert slowly over time.
[13:47:19] operations, Traffic, Discovery-Maps-Sprint: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1545136 (BBlack) For (3) above, see also T109286
[13:50:40] (PS3) Giuseppe Lavagetto: RESTBase: Add MobileApps endpoints and back-end config [puppet] - https://gerrit.wikimedia.org/r/232003 (https://phabricator.wikimedia.org/T102130) (owner: Mobrovac)
[13:52:00] (CR) Giuseppe Lavagetto: [C: 2] RESTBase: Add MobileApps endpoints and back-end config [puppet] - https://gerrit.wikimedia.org/r/232003 (https://phabricator.wikimedia.org/T102130) (owner: Mobrovac)
[13:52:08] operations, Phabricator, Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545188 (Aklapper) > According to a user, it only affects one http request and it works again after reloading. That was me trying to access a pretty small workboard in Phab. > It...
[13:55:47] PROBLEM - puppet last run on cerium is CRITICAL Puppet has 1 failures
[13:56:08] PROBLEM - puppet last run on xenon is CRITICAL Puppet has 1 failures
[13:56:16] PROBLEM - puppet last run on praseodymium is CRITICAL Puppet has 1 failures
[13:58:13] !log restbase deploying ed17952 on staging
[13:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:00:17] RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[14:03:46] RECOVERY - puppet last run on cerium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:04:07] RECOVERY - puppet last run on xenon is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[14:07:47] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[14:12:18] !log restbase deployed ed17952 on restbase1001
[14:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:18:06] (PS2)
Ottomata: Add ~/bin to otto's $PATH [puppet] - https://gerrit.wikimedia.org/r/232031
[14:18:11] (CR) Ottomata: [C: 2 V: 2] Add ~/bin to otto's $PATH [puppet] - https://gerrit.wikimedia.org/r/232031 (owner: Ottomata)
[14:30:43] !log restbase updated production cluster to ed17952
[14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:43:50] operations, Mobile-Apps, Services, Mobile-Content-Service, Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1545427 (mobrovac) The MobileApps service is now live and kicking in production! Play with it [here](https://en.wikipedia.or...
[14:46:23] operations, Phabricator, Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545446 (chasemp) @jcrespo, does fairly regularly mean once a day or more often? This is one of the particulars of phab, it uses a lot of db connections. The current max_connectio...
[14:49:41] operations, Phabricator, Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545484 (jcrespo) @chasemp It is happening right now every time I do not kill idle connections (which I am activelly doing now). Coordinate with me to test stopping doing that.
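
The workaround described above (killing idle connections before they exhaust max_connections) can be sketched as a filter over rows shaped like MySQL's SHOW PROCESSLIST output. This is an illustrative Python sketch only: the sample rows and the 300 second threshold are made up, `idle_connection_kills` is a hypothetical helper, and how the connections were actually killed is not stated in the log.

```python
def idle_connection_kills(processlist, user, max_idle_seconds):
    """Return KILL statements for connections of `user` that have slept too long."""
    statements = []
    for row in processlist:
        is_idle = row["Command"] == "Sleep" and row["Time"] > max_idle_seconds
        if row["User"] == user and is_idle:
            statements.append(f"KILL {row['Id']};")
    return statements


# Made-up rows using the Id, User, Command and Time columns of SHOW PROCESSLIST
sample = [
    {"Id": 101, "User": "phuser", "Command": "Sleep", "Time": 900},
    {"Id": 102, "User": "phuser", "Command": "Query", "Time": 2},
    {"Id": 103, "User": "phuser", "Command": "Sleep", "Time": 5},
    {"Id": 104, "User": "other", "Command": "Sleep", "Time": 4000},
]

kills = idle_connection_kills(sample, "phuser", 300)
print("\n".join(kills))
```

Only the long-sleeping connection of the targeted user is selected; active queries and other users' connections are left alone.
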
[14:49:48] (03PS2) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [14:50:00] (03CR) 10Filippo Giunchedi: cassandra: WIP support for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [14:54:35] (03CR) 10Southparkfan: os_version: Add Ubuntu Vivid & Wily, Debian Stretch & Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231972 (owner: 10Ori.livneh) [14:55:12] 6operations, 6Phabricator, 7Database: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545507 (10jcrespo) To clarify- Those mysql spike are creatin downtime on phabricator, with user complains, not mysql problems (we do not care about what s3-master mysql suffers). Th... [14:56:03] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1545510 (10Dbrant) *faint* [14:56:34] (03CR) 10Giuseppe Lavagetto: "ran into an issue while testing, see comment in the code." 
(031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 (owner: 10Ori.livneh) [14:56:51] (03PS2) 10BBlack: re-order range / purge to align text+mobile [puppet] - 10https://gerrit.wikimedia.org/r/232027 (https://phabricator.wikimedia.org/T109286) [14:56:53] (03PS2) 10BBlack: align text+mobile on filter_(headers|noise) in shared code [puppet] - 10https://gerrit.wikimedia.org/r/232028 (https://phabricator.wikimedia.org/T109286) [14:56:55] (03PS3) 10BBlack: bits-legacy: remove special https://bits redirects for secure wikis [puppet] - 10https://gerrit.wikimedia.org/r/231778 (https://phabricator.wikimedia.org/T95448) [14:56:57] (03PS3) 10BBlack: bits-legacy: remove beacon/statsv support [puppet] - 10https://gerrit.wikimedia.org/r/231777 (https://phabricator.wikimedia.org/T95448) [14:58:18] (03CR) 10BBlack: [C: 04-1] "We need to come to a conscious decision about traffic sec implications and whether they block this and similar turn-ups or not first..." [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [14:58:50] (03PS7) 10Thcipriani: Add service deploy via scap [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150817T1500). [15:00:04] kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:17] Yes. jouncebot Sir [15:00:22] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1545535 (10mobrovac) [15:00:24] hey [15:00:31] I have a couple of things I forgot to schedule [15:00:32] Krenair: SWAT'ng? [15:00:41] yeah [15:01:57] kart_, newarticle is both true and false on swwiki? 
[15:02:21] (03CR) 10Alex Monk: [C: 04-1] Enable article-recommender-1 campaign in ca, en, es, fa, fr, it, sw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231984 (https://phabricator.wikimedia.org/T109245) (owner: 10KartikMistry) [15:02:24] Krenair: that's bad. [15:02:42] (03CR) 10Alex Monk: [C: 032] Revert dawiki logo back to normal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232021 (https://phabricator.wikimedia.org/T102237) (owner: 10Alex Monk) [15:02:48] (03Merged) 10jenkins-bot: Revert dawiki logo back to normal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232021 (https://phabricator.wikimedia.org/T102237) (owner: 10Alex Monk) [15:03:32] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232021/ (duration: 00m 12s) [15:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:43] (03PS2) 10KartikMistry: Enable article-recommender-1 campaign in ca, en, es, fa, fr, it, sw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231984 (https://phabricator.wikimedia.org/T109245) [15:05:02] Krenair: fixed. [15:05:04] ok [15:05:28] I have a feeling there might be a better way to do this config, but let's think about that later [15:05:31] !log rebooting labvirt1004 [15:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:13] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/231984/ (duration: 00m 13s) [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:43] kart_, ^ please test [15:08:13] Krenair: Thanks. 
[15:08:26] PROBLEM - Host labvirt1004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:12:37] RECOVERY - Host labvirt1004 is UP: PING OK - Packet loss = 0%, RTA = 2.58 ms
[15:13:20] ori, for https://gerrit.wikimedia.org/r/#/c/231982/1/multiversion/MWMultiVersion.php do you think we should just replace it with a "Usage: mwscript script.php --wiki=dbname" line?
[15:19:35] operations, Traffic, Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1545623 (EWilfong_WMF) As the domain for the Domain Key record has changed, we need to use a new DNS value. Please update to the following: "k=rsa;...
[15:23:30] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545641 (chasemp) Ok in looking at the traffic I don't see any load from our overnight jobs that run. It was noted by @mmodell that we had a lot of overnight l...
[15:26:30] Krenair: yes
[15:27:41] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.
[15:27:59] oup, saw the bug now
[15:29:58] someone please +2 https://gerrit.wikimedia.org/r/#/c/232049/1
[15:30:17] phabricator is currently in bad shape
[15:31:30] so part of the issue is that googlebot (or something claiming to be googlebot) is crawling rather quickly and seems to ignore the crawldelay robots.txt setting
[15:31:36] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 2 down 1
[15:32:07] we could look at temporarily blocking it in the misc varnish cluster if it's easy to identify
[15:32:30] twentyafterfour: jaime can push it through I believe
[15:33:39] twentyafterfour: is it still to /milimetric/ ?
[15:34:47] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:34:52] !log krenair@tin Synchronized php-1.26wmf18/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: https://gerrit.wikimedia.org/r/#/c/232048/ (duration: 00m 11s) [15:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:15] edsanders, James_F ^ [15:36:52] so, phab HTML for pastes all has these kinds of links in it: [15:36:54] [15:37:05] to relink to the same document with syntax highlighting, per-line [15:37:10] that's a nice increase in crawl traffic too :P [15:37:30] that's most of the mass hits I'm seeing right now [15:38:17] I think it is legit crawling, as I see both MS and Google hitting them. still looking though [15:39:39] yeah I see big pages with lots of subtasks getting hit hard but it doesn't seem....illegit [15:40:40] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/#/c/231982/2 - clear up --wiki usage to mwscript (duration: 00m 12s) [15:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:54] where is the robots.txt for phab at? [15:41:49] (configured inside phab maybe?) [15:41:56] I don't see it in the webroot for it [15:43:16] anyways, I can just blanket-block all of the google/bing-bot hits with a 403 for now, until we sort out something better [15:43:19] yes? no? [15:43:35] bblack: I think [15:43:36] src/applications/system/controller/PhabricatorRobotsController.php [15:44:31] !log krenair@tin Synchronized wmf-config/extension-list: https://gerrit.wikimedia.org/r/#/c/232051/ - remove WikiGrok from extension-list, extension is no longer deployed (duration: 00m 11s) [15:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:42] bblack: is it indexing disallowed paths? 
I think we are ok atm
[15:45:31] there is no static robots.txt
[15:45:47] it's in PhabricatorRobotsController
[15:46:01] chasemp: I think the question is how often bots even update robots.txt
[15:46:32] but right now, I don't see them indexing disallowed things. still lots of spam on allowed things that seems silly though, like those per-line links in every paste
[15:46:35] ...yes, that I do not know
[15:46:52] also, these: https://phab.wmfusercontent.org/file/data/kquiqbqnhrowz7zjrvbi/PHID-FILE-op2fqoiqaidrtkwt4btb/kyffmiyl7oybdedh/Proposed_patch_v1
[15:47:03] which seem to be derivatives of diffusion patches?
[15:47:06] the bot traffic is pretty much legit, and iridium isn't overloaded, I think increasing the db connection limit is the right solution
[15:47:23] so, that should be it
[15:47:31] sorry for the confusion of the template
[15:47:37] give it another look
[15:49:01] bblack: that 'proposed_patch_v1' file was most likely just a maniphest attachment that someone uploaded
[15:49:09] bblack: I think that's someone pasting a legit text file, and that diffusion things are not handled that way
[15:49:11] yes that
[15:49:47] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[15:49:56] bblack: for robots.txt maybe...
[15:49:57] ->setCacheDurationInSeconds(phutil_units('2 hours in seconds'));
[15:50:01] so 2 hours?
[15:50:01] operations, Database: Replicate flowdb from X1 to stat1003 - https://phabricator.wikimedia.org/T75047#1545788 (Krenair)
[15:50:07] assuming they honor it
[15:50:48] we could also increase the crawl delay maybe
[15:55:23] operations, Traffic, HTTPS: Inbound TLS for tier-1 varnish backend caches - https://phabricator.wikimedia.org/T109321#1545817 (BBlack) NEW
[15:55:38] what happened to operations/puppet/cassandra? deleted from gerrit?
[15:55:52] because phabricator is still trying to poll that repo [15:56:44] <_joe_> grrrit-wm is not here [15:57:09] yeah, tools was having issues [15:57:33] looks better now though, restarting it [15:58:52] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1545849 (10chasemp) Talking with @mmodell about disallowing /file/ for indexing as well. It should be useless (more or less) but i think bots are still trying and... [16:01:28] 6operations, 7Database: Replicate flowdb from X1 to stat1003 - https://phabricator.wikimedia.org/T75047#1545862 (10jcrespo) This is not a trivial task. I would like to put some order on all the production and analytics db boxes so that we can provide you a **quality service**. There is some incoming hardware... [16:01:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:01:49] 6operations, 6Labs, 3Labs-Sprint-107, 3Labs-Sprint-108, and 3 others: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1545865 (10Andrew) [16:02:14] (03CR) 10Jcrespo: [C: 032] Increasing max_connections variable on phabricator dbs [puppet] - 10https://gerrit.wikimedia.org/r/232056 (https://phabricator.wikimedia.org/T109279) (owner: 10Jcrespo) [16:02:28] (03PS2) 10Jcrespo: Revert "Increasing max_connections variables on all misc servers" [puppet] - 10https://gerrit.wikimedia.org/r/232053 [16:02:35] (03CR) 10Jcrespo: [C: 032] Revert "Increasing max_connections variables on all misc servers" [puppet] - 10https://gerrit.wikimedia.org/r/232053 (owner: 10Jcrespo) [16:02:55] twentyafterfour: it has been merged into operations/puppet [16:05:15] now the patch is working [16:10:11] 6operations, 10Traffic, 7HTTPS: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#1545933 (10BBlack) 3NEW [16:10:42] and that should fix the proxy [16:10:47] PROBLEM - Host 
cr1-eqord is DOWN: CRITICAL - Network Unreachable (208.80.154.198) [16:10:48] RECOVERY - haproxy failover on dbproxy1003 is OK check_failover servers up 2 down 0 [16:13:02] 6operations, 10Traffic, 7HTTPS: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#1545974 (10BBlack) [16:20:57] https://gerrit.wikimedia.org/r/#/c/232061/ [16:21:41] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1546025 (10fgiunchedi) >>! In T95253#1539380, @mobrovac wrote: >>>! In T95253#1539350, @fgiunchedi wrote: >> I'm not yet sure how we should approa... [16:24:16] PROBLEM - Host cr1-eqdfw is DOWN: PING CRITICAL - Packet loss = 100% [16:24:28] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1546040 (10Joe) @fgiunchedi hiera autolookup obviously doesn't work for defines, only for classes - so you'd need to define explicitly all the ins... 
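The robots.txt points raised between 15:31 and 15:58 (a crawl delay that bots may not honor, and a possible Disallow for /file/) can be sketched with Python's standard urllib.robotparser. This is only an illustration: the robots.txt text and the example.org host below are assumptions, not what PhabricatorRobotsController actually emits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not Phabricator's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /file/
"""

def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)

# Delay (in seconds) a compliant crawler would wait between requests.
delay = rules.crawl_delay("Googlebot")

# Paths under /file/ are disallowed, ordinary task pages are not.
file_allowed = rules.can_fetch("Googlebot", "https://phab.example.org/file/data/abc")
task_allowed = rules.can_fetch("Googlebot", "https://phab.example.org/T109279")

print(delay, file_allowed, task_allowed)
```

A parser like this only describes what a compliant crawler would do. Whether a given bot honors Crawl-delay, and how soon it refetches robots.txt after a change, is up to the bot, which is the uncertainty voiced in the channel.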
[16:25:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/0: down - Core: cr1-eqdfw:xe-0/0/0 CyrusOne {#?} [10Gbps DWDM]
[16:25:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL host 208.80.153.193, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/0: down - Core: cr1-eqdfw:xe-1/0/0 CyrusOne {#?} [10Gbps DWDM]
[16:26:55] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 3 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1546058 (chasemp) NEW a:Andrew
[16:27:59] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1546071 (Andrew)
[16:35:15] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1546129 (Andrew)
[16:51:44] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1546322 (Andrew)
[17:20:52] operations, ops-eqiad: Prepare shipping label for mx80 to eqord - https://phabricator.wikimedia.org/T109338#1546506 (Cmjohnson) NEW a:Cmjohnson
[17:33:41] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1546568 (BBlack) As a counterpoint to that final paragraph about unifying the mobile+desktop IPs - it's not a checklist item because I'm not really sure about that part on many...
[17:37:09] (PS1) John F.
Lewis: mailman: move exim outbound ip config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/232080 [17:38:22] 6operations, 7Database: Replicate flowdb from X1 to stat1003 - https://phabricator.wikimedia.org/T75047#1546581 (10Milimetric) @jcrespo, that's ok with me, and this task has waited a while already, if there's a good way forward, it seems worthwhile. Let us know if you'd like to plan the reorganization together. [17:45:01] (03CR) 10John F. Lewis: "Ran through jenkins to check if anything changed; it failed due to the role using the secret() function for arbcom-l archives :(" [puppet] - 10https://gerrit.wikimedia.org/r/232080 (owner: 10John F. Lewis) [17:48:13] 6operations, 7Database: Replicate flowdb from X1 to stat1003 - https://phabricator.wikimedia.org/T75047#1546590 (10jcrespo) @Millimetric It would be great to meet at some point with several of you. I have also some ideas to improve the service, but I need some feedback if they are worth investing some time. [17:57:01] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1546635 (10Cmjohnson) 5Open>3Resolved This has been completed [17:57:41] 6operations, 10ops-eqiad: apply wdqs100x labels to wmf3543/44 - https://phabricator.wikimedia.org/T108367#1546638 (10Cmjohnson) 5Open>3Resolved Completed [18:12:08] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [18:17:56] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.003 second response time [18:24:05] Why can I connect to db2034 (random mostly unused s1 slave) without the .codfw.wmnet from tin but not terbium? 
[18:24:48] the base::resolving::domain_search in hieradata/hosts/tin.yaml and terbium.yaml looks suspect [18:28:39] (03PS1) 10Alex Monk: Copy rest of tin's domain_search to terbium [puppet] - 10https://gerrit.wikimedia.org/r/232087 [18:41:19] (03PS4) 10Ottomata: Make jq available on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/223974 (owner: 10EBernhardson) [18:44:23] (03CR) 10Ottomata: [C: 032] Make jq available on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/223974 (owner: 10EBernhardson) [18:48:56] (03PS3) 10Ottomata: Rsync cirrus user testing logs to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/230113 (owner: 10EBernhardson) [18:55:59] (03CR) 10Ottomata: [C: 032] Rsync cirrus user testing logs to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/230113 (owner: 10EBernhardson) [18:56:09] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1546867 (10GWicke) It would probably be sufficient to have one seed instance per physical node only, and let the client discover the other instanc... [18:56:22] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1546869 (10Yurik) Please check with @dfoy - there are some promises we made to our Zero partners with regards to the IP allocations. [18:57:54] what is Product duty? 
[18:58:44] mutante: basically the guy you go to if you want to ask a question about a product was what I got when James_F made it like a year ago [18:59:09] JohnFLewis: alright,thx [19:01:58] (03PS1) 10Ottomata: Don't use partman for analytics kafka jessie reinstall, do this part manually [puppet] - 10https://gerrit.wikimedia.org/r/232097 (https://phabricator.wikimedia.org/T106581) [19:02:46] (03PS1) 10Ottomata: Rename analytics1012 to kafka1012 [dns] - 10https://gerrit.wikimedia.org/r/232098 (https://phabricator.wikimedia.org/T106581) [19:03:47] (03PS2) 10Ottomata: Don't use partman for analytics kafka jessie reinstall, do this part manually [puppet] - 10https://gerrit.wikimedia.org/r/232097 (https://phabricator.wikimedia.org/T106581) [19:03:59] (03CR) 10Ottomata: [C: 032 V: 032] Don't use partman for analytics kafka jessie reinstall, do this part manually [puppet] - 10https://gerrit.wikimedia.org/r/232097 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [19:11:05] (03PS2) 10Dzahn: Use meta.wikimedia.org/api/rest_v1 instead of parsoid-lb in the sysadmin table script [puppet] - 10https://gerrit.wikimedia.org/r/231890 (owner: 10Alex Monk) [19:11:30] Yeah, I guess it's either me or Deskana|Away or JonK as 'lead' whatnot. 
[19:11:51] (03PS1) 10Ottomata: Rename analytics1012 to kafka1012, site.pp puppetization coming in separate commit [puppet] - 10https://gerrit.wikimedia.org/r/232136 (https://phabricator.wikimedia.org/T106581) [19:12:04] (03PS2) 10Ottomata: Rename analytics1012 to kafka1012, site.pp puppetization coming in separate commit [puppet] - 10https://gerrit.wikimedia.org/r/232136 (https://phabricator.wikimedia.org/T106581) [19:12:10] 10Ops-Access-Requests, 6operations: Grant ebernhardson access to stat1002 to query hive - https://phabricator.wikimedia.org/T109356#1546932 (10EBernhardson) 3NEW [19:12:43] 10Ops-Access-Requests, 6operations: Grant SMalyshev access to stat1002 to query hive - https://phabricator.wikimedia.org/T109357#1546942 (10EBernhardson) 3NEW [19:12:58] 10Ops-Access-Requests, 6operations: Grant ebernhardson access to stat1002 to query hive - https://phabricator.wikimedia.org/T109356#1546949 (10EBernhardson) [19:13:52] (03CR) 10Ottomata: [C: 032] Rename analytics1012 to kafka1012, site.pp puppetization coming in separate commit [puppet] - 10https://gerrit.wikimedia.org/r/232136 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [19:14:04] maybe stat1002 access needs to become default or something, we add like 2 or 3 people per week it feels [19:14:51] default for all users? 
[19:15:37] (03PS1) 10Ottomata: Don't use paren in regex in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/232175 [19:15:38] haha [19:16:15] (03PS2) 10Ottomata: Don't use paren in regex in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/232175 [19:16:35] (03CR) 10Ottomata: [C: 032 V: 032] Don't use paren in regex in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/232175 (owner: 10Ottomata) [19:16:47] (03PS1) 10Jcrespo: Redact new column on centralauth gu_cas_token [software/redactatron] - 10https://gerrit.wikimedia.org/r/232176 [19:18:02] Krenair: hard to tell but somehow many people seem to need it for ...stuf [19:18:51] Is that the host which provides all users with access to eventlogging data? [19:19:03] Yes, among other things [19:19:12] You get read-only access to all the databases and all the EL data [19:19:24] !log stopping kafka on analytics1012, preparing to reinstall with Jessie and rename to kafka1012 [19:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:34] "statistics-privatedata-users group" [19:21:40] 10Ops-Access-Requests, 6operations: Grant SMalyshev access to stat1002 to query hive - https://phabricator.wikimedia.org/T109357#1546988 (10Smalyshev) [19:23:07] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1546989 (10mmodell) Things seem to be more stable now... 
[19:23:42] (03CR) 10Ottomata: [C: 032] Rename analytics1012 to kafka1012 [dns] - 10https://gerrit.wikimedia.org/r/232098 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [19:23:53] mutante, so that sounds like a bad idea to grant to all users with some sort of production shell [19:25:08] PROBLEM - puppet last run on stat1002 is CRITICAL puppet fail [19:26:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [19:41:12] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1547094 (10jcrespo) We had a spike of nearly or just over 500 simultaneous connections at 6:14. This time it didn't fail because of the 1500 limit, but we should... [19:41:26] (03CR) 10GWicke: [C: 031] "Would this already work with the current setup of sysvinit scripts executed via systemd?" [puppet] - 10https://gerrit.wikimedia.org/r/230066 (https://phabricator.wikimedia.org/T108306) (owner: 10Filippo Giunchedi) [19:43:34] (03CR) 10GWicke: [C: 031] cassandra: check for heap dumps [puppet] - 10https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: 10Filippo Giunchedi) [19:43:41] (03PS3) 10Dzahn: Use meta.wikimedia.org/api/rest_v1 instead of parsoid-lb in the sysadmin table script [puppet] - 10https://gerrit.wikimedia.org/r/231890 (owner: 10Alex Monk) [19:43:53] (03CR) 10GWicke: "@Filippo, can this be merged?" 
[puppet] - https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: Filippo Giunchedi)
[19:45:35] (CR) Dzahn: [C: 2] Use meta.wikimedia.org/api/rest_v1 instead of parsoid-lb in the sysadmin table script [puppet] - https://gerrit.wikimedia.org/r/231890 (owner: Alex Monk)
[19:48:28] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1547138 (chasemp) Open>Resolved a:chasemp The bot traffic has dropped substantially so we are now sitting more comfortably and also in theory can susta...
[19:54:19] yargh RobH, i think I asked you this another week and I have forgotten the answer. I am renaming a host, same IP.
[19:54:55] i had thought that hostnames were assigned by dhcp via the linux-host entries stuff, so that the node will get the proper hostname on install
[19:55:05] but, is it also related to dns resolution?
[19:55:09] you also have to do a bunch of other things
[19:55:19] you are reinstalling or in place change?
[19:55:23] reinstalling
[19:55:24] the former is easier if you have no data.
[19:55:36] then just pretty much 'decommission' that hostname in terms of removing from everything
[19:55:40] and putting the new hostname in dns
[19:55:43] this includes mgmt dns
[19:55:46] i think i did that, double checking
[19:55:53] also you have to create an onsite task for the physical label to be changed
[19:56:01] oh! ok.
[19:56:03] that is the part folks forget
[19:56:10] and it makes it impossible to keep a spares list
[19:56:12] ;]
[19:56:22] (also impossible to keep a real list period ;)
[19:56:44] you'll wanna do lifecycle in terms of removing the old hostname in salt keys, puppetstoreddb, etc... but the onsite task is the one folks forget =]
[19:56:56] yeah, did all that
[19:56:56] hm.
[19:57:05] i did all the steps, you can see them here:
[19:57:36] your earlier question was does dhcp get it from dns, answer yep
[19:57:40] https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration2
[19:57:43] step 3
[19:57:44] hm.
[19:57:46] you have to update both the install-server module stuff and then the dns as well
[19:57:47] so
[19:58:14] did this
[19:58:14] https://github.com/wikimedia/operations-puppet/blob/ad7c8803b2989294e815d687f457533d697e5a36/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200#L2311
[19:58:24] that used to be analytics1012
[19:58:32] it picked up jessie installer just fine
[19:58:38] but after install the hostname is still analytics1012
[19:58:42] i just added the part where you change the hostname
[19:58:48] hah thanks
[19:58:48] :)
[19:59:08] Did you update dns to have kafka1012.eqiad.wmnet?
[19:59:13] yes
[19:59:21] did authdns-update
[19:59:32] kafka1012.eqiad.wmnet. 3600 IN A 10.64.5.12
[19:59:42] reverse still points at analytics1012 though
[19:59:44] cached i assume
[19:59:50] 12.5.64.10.in-addr.arpa. 1207 IN PTR analytics1012.eqiad.wmnet.
[20:00:05] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150817T2000). Please do the needful.
[20:00:10] dns change here https://gerrit.wikimedia.org/r/#/c/232098/
[20:00:39] ottomata: so you answered your own question, the reverse is what sets hostname in the installer =]
[20:00:48] you have to kill the old dns names, which you can do on the recursors
[20:00:49] ah, so i just have to wait 1h?
[20:00:53] oh.
[20:00:56] (i assume you wanted a full answer not a 'lemme fix that' ;)
[20:01:02] so lemme see which one is recursor...
[20:01:33] chromium
[20:01:34] rec_control wipe-cache hostname
[20:01:34]
[20:01:34] right!
[20:01:35] ah
[20:01:36] yep
[20:01:38] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[20:01:40] i am remembering...
[20:01:58] chromium|hydrogen
[20:02:01] do that on those two
[20:02:03] k
[20:02:09] and try to do the reverse lookup again
[20:02:17] (via an eqiad host preferably)
[20:02:27] unless you wanna run it on all the recursors, but you shouldn't have to
[20:02:29] it's missing the "WMF1234".mgmt entry. that would always stay the same for a given hardware even when renamed
[20:02:32] the alternative was yes, wait 1h =]
[20:02:39] mutante: good catch
[20:02:41] (just a comment on the side, not causing an issue)
[20:02:43] it should have an asset tag entry
[20:02:52] make sure it's not elsewhere in the file and if not, please add!
[20:03:05] ottomata: all systems should have both a hostname.mgmt and an assettag.mgmt dns entry for mgmt
[20:03:09] oh hm.
[20:03:12] ok.
[20:03:13] hm.
[20:03:15] so if it's missing, fix pls =]
[20:03:23] the file is full of examples
[20:03:30] robh, cool, will do as a cleanup after this
[20:03:32] 12.5.64.10.in-addr.arpa. 983 IN PTR analytics1012.eqiad.wmnet.
[20:03:35] still resolves
[20:04:37] hrmm
[20:04:45] ottomata: from where?
[20:04:47] iron?
[20:04:54] recall each site hits its own recursors
[20:04:56] tried a bunch, palladium, stat1002
[20:05:21] the next thing i would try is doing the rec control on all the recursors
[20:05:23] did this on both chromium and hydrogen
[20:05:24] sudo rec_control wipe-cache analytics1012.eqiad.wmnet
[20:05:25] deploying new version of parsoid
[20:05:30] or make sure your dig hits one of those
[20:05:34] and see what it says
[20:06:27] hmmm, from iron, dig @hydrogen.wikimedia.org -x 10.64.5.12
[20:06:28] still an12
[20:06:35] running rec_control wipe again
[20:06:49] nope, still same
[20:07:22] hrmm, finishing lunch (i ran down street for banh mi)
[20:07:31] lemme finish scarfing this down and then i'll chase it down =]
[20:07:36] maerlant.wikimedia.org ?
[20:07:44] ah these are not eqiad
[20:07:45] yep, on all recursors
[20:07:51] ok trying that...
[20:07:51] so everything that is in site.pp with include role::dnsrecursor
[20:07:58] nah, will have them in esams and then codfw as well
[20:08:02] but i'm not sure that'll be the answer
[20:08:06] i need to dig into it
[20:10:00] in puppet i need to place a one line file in /etc/elasticsearch/scripts/mwgrep.groovy on each of the elasticsearch servers. Is there any way to place this file from the scap role where mwgrep is provisioned? I can do it in the elasticsearch module but seems wrong
[20:10:14] s/scap role/scap module/
[20:10:22] yeah, hm, just did it on 6 nodes: chromium, hydrogen, maerlant, nescio, acamar, achernar
[20:10:28] same
[20:10:46] ebernhardson: why wrong in the elasticsearch module when it's in /etc/elasticsearch ?
[20:10:47] odd
[20:10:50] ebernhardson: probably the elasticsearch node
[20:10:52] sorry
[20:10:55] elasticsearch role
[20:10:55] yea gimme a bit and i'll dig in =]
[20:11:10] k thanks robh
[20:11:29] this is sorta urgent, as we are not wiping kafka data, and the longer the broker is down, the longer it's going to take to catch back up
[20:11:40] mutante: well, it's not for running elasticsearch.
it's something one of the users of elasticsearch wants to put in place. the script most logically is owned by scap
[20:11:48] it's not like super emergency urgent
[20:11:50] system is fine
[20:11:58] but i'll put it in the elasticsearch side, easy enough
[20:12:09] role, ja
[20:13:58] ebernhardson: hmm. yea. maybe best to upload a patch and then we'll look at it on gerrit
[20:14:50] robh, it resolves correctly on ns0.wikimedia.org
[20:15:00] dig @ns0.wikimedia.org -x 10.64.5.12
[20:15:04] 12.5.64.10.in-addr.arpa. 3600 IN PTR kafka1012.eqiad.wmnet.
[20:16:12] !log deployed parsoid version 4b656b72
[20:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:17:52] operations, RESTBase, RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1547224 (Eevans)
[20:18:56] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1547237 (Andrew)
[20:19:24] ottomata: ok, food done you have my undivided attention
[20:19:30] let's see....
[20:19:58] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[20:20:59] :)
[20:21:39] do you have your gerrit patchset handy for your dns changes?
[20:21:44] easier than me having to git log
[20:23:06] operations, RESTBase, RESTBase-Cassandra: Cassandra internode TLS encryption - https://phabricator.wikimedia.org/T108953#1547261 (Eevans)
[20:23:47] yes
[20:23:56] robh https://gerrit.wikimedia.org/r/#/c/232098/1
[20:24:14] haha, robh, i think it is resolving now
[20:24:23] the 1h window elapsed =]
[20:24:26] :)
[20:24:28] probably so
[20:24:30] ok, reinstalling
[20:24:32] heh
[20:24:43] well.. you're welcome for helping you not at all =] sorry
[20:24:44] heh
[20:24:44] can I just reboot?
[20:24:46] or do I have to reinstall?
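The reverse-lookup exchange above (19:59 to 20:24) turns on two things visible in the pasted dig answers: the PTR record lives under the in-addr.arpa name rather than the forward hostname, and a TTL below the zone's 3600 suggests a cached answer counting down. A minimal Python sketch of reading those answer lines follows; the helper names are hypothetical, and the three answer strings are the ones quoted in the channel.

```python
import ipaddress

def reverse_name(ip):
    """Return the fully qualified name that a PTR lookup for `ip` queries."""
    return ipaddress.ip_address(ip).reverse_pointer + "."

def parse_answer(line):
    """Split one dig-style answer line into (name, ttl, rtype, target)."""
    name, ttl, _cls, rtype, target = line.split()
    return name, int(ttl), rtype, target

# Answer lines as pasted in the channel.
cached_1 = parse_answer("12.5.64.10.in-addr.arpa. 1207 IN PTR analytics1012.eqiad.wmnet.")
cached_2 = parse_answer("12.5.64.10.in-addr.arpa. 983 IN PTR analytics1012.eqiad.wmnet.")
authoritative = parse_answer("12.5.64.10.in-addr.arpa. 3600 IN PTR kafka1012.eqiad.wmnet.")

ptr_name = reverse_name("10.64.5.12")

# Seconds elapsed between the two cached answers, judging by the TTL countdown.
elapsed = cached_1[1] - cached_2[1]

# The stale record is the one whose target differs from the authoritative answer.
stale = cached_2[3] != authoritative[3]

print(ptr_name, elapsed, stale)
```

One possible reading, offered as an assumption rather than a diagnosis: the wipe shown at 20:05:24 names analytics1012.eqiad.wmnet, whereas the cached record is keyed by the reverse name computed above, which might be why the old PTR answer persisted until its TTL ran out.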
[20:24:54] hostname is set at install [20:24:58] easier to reinstall [20:25:03] than try to fix mid-install if there is no data [20:25:13] cuz all your puppet initial call ins are wrong and such [20:25:29] k [20:25:44] (PS1) EBernhardson: Fix mwgrep to work without dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) [20:26:30] thanks for helping time pass robh! :p :) [20:26:35] heheh [20:26:43] quite welcome [20:26:46] anytime! [20:26:54] (CR) jenkins-bot: [V: -1] Fix mwgrep to work without dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) (owner: EBernhardson) [20:27:43] (PS2) EBernhardson: Fix mwgrep to work without dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) [20:28:30] (CR) jenkins-bot: [V: -1] Fix mwgrep to work without dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) (owner: EBernhardson) [20:29:20] (PS3) EBernhardson: Fix mwgrep to work without dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) [20:31:22] anyone have any enwiki dumps from 2008? [20:35:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [20:38:49] is there a quick and secure way to get files from my home dir on stat1002 to my home dir on stat1003 that doesn't involve scp-ing to my machine and then scp-ing from my machine [20:41:31] Not really... If secure is about data integrity you could use netcat or quickly fire up an ftp/http server and use check sums, if you feel like that [20:42:19] bearloga: yes, but no. 
[20:42:24] not homedir to homedir [20:42:31] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547319 (10RobH) @Alantz: If you guys want to maintain a list of every single ops team member, that is the list that needs access to that data. We should be notified (we shouldn't have to ask) to know w... [20:42:35] but there is an rsync module set up that works between stat host [20:43:08] bearloga: https://wikitech.wikimedia.org/wiki/Analytics/FAQ [20:44:52] ottomata: thanks [20:45:03] links on https://dumps.wikimedia.org/archive/enwiki/20100312/ are broken, missing /archive/ [20:45:26] also noticed https://dumps.wikimedia.org/backup-index.html links to static.wikipedia.org which does not exist [20:47:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [20:50:40] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1547337 (10BBlack) That seems like a reasonable approach to me. [20:50:40] hi - any opsen around who can correct file permissions for me on fermium.eqiad.wmnet please? [20:51:04] JohnFLewis: saw pm, what file? [20:51:37] robh: both work :P www-data:list for /var/lib/mailman/lists/wikiit-l/ (recursive please) [20:52:29] gj JohnFLewis [20:52:46] JohnFLewis: try now [20:53:17] SPF|Cloud: not my fault :P [20:53:27] robh: woo, mailman works again [20:53:39] * SPF|Cloud mumbles something about mailman-roots [20:53:40] !log T109369: Restarted logstash on logstash1003; parsoid gelf events not being recorded since 2015-08-15 [20:53:46] mutante: I've fixed the thing you were having issues with the other day, just needed to look in the right place :) [20:53:53] Anyone know how you can write to /public/dumps in tools? 
[20:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:23] (Without having root on labstore1003) [20:54:38] robh: I'm really interested now - can you tell me what the perms of /var/lib/mailman/lists/wikiit-l/ are on sodium? [20:54:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [20:55:20] I don't get how only that one list had incorrect permissions and every other list didn't? I also noticed it has group of 1000. Unsure which that is on sodium [20:55:38] JohnFLewis: so list is group on it and recursive but recursive owner changes [20:55:42] between www-data and list [20:55:49] on some assorted files [20:56:16] robh: right. I feel it works on lists via group. what gid is 1000 on sodium? [20:56:37] spamd:x:1000: [20:56:57] heh [20:57:06] list is 38 [20:58:12] spamassassin isn't on fermium so presumably wikiit-l was owned by spamd on fermium for some reason. weird but it works now! [20:58:41] (PS1) Ottomata: Puppetize kafka1012 as kafka broker in analytics Kafka cluster [puppet] - https://gerrit.wikimedia.org/r/232202 (https://phabricator.wikimedia.org/T106581) [20:58:57] (PS2) Ottomata: Puppetize kafka1012 as kafka broker in analytics Kafka cluster [puppet] - https://gerrit.wikimedia.org/r/232202 (https://phabricator.wikimedia.org/T106581) [21:00:04] matt_flaschen: Dear anthropoid, the time has come. Please deploy Collaboration team LiquidThreads->Flow and mediawiki-config (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150817T2100). [21:00:27] operations, Wikimedia-Mailing-lists, Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1547351 (JohnLewis) >>! In T108073#1542087, @Dzahn wrote: > > "wikiit-l" broke everything, the listinfo page, manual ./list_lists and eve... 
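A minimal sketch of the gid lookup discussed above, assuming the usual colon-separated `/etc/group` layout (name, password placeholder, gid, member list). `parse_group_line` is a hypothetical helper, and the `list:x:38:` entry is reconstructed from "list is 38" rather than copied from either host. It shows why a numeric gid carried over from one machine can map to a different group name, or to none, on another.

```python
def parse_group_line(line: str):
    """Parse one /etc/group style entry into (name, gid, members).

    Hypothetical helper for illustration; assumes exactly four
    colon-separated fields.
    """
    name, _password, gid, members = line.strip().split(":")
    return name, int(gid), [m for m in members.split(",") if m]


# First entry is quoted in the log; the second is an assumed reconstruction.
sodium_entries = ["spamd:x:1000:", "list:x:38:"]

# Map numeric gid -> group name, as the filesystem only stores the number.
groups = {}
for entry in sodium_entries:
    name, gid, _members = parse_group_line(entry)
    groups[gid] = name

print(groups.get(1000, "unknown gid"))
print(groups.get(38, "unknown gid"))
```

Files copied with numeric ownership preserved keep the number, not the name, so the name shown depends on the group database of whichever host is doing the listing.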
[21:01:53] (03CR) 10Ottomata: [C: 032] Puppetize kafka1012 as kafka broker in analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/232202 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [21:02:30] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1547356 (10JohnLewis) Is this done in theory or is there something else needed before this can be marked as 'done'? [21:04:49] (03PS1) 10Ottomata: Use kafka1012 as hostname in Kafka cluster config [puppet] - 10https://gerrit.wikimedia.org/r/232203 (https://phabricator.wikimedia.org/T106581) [21:05:29] (03CR) 10Ottomata: [C: 032 V: 032] Use kafka1012 as hostname in Kafka cluster config [puppet] - 10https://gerrit.wikimedia.org/r/232203 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [21:07:50] PROBLEM - puppet last run on kafka1012 is CRITICAL puppet fail [21:09:50] RECOVERY - puppet last run on kafka1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:09:59] !log renamed Gadget:Invention, Travel, & Adventure --> Gadget Invention, Travel, & Adventure on enwiki using moveBatch.php to work around a permissions screwup [21:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:29] a permissions screwup? [21:10:56] ori, can you uninstall XDebug (or disable it) from terbium? You installed it earlier to help me debug the memory leak, but now it's breaking the conversion because of xdebug's 100-level function thing [21:11:18] MaxSem, this is https://phabricator.wikimedia.org/T109236 [21:12:00] yup, and I just fixed it:P [21:12:21] MaxSem, okay so how is this a perimssions screwup? 
[21:12:23] permissions* [21:12:54] that there's a NS on the cluster that no user can modify in any way [21:13:03] that's not a screwup [21:13:07] it's deliberate [21:13:52] well, kinda shoulda not put anything into it, then ;) [21:14:04] The page existed before the namespace. [21:15:01] OMG YOUR SOFTWARE CHANGES ARE BREAKING STUFF THE SKY IS FALLING [21:15:59] https://xkcd.com/1172/ [21:17:42] Since ori apparently isn't around, can someone else (with root) disable or uninstall XDebug on terbium? It was only enabled Wed: https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2015-08-12 [21:17:43] that's the most relevant xkcd I've ever seen to all things always [21:18:33] there's also "it's compiling!" and "standards proliferation" [21:18:33] (03PS2) 10BBlack: HTTPS redirects: remove InstantCommons exception [puppet] - 10https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566) [21:19:10] Don't forget Little Bobby Tables (though it should say prepared statements, not sanitize). [21:19:21] PROBLEM - puppet last run on analytics1003 is CRITICAL Puppet has 1 failures [21:19:47] matt_flaschen, say that to DatabaseBase? ;) [21:20:28] MaxSem, there are a lot of things in MediaWiki that the word 'should' does not yet apply to. :) [21:20:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [21:21:31] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:23:28] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1547385 (10Krenair) Does anyone know when cluster16 was decommissioned? Maybe we could restore it from a dump. Might the revision be... 
[21:23:30] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:25:50] bblack, could you uninstall xdebug on terbium? It was only installed Wed. to try to help with the LQT->Flow conversion, which it's now instead blocking. [21:26:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [21:26:46] 10Ops-Access-Requests, 6operations: Grant ebernhardson access to stat1002 to query hive - https://phabricator.wikimedia.org/T109356#1547393 (10EBernhardson) This is being requested so i can access cirrussearch logs that will soon be traveling through kafka into hadoop and be queryable via hive. [21:27:55] matt_flaschen: you want someone to uninstall php5-xdebug and..reload apache? [21:28:12] chasemp, don't need Apache to be reloaded, since it's a command-line script. [21:28:30] I can do this based on SAL comment sure [21:28:33] seems nothing else uses it [21:28:41] Thank you. [21:28:46] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1547410 (10BBlack) So, I see new releases a week ago for 1.2[345] containing the InstantCommons fix. Also, it's been about a... [21:30:26] (03PS3) 10Andrew Bogott: Add multimedia packages (e.g. 
ghostscript for pdfinfo) to silver [puppet] - 10https://gerrit.wikimedia.org/r/231293 (https://phabricator.wikimedia.org/T93041) (owner: 10Alex Monk) [21:31:27] !log remove php5-xdebug from terbium per mattflaschen [21:31:28] done [21:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:25] Thanks, chasemp [21:33:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [21:33:53] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [21:34:29] ottomata: yer paging [21:34:47] unless that wasnt intended? [21:36:49] ottomata: I see you are logged into it, you are fixng right? [21:37:08] idle for a couple of minutes [21:37:16] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547464 (10ALantz) @RobH: We could add another two people from Ops to our on and offboarding email notification list. The email notifications currently go to representatives from Administration, Financ... 
[21:37:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 5 below the confidence bounds [21:37:52] RECOVERY - Kafka Broker Server on kafka1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [21:38:46] 6operations, 6Performance-Team, 10Traffic: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#1547485 (10BBlack) 3NEW [21:43:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL 100.00% of data above the critical threshold [5000000.0] [21:44:00] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547521 (10RobH) @Alantz, We're attempting to create a procedure that doesn't have any single point of failures, or bottlenecks. While allowing some of us to view the documents is a good first step, it... [21:44:23] RECOVERY - puppet last run on analytics1003 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:45:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 6 below the confidence bounds [21:45:43] chasemp, it's giving an error, "Failed loading /usr/lib/php5/20090626/xdebug.so: /usr/lib/php5/20090626/xdebug.so: cannot open shared object file: No such file or directory". Not sure if it needs --purge, or /etc/php5/cli/conf.d/xdebug.ini needs to be removed manually. [21:45:46] chasemp, doesn't block me though. [21:46:12] that's new kafka node coming up! normal! [21:47:36] matt_flaschen: again? [21:47:40] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547529 (10RobH) Task note: I'll be creating a sub-task off of this for auditing the now shared HR document (shared to some of us) to audit our production access lists/ldap groups/wikitech groups against... 
[21:48:22] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0] [21:48:31] chasemp, it wasn't the same error before. Before XDebug was installed and doing annoying things, now it's not installed, but it's still being loaded by the INI (PHP just ignores the error other than logging it, though). [21:48:41] oh I'm sorry, I meant can you try again :) [21:50:19] chasemp, yeah, it's fine now. Thank you. [21:50:27] sure thing sorry about that [21:52:18] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1547542 (10Krenair) a:3ArielGlenn [21:54:34] (03CR) 10Mattflaschen: [C: 032] Convert wmgLiquidThreadsBackfill to wmgLiquidThreadsFrozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228192 (https://phabricator.wikimedia.org/T107068) (owner: 10Mattflaschen) [21:54:42] (03CR) 10Mattflaschen: [C: 032] Disable Special:NewMessages on wiki with LiquidThreads frozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229460 (https://phabricator.wikimedia.org/T107898) (owner: 10Sbisson) [21:55:00] (03Merged) 10jenkins-bot: Convert wmgLiquidThreadsBackfill to wmgLiquidThreadsFrozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228192 (https://phabricator.wikimedia.org/T107068) (owner: 10Mattflaschen) [21:55:05] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1547549 (10RobH) 3NEW a:3RobH [21:55:21] (03Merged) 10jenkins-bot: Disable Special:NewMessages on wiki with LiquidThreads frozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229460 (https://phabricator.wikimedia.org/T107898) (owner: 10Sbisson) [21:55:42] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0] [21:56:49] (03PS4) 10EBernhardson: Fix mwgrep to 
work without dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) [21:57:25] !log mattflaschen@tin Synchronized wmf-config: LQT->Flow: Make frozen wikis no longer able to create LQT pages (duration: 00m 13s) [21:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:58:08] (03CR) 10Andrew Bogott: [C: 032] Add multimedia packages (e.g. ghostscript for pdfinfo) to silver [puppet] - 10https://gerrit.wikimedia.org/r/231293 (https://phabricator.wikimedia.org/T93041) (owner: 10Alex Monk) [22:00:00] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1547579 (10Pine) Thanks for working on this. I'm going to remove myself from the cc list but I'm glad to know that this is happening. [22:00:39] robh, I'm sure I could come up with many more places to check :p [22:01:00] if they are places for ops, uhh, please mention it =] [22:01:09] i know the wmf ldap group will cover all the oddball web services [22:01:14] wikitech covers labs [22:01:18] for example, don't we have some staff-restricted labs projects or something? [22:01:23] thats wikitech [22:01:27] =] [22:01:33] not really wikitech groups though [22:01:34] but ok [22:01:36] oh? [22:01:53] yea... i see what you mean [22:01:55] OpenStackManager stores project membership separately from wiki groups [22:02:13] What about mailman list passwords? [22:02:41] if its a staff maintianed mailing list [22:02:45] its the issue of hte mailing list admin [22:02:55] the only mailing lists ops manages are operations and the master password. [22:03:03] otherwise its list admin or OIT [22:03:07] (afaik) [22:03:12] okay [22:03:21] I'm not disagreeing thats important mind you, it is! 
[22:03:26] I guess that's not really something you can handle in general [22:03:51] if it was access given to specific users it'd be simple [22:04:14] but the password system... anyone can get a password, adminship moves to a new person who doesn't know who else has the password [22:04:32] quite easy to lose track of who is a list admin [22:04:40] ah, mailman and password management [22:05:27] Krenair: that's why people should put their email down to be subscribed to -owner; if not - their fault :) [22:05:40] heh [22:06:20] andrewbogott, did puppet run on silver? [22:06:37] Krenair: yes [22:07:01] Krenair: did something change? For the better? Or the worse? [22:07:13] not that I've found yet [22:07:25] ooh, hang on [22:07:30] https://wikitech.wikimedia.org/w/thumb.php?f=DiscoveryReport.pdf&width=600 - that actually work snow [22:07:32] works now* [22:08:09] So, no pdfinfo command like on tin, but it installed gs which fixed some pdf thumbnails [22:09:43] looks like thumbnails have worked for other pdfs too [22:12:13] 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1547621 (10Krenair) 5Open>3Resolved Special:ListFiles is looking much better [22:12:43] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet last ran 2 days ago [22:13:54] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1547625 (10RobH) There are going to be a few users where the answer isn't obvious. On those specific users, I'll note them here and create a task for tracking down... 
[22:15:59] 6operations: Determine Sam Reed's access rights - https://phabricator.wikimedia.org/T109386#1547637 (10RobH) 3NEW a:3Reedy [22:17:10] Krenair: I was just about to do that :) [22:17:23] 6operations: Determine Sam Reed's access rights - https://phabricator.wikimedia.org/T109386#1547662 (10RobH) My understanding (and that of others in IRC) is Sam is now a volunteer, and we want him to retain ALL the same access rights he had before. It would be excellent to get both his input and possibly whoeve... [22:19:00] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547673 (10ALantz) @RobH: While creating a task in phabricator is a simple task for many, it's outside of our tools and as such is not a simple, quick task. I'm here to help figure out a compromise as ou... [22:19:11] !log LQT->Flow done on MediaWiki.org. [22:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:55] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1547677 (10RobH) Well, there were 6 on here we only recently located in other tasks (linked off the initial T108131). The only outstanding question for shell acces... [22:21:22] that wasnt as bad as i thought, we have no ex employees let go in 2015 that still have shell. [22:21:33] left/letgo/quit/whatever [22:21:43] other than that tricky Reedy fellow. [22:23:27] what about jsahleen? [22:23:38] removed when I pointed it out [22:23:47] now absent [22:24:42] yea that one was fixed [22:26:50] 6operations: Determine Sam Reed's access rights - https://phabricator.wikimedia.org/T109386#1547684 (10greg) >>! In T109386#1547662, @RobH wrote: > It would be excellent to get both his input and possibly whoever is managing him as a volunteer (likely his same manager from when he was an employee.) ohai Yes, I... 
[22:29:58] 6operations: Determine Sam Reed's access rights - https://phabricator.wikimedia.org/T109386#1547685 (10Krenair) To sign the volunteer NDA you'll need to join #WMF-NDA-Requests (which should be possible as you're a member of #Security ) and then L2 should be visible to you [22:30:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [22:31:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK Less than 1.00% above the threshold [1000000.0] [22:34:13] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:38:59] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1547693 (10RobH) So it seems this list is accounted for; with the exceptions of Sam and Nik. I'll create another sub-task for Nik's access resolution. [22:44:54] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [22:51:24] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547709 (10RobH) 3NEW a:3tomasz [22:59:43] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547719 (10JohnLewis) a:5tomasz>3Tfinc [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150817T2300). Please do the needful. [23:00:10] robh: ^^ :) [23:00:35] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547722 (10Tfinc) @ebernhardson Can you comment on this? [23:00:48] opps =P [23:00:52] thx for fix! 
[23:02:46] SWAT is empty :) [23:05:16] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547727 (10EBernhardson) I talked with nik about this before he departed, it makes sense for him to retain shell access for now. He has by far the most knowledge of our existing search infrastructure i... [23:09:23] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547731 (10RobH) @EBernhardson: I'm not sure that covers all the use-cases those groups give him. If you guys are fine with him retaining the full sudo level rights (logstash roots, full mediawiki de... [23:13:40] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547733 (10EBernhardson) 100% volunteer confirmed. I'm fine with him retaining full sudo level rights. [23:13:49] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547734 (10RobH) I'll outline what each group does: wikidata-query-roots: Full root on the Wikidata Query Service nodes statistics-privatedata-users: Have access to so that they can do analysis on web... [23:14:22] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:15:09] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1547735 (10RobH) a:5Tfinc>3RobH I'll go ahead now and follow up with Nik on getting the NDA signed. [23:19:22] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:19:22] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:23:03] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:24:12] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:24:12] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [23:24:22] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:24:22] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [23:24:22] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:24:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [23:25:02] RECOVERY - YARN NodeManager Node-State on analytics1041 is OK YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING [23:28:57] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547779 (10RobH) Additional Note for Offboarding: Who covers removal of phabricator groups. (This isn't directed to HR, this is a tech side question.) [23:29:12] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:29:12] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [23:29:22] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:29:22] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [23:29:23] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:29:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [23:32:18] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547781 (10RobH) @Alantz: The problem I have with that is IT needs to know, Ops needs to know, and anyone handling internal tooling (like phabricator) needs to know. As its multi-teams have to interact w... 
[23:33:40] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547782 (10Krenair) >>! In T108131#1547779, @RobH wrote: > Additional Note for Offboarding: Who covers removal of phabricator groups. (This isn't directed to HR, this is a tech side quest... [23:33:56] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1547785 (10RobH) I just cannot think of an easier way for HR to notify multiple related teams of offboarding than to use the task tracking system used by the whole of engineering (and thus... [23:34:12] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:34:12] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [23:34:22] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:34:22] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [23:34:22] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:34:22] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [23:35:22] PROBLEM - puppet last run on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:22] PROBLEM - salt-minion processes on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:22] PROBLEM - Hadoop DataNode on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:22] PROBLEM - Disk space on Hadoop worker on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:22] PROBLEM - dhclient process on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:23] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: Timeout while attempting connection [23:35:42] PROBLEM - RAID on analytics1041 is CRITICAL: Timeout while attempting connection 
[23:36:03] PROBLEM - SSH on analytics1041 is CRITICAL: Connection timed out [23:36:34] PROBLEM - configured eth on analytics1041 is CRITICAL: Timeout while attempting connection [23:36:52] PROBLEM - DPKG on analytics1041 is CRITICAL: Timeout while attempting connection [23:36:52] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: Timeout while attempting connection [23:36:53] PROBLEM - Disk space on analytics1041 is CRITICAL: Timeout while attempting connection [23:39:12] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:39:12] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [23:39:22] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:39:22] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [23:39:23] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [23:39:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [23:40:53] PROBLEM - Host analytics1041 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:02] (03CR) 10Tim Landscheidt: [C: 031] replace $::instanceproject with $::labsproject [puppet] - 10https://gerrit.wikimedia.org/r/230652 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott) [23:43:22] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1547796 (10JohnLewis) >>! In T108073#1542086, @Dzahn wrote: > ``` > ==> /var/log/mailman/mischief <== > Aug 15 02:14:58 2015 (3431) Hostile... 
[23:43:35] mutante: ^ noting down what I just said to you :) [23:44:58] operations, Wikimedia-Mailing-lists, Patch-For-Review: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1547801 (JohnLewis) Clarify: List *names* can be capital. List identifiers (the /listinfo/* part and directories and what Mailman knows th... [23:47:36] operations, Wikimedia-Mailing-lists: rename wikitech-announce.disabled.T100503 - https://phabricator.wikimedia.org/T109393#1547808 (Dzahn) NEW a:Dzahn [23:48:03] JohnFLewis: it was already on that ticket [23:48:15] hm? [23:48:52] well, the log from mischief.log [23:49:02] yeah [23:49:05] new task to rename that list created now [23:49:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [23:58:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]