[02:39:51] !log logstash Testing stashbot SAL message processing
[02:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Logstash/SAL, Master
[02:49:28] hmm, my downloads from ogvjs-testing.wmflabs.org seem to be throttled... or else there's some slowdown in there :D
[02:49:59] throttling would seem sensible though
[02:50:37] brion: is that through the proxy? There isn't any throttling in place
[02:50:52] yeah that'll be through the web proxy
[02:50:56] lemme confirm it's not ssl-only
[02:51:17] nah that wasn't even ssl
[02:52:12] hmmmm ok that's weird. lemme double-check my network
[02:54:13] interesting
[02:54:23] i'm gonna laugh if this turns out to be an ipv4 vs ipv6 thing
[02:55:43] Yeah we don't do v4 in labs
[02:55:55] ipv6 to upload.wikimedia.org goes via a different route than v4 to labs
[02:56:06] he.net for v6, ntt.net for v4
[02:56:20] we don't do v4 in labs?
[02:56:43] downloading via v4 to my linode from labs proxy is fast...
[02:57:33] Krenair: heh..... i think we main don't do v6 in labs :D
[02:57:35] *mean even
[02:58:33] ...and v4 from labs back to me goes via gtt.net
[02:58:54] networking is confusing stuff :D
[02:59:33] ah well, it's working well enough for me, just some of my higher-bandwidth videos play slowly in my testing due to the bottleneck
[03:00:24] it's down to about 1-1.5 Mbit/s
[03:00:59] I meant v6
[03:01:04] * YuviPanda clearly should put phone down
[03:02:13] remember when 56k modems were the shizznit? :D
[03:02:18] No :p
[03:02:56] * brion gets his cane again
[03:19:00] Hi YuviPanda, you busy?
[03:39:34] brion: if It makes you feel any better I remember. I remember when 19.2 was fancy and fast
[04:59:00] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 2 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1483907 (Ricordisamoa) >>! In T106897#1483754, @MZMcBride wrote: >>>! In T106897#1482983...
[05:21:28] Labs, Tool-Labs: Migrate individual tools to trusty to relieve pressure on older precise nodes - https://phabricator.wikimedia.org/T88228#1483917 (Ricordisamoa) >>! In T88228#1483512, @scfc wrote: > Ubuntu Precise is supported until [[https://wiki.ubuntu.com/Releases|April 2017]] But it offers PHP 5.3, w...
[06:13:05] Labs, Labs-Infrastructure: meta_p.wiki is missing url value for wikipedia wikis and lang value is wrong - https://phabricator.wikimedia.org/T107004#1483980 (Merl) NEW a:coren
[06:14:29] Labs, Labs-Infrastructure: meta_p.wiki is missing url value for wikipedia wikis and lang value is wrong - https://phabricator.wikimedia.org/T107004#1483989 (Glaisher)
[06:14:31] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 2 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1483990 (Glaisher)
[09:22:32] Tool-Labs-tools-Other, Tracking: merl tools (tracking) - https://phabricator.wikimedia.org/T69556#1484265 (Addshore)
[09:40:56] PROBLEM - SSH on tools-exec-1213 is CRITICAL - Socket timeout after 10 seconds
[09:45:47] RECOVERY - SSH on tools-exec-1213 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[09:46:10] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, Labs-Q4-Sprint-4, ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1484300 (faidon) The solution I thought of was using the internal glibc function `__nss_configure_lookup` to explicitly configure LDAP for mountd (while...
[10:21:00] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 2 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1484415 (Vituzzu) >>! In T106897#1483754, @MZMcBride wrote: > > In this particular case...
[11:34:43] (CR) Lucie Kaffee: [C: 1] Allow to serialize Element objects as JSON [labs/tools/ptable] - https://gerrit.wikimedia.org/r/225269 (owner: Ricordisamoa)
[11:39:28] (CR) Ricordisamoa: [C: 2 V: 2] Allow to serialize Element objects as JSON [labs/tools/ptable] - https://gerrit.wikimedia.org/r/225269 (owner: Ricordisamoa)
[13:02:00] Labs, Tool-Labs, Database: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T107029#1484659 (Giftpflanze) NEW
[13:14:52] Labs, Labs-Infrastructure, hardware-requests, operations: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484694 (Andrew) As per https://phabricator.wikimedia.org/T105723, we now have hot spares for all vital labs services /except/ internal DNS. The associated task https://pha...
[13:19:11] Labs, Labs-Infrastructure, hardware-requests, operations: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484699 (mark) It's a bit unclear to me what lives where now, and what the plan for this is. Also serving our documentation, could you make a simple map of what essential ma...
[14:27:39] Hi! I'm trying 'sql enwiki' after sshing in to tool labs but it prompts me for a password. No documentation for same on https://wikitech.wikimedia.org/wiki/Help:Tool_Labs Help?
[14:28:51] try copying replica.my.cnf to .my.cnf
[14:31:51] gifti: Thanks. I'll try that.
[14:35:25] gifti: I don't seem to have a replica.my.cnf in my home directory by default.
[14:36:35] hm
[14:36:42] that might be an error
[14:37:29] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1484861 (coren)
[14:37:30] Labs: start-nfs script warning message is scary and wrong - https://phabricator.wikimedia.org/T101742#1484859 (coren) Open>Invalid The message was not //wrong//; having the other server powered off was the only guaranteed safe way to make certain none of the arrays were assembled in any way, and that th...
[14:40:38] Niharika: please create a task on phabricator, labs project, YuviPanda as CC
[14:41:02] valhallasw`cloud: Okay.
[14:41:36] Labs, Labs-Infrastructure, hardware-requests, operations: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484884 (Andrew) I've updated https://wikitech.wikimedia.org/wiki/Labs_infrastructure#dns and added diagrams.
[14:43:08] Labs: Setup monitoring and reporting for disk space usage of each project on NFS - https://phabricator.wikimedia.org/T106476#1484886 (coren) We may still want to turn quotas on to do the monitoring itself (just don't put quotas) - a du over millions of files is very I/O intensive, and very long.
[14:44:30] Labs, Continuous-Integration-Infrastructure, Labs-Infrastructure: Diamond metrics for cpu.system suddenly up 100% after a reboot - https://phabricator.wikimedia.org/T95912#1484891 (hashar) Open>Resolved a:hashar I rebooted the slaves and all metrics look fine.
[14:44:39] Labs: Home directory does not contain replica.my.cnf - https://phabricator.wikimedia.org/T107034#1484895 (NiharikaKohli) NEW
[14:44:42] Labs, Continuous-Integration-Infrastructure, Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1484903 (hashar) I rebooted the slaves and all metrics look fine.
[14:44:46] Labs, Continuous-Integration-Infrastructure, Labs-Infrastructure: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1484904 (hashar) Open>Resolved a:hashar
[14:45:58] Why does tools' php not have interactive shell support?!
[14:49:12] Niharika: I'm not sure what you mean
[14:50:08] valhallasw`cloud: From the docs: "Database credentials (credential user/password) are stored in the 'replica.my.cnf' file found in the tool account’s home directory. To use these credentials with command-line tools by default , copy 'replica.my.cnf' to '.my.cnf'."
[14:50:13] Labs: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1484915 (coren) @yuvipanda: We now have working on-demand backups, pending a script to manage cleanup of snapshots we could now automate this entirely. Do you have a preference for the retention policy? I was cons...
[14:50:27] Niharika: yes. the fact replica.my.cnf is missing is a bug
[14:50:37] oh, sorry
[14:50:41] valhallasw`cloud: I filed that as a bug.
[14:50:43] I mis-tabbed you
[14:50:50] sorry!
[14:50:51] Negative24: I'm not sure what you mean
[14:50:55] No problem. :)
[14:52:03] valhallasw`cloud: php -a doesn't bring me to a prompt which usually means the package wasn't built with the support but I know the one from ubuntu ppa was built with it
[14:52:21] Negative24: we're not using any ppa's, just the regular ubuntu php package
[14:52:50] valhallasw`cloud: exactly. regular ubuntu packages still come from their ppa which has the support
[14:53:27] eh? ppa by definition means it's not a regular ubuntu package
[14:54:05] in any case, it could be a bug with the ubuntu package, or it could be a misconfiguration (missing package?) on the tool labs end, I'm not sure
[15:01:49] Negative24: so, not sure. please file a bug.
[15:02:13] Labs: Ensure that labstore machine is 'known good' hardware - https://phabricator.wikimedia.org/T106479#1484951 (coren) Switching back to labstore1001 should be a high priority but - as it requires significant downtime - needs to be planned and a good time found. What needs to happen: * Make certain labstor...
[15:02:40] valhallasw`cloud: will do (I use ppa loosely)
[15:10:39] Labs, Labs-Sprint-107: Setup monitoring and reporting for disk space usage of each project on NFS - https://phabricator.wikimedia.org/T106476#1484981 (coren)
[15:11:08] Labs, Labs-Sprint-107: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1484983 (coren)
[15:11:35] Labs, Labs-Sprint-107: Ensure that labstore machine is 'known good' hardware - https://phabricator.wikimedia.org/T106479#1484990 (coren)
[15:14:15] Labs, Tool-Labs: Migrate individual tools to trusty to relieve pressure on older precise nodes - https://phabricator.wikimedia.org/T88228#1485002 (scfc) Yes, and that is what Canonical committed to until April 2017. For Trusty, our users' security also relies on them constantly publishing security update...
[15:14:44] Labs, Labs-Infrastructure, operations, ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485005 (coren) @yuvipanda: No, the hardware is known to have issues - though up to date it's always been fully working once it gets working at all (all the issues...
[15:17:32] Labs: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1485019 (coren)
[15:19:33] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1485031 (coren) NEW
[15:22:59] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 3 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1485053 (coren) This should be done, or very near completion. As far as I can tell, there is no unpuppetized configuration but I'm no...
[15:23:34] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1485056 (coren)
[16:25:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 22.22% of data above the critical threshold [0.0]
[16:49:52] Labs, Tracking: Create stashbot project - https://phabricator.wikimedia.org/T107047#1485319 (bd808) NEW
[16:50:07] Labs, Tracking: Create stashbot project - https://phabricator.wikimedia.org/T107047#1485319 (bd808) a:bd808
[16:57:18] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1485353 (bd808)
[16:57:18] Labs, Tracking: Create stashbot project - https://phabricator.wikimedia.org/T107047#1485351 (bd808) Open>Resolved https://wikitech.wikimedia.org/wiki/Nova_Resource:Stashbot
[17:32:33] Labs, Labs-Infrastructure, Labs-Sprint-107: labnet1001 is a spof - https://phabricator.wikimedia.org/T106141#1485422 (Andrew)
[17:32:56] Labs, Labs-Infrastructure, Labs-Sprint-107: holmium is a spof - https://phabricator.wikimedia.org/T106142#1485425 (Andrew)
[17:32:57] Labs, Labs-Infrastructure, Labs-Sprint-105, Labs-Sprint-106, Labs-Sprint-107: replica.my.cnf creation broken - https://phabricator.wikimedia.org/T104453#1485424 (yuvipanda)
[17:35:41] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[17:36:25] Labs, Labs-Sprint-107: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1485436 (Andrew)
[17:37:54] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1485449 (Andrew)
[17:38:53] Labs, Labs-Sprint-107: Build proper monitoring for making sure that processes that need to run only once on one labstore only are running only once on one labstore only - https://phabricator.wikimedia.org/T106590#1485454 (yuvipanda)
[17:46:02] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 3 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1485498 (yuvipanda)
[17:48:44] Labs, Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1485513 (yuvipanda)
[17:52:11] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1485527 (yuvipanda)
[17:53:19] Labs: Have checkpoint checks for all labs services - https://phabricator.wikimedia.org/T107058#1485532 (yuvipanda) NEW
[17:53:37] Labs, Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1485539 (yuvipanda)
[17:53:38] Labs: Have checkpoint checks for all labs services - https://phabricator.wikimedia.org/T107058#1485532 (yuvipanda)
[17:53:58] Labs, Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1450114 (yuvipanda)
[17:53:59] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1485540 (yuvipanda)
[17:54:23] YuviPanda: does OSM keep me from making hosts with non-unique names even though we have project in the fqdn now?
[17:54:43] having my project name twice in the host name seems ... anoying
[17:54:44] bd808: we support old style naming still, so you do need unique names unfortunately
[17:54:46] it is
[17:57:20] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1485567 (Andrew) Let's schedule this for one of the live labvirts next week.
[17:57:32] also very annoyed that you have to setup a scap master in order to setup a trebuchet master
[18:04:08] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 3 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1485594 (yuvipanda) a:Krenair>coren
[18:04:27] to make a trebuchet from scratch first you must create teh universe :)
[18:04:31] bd808: indeed, we should have logstash be able to run without needing trebuchet
[18:04:43] you need to go to the imperial age and build a castle I think
[18:04:56] upgrades at the workshop are also useful to provide more armor to the trebuchet
[18:06:11] Labs: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1485599 (yuvipanda)
[18:06:16] YuviPanda: I could decouple logstash from trebuchet but elasticsearch needs it too so :/
[18:06:24] whatever
[18:06:28] :(
[18:09:08] * bd808 hunts through http://apt.wikimedia.org for the right logstash-contrib package
[18:12:43] Labs, Tool-Labs, Patch-For-Review: Create process for 'tool labs is down' notifications on tools.wmflabs.org/* - https://phabricator.wikimedia.org/T102971#1485628 (valhallasw) a:valhallasw>None To clarify, the current patchset now contains said errorpage. Unassigning myself as someone else has to...
[18:15:39] * valhallasw`cloud wonders why people like the new gerrit changes screen
[18:15:52] probably people with more screen real estate
[18:16:45] do we do local gerrit style changes?
[18:46:11] Labs: Measure capacity and utilization of labs services (Tracking) - https://phabricator.wikimedia.org/T107066#1485722 (yuvipanda) NEW
[18:49:57] Labs: Measure capacity and utilization of labvirt**** servers - https://phabricator.wikimedia.org/T107067#1485740 (yuvipanda) NEW
[18:50:35] Labs, Labs-Sprint-107: Setup monitoring and reporting for disk space usage of each project on NFS - https://phabricator.wikimedia.org/T106476#1485747 (yuvipanda)
[18:52:51] Coren: YuviPanda: https://gerrit.wikimedia.org/r/#/c/226939/ is awaiting review to fix breaking regression https://phabricator.wikimedia.org/T106897 which is breaking tools that implement wiki-agnostic config
[18:53:20] Krinkle: Will do this right after our ops meeting.
[18:53:46] Thanks :)
[18:58:36] YuviPanda: seen the low drive space warning for tools-bastion-01? there's only 1.5G free
[18:58:49] YuviPanda: not sure why though. /tmp is only 1.5G, but can probably be cleaned
[18:58:57] valhallasw`cloud: probably /var/log
[18:59:07] ah, yeah
[18:59:10] 8.3G
[18:59:20] pacct, specifically
[19:00:59] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1485787 (yuvipanda) use labvirt1009, has only 3 tools instances and they all can be failed over or sustain downtime.
[19:02:32] YuviPanda: so it's the same 10.64.37.10-man stuff we say before
[19:02:41] booooo
[19:02:44] not sure wtf's up with that...
[19:02:48] other instances aren't affected.
[19:03:09] it's something with nfs, it seems? that ip is labstore
[19:03:21] Labs, Labs-Infrastructure, operations, ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485790 (yuvipanda) Heh, it did thankfully work when it was rebooted last time accidentally.
[19:03:31] Labs, Labs-Infrastructure, operations, ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485791 (yuvipanda) (thankfully - let's not do that again, etc)
[19:03:56] valhallasw`cloud: that is https://phabricator.wikimedia.org/T107052
[19:04:15] Which has no details and anyway I have no explanation
[19:04:38] Labs, Labs-Sprint-107: Identify services labs provides - https://phabricator.wikimedia.org/T105721#1485792 (yuvipanda)
[19:04:40] The format of that log doesn’t even correspond with the documentation for what kinds of things can be in there. It has an ip address where there should be a commandline
[19:05:59] andrewbogott: I think it is a command line, actually, but I'm not completely sure
[19:06:13] valhallasw@tools-bastion-01:/var/log/account$ sudo lastcomm | grep -v man | head
[19:06:13] sshd S root __ 0.01 secs Mon Jul 27 19:06
[19:06:18] ^ this looks completely sane
[19:06:25] oh, that’s not what I was seeing before
[19:06:29] but the rest is 10.64.37.10-man F root __ 0.00 secs Mon Jul 27 19:06
[19:06:45] but also, an hour ago I deleted the most enormous of those log files. So maybe it isn’t misbehaving at the moment
[19:06:51] it is
[19:06:58] Um… not that I like erasing history, but the instance was about to die
[19:06:59] 19:06 is now
[19:07:10] ah, right
[19:07:11] 10.64.37.10-man
[19:07:16] so that ip is labstore
[19:07:17] how is that a command?
[19:07:27] so I'm guessing it's something where the nfs mount reports a weird name?
[19:07:32] not completely sure
[19:07:49] might be a red herring
[19:07:55] is this happening on bastion-02?
[19:08:04] -01
[19:08:16] You mean, the fact that it’s labstore’s IP? It could be a red herring but it’s awfully suspicious
[19:08:19] but not on -02, right?
[19:08:31] Random suspicion: that log freaks out while a backup is running
[19:08:34] YuviPanda: haven't checked
[19:08:36] I haven’t tested that though
[19:08:40] I just checked, not on -02
[19:08:56] and now I’m about to go eat lunch. But I welcome further info on that ticket! I looked at it last week and got nowhere.
[19:08:59] so... maybe 1. switchover tools-login to -02 (just DNS, not IP)
[19:09:04] 2. wait for connections to drain from -01
[19:09:08] 3. kill -01 and rebuild
[19:09:15] 4. wonder forever wtf was going on
[19:09:29] this is the cowardly way out ofc
[19:09:35] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1485804 (valhallasw)
[19:09:58] Also, I scheduled a reboot for that bastion and then forgot to do it :(
[19:10:22] heh
[19:10:25] andrewbogott: 'tis ok
[19:14:29] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1485831 (yuvipanda) NEW
[19:15:36] Labs, Tool-Labs: Grants for my Tools-db missing to insert new lines - https://phabricator.wikimedia.org/T98790#1485839 (Kolossos) Open>Resolved a:Kolossos Thanks, it seems to work.
[19:17:01] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1485849 (yuvipanda) This is basically capacity planning for these boxes, since we've been seeing more consistent issues with them now. @jcrespo how do you think we should handle these?
[19:18:05] Labs, Tracking: Sn1per mediawiki testing labs project - https://phabricator.wikimedia.org/T106086#1485852 (Sn1per) Perhaps https://wikitech.wikimedia.org/wiki/Nova_Resource:Mediawiki-api would be relevant?
[19:20:40] Labs, Beta-Cluster, operations, Monitoring: Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production - https://phabricator.wikimedia.org/T97865#1485870 (hashar) Will be done with Jenkins, see {T106421}.
[19:21:57] Labs, Tracking: Sn1per mediawiki testing labs project - https://phabricator.wikimedia.org/T106086#1485879 (yuvipanda) Most probably.I could add you to that project.
[19:22:28] [intuition] Krinkle pushed 1 new commit to master: https://github.com/Krinkle/intuition/commit/f6921fde6693928f7c712632f47852a6001e66d2
[19:22:29] intuition/master f6921fd Timo Tijhof: Update Raun messages, after integrating ORES...
[19:22:49] Krinkle: nice! (re: ORES integration)
[19:23:05] [intuition] Krinkle closed pull request #49: Updating Raun messages (master...master) https://github.com/Krinkle/intuition/pull/49
[19:25:59] Labs, Tool-Labs, Patch-For-Review: toolsbeta-puppetmaster3 can't resolve hiera('labs_puppet_master') - https://phabricator.wikimedia.org/T106627#1485916 (scfc) Open>Resolved Removed.
[19:32:12] * valhallasw`cloud wonders why pacct doesn't contain the parent pid >_<
[19:32:22] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 55.56% of data above the critical threshold [0.0]
[19:33:13] Labs: Measure capacity and utilization of labvirt**** servers - https://phabricator.wikimedia.org/T107067#1485943 (yuvipanda) @andrew says this should be done with https://wiki.openstack.org/wiki/Ceilometer and is pending an OpenStack upgrade
[19:33:57] wait, it does
[19:41:02] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1485994 (valhallasw) OK, so besides `lastcomm`, there's also `dump-acct` to parse the pacct file, which *does* dump the parent pid. In this case: ``` command, version,...
[19:41:04] andrewbogott: ^
[19:43:41] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1485999 (jcrespo) I can help and even own this task. There is an already installed measurement plugin for MySQL (user_stats) https://www.percona.com/doc/percona-server/5.5/diagnostics/user_stat...
[19:44:07] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1486002 (scfc) `tools-bastion-01`'s IP address is `10.68.17.228`, the IP in the first column refers to: ``` root@tools-bastion-01:~# host 10.64.37.10 10.37.64.10.in-add...
[19:44:52] Labs, Tracking: Sn1per mediawiki testing labs project - https://phabricator.wikimedia.org/T106086#1486003 (Sn1per) >>! In T106086#1485879, @yuvipanda wrote: > Most probably.I could add you to that project. That would be great, thanks! :) (still have to read up on documentation on how to use labs :P)
[19:49:22] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1486020 (valhallasw) It's supposed to be a command that runs on `tools-bastion-01`, called `10.64.37.10-man`.
[20:07:21] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[20:23:32] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1486106 (scfc) According to http://unix.stackexchange.com/questions/63569/what-are-processes-123-45-78-901-ma-on-linux-where-the-number-is-an-nfs-ser, it seems to be a "...
[20:27:40] Labs, Labs-Infrastructure: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486110 (Merl) NEW a:coren
[20:28:12] Labs, Labs-Infrastructure: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486122 (Merl)
[20:30:59] Labs, Tool-Labs, Tool-Labs-tools-Global-user-contributions, Labs-Infrastructure, and 3 others: meta_p.wiki table corrupt (contains many NULL entries for 'url' field) - https://phabricator.wikimedia.org/T106897#1486125 (Krenair) Open>Resolved Looks fixed to me
[20:31:16] Labs, Labs-Infrastructure: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486129 (Krenair) I guess the maintain-replicas run @coren did earlier for T106897 fixed the views.
[20:35:17] Labs, Labs-Infrastructure, Patch-For-Review: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486136 (Merl) But host file is still missing the update $ date && mysql -hazbwiki.labsdb Mon Jul 27 20:33:27 UTC 2015 ERROR 200...
[20:37:46] Labs, Labs-Infrastructure, Patch-For-Review: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486138 (Krenair) Yes, I uploaded a patch for that. :)
[20:41:40] Labs, Tool-Labs: [tracking] Tool labs admin guides - https://phabricator.wikimedia.org/T104734#1486153 (valhallasw)
[20:42:10] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL 66.67% of data above the critical threshold [0.0]
[20:42:25] hmmm
[20:42:26] that's strange
[20:44:45] pff, I forgot how crazily large the kernel is
[20:45:43] valhallasw`cloud: were you trying to grep for that error in kernel?
[20:45:55] I'm trying to grep for kthread in the nfs module
[20:47:28] right
[20:48:20] how often is puppet supposed to run on tools-bastion-01?
[20:49:53] every 20mins
[20:57:17] RECOVERY - Puppet failure on tools-webproxy-01 is OK Less than 1.00% above the threshold [0.0]
[20:59:25] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1486219 (valhallasw) My hunch is that these are actual processes (their PIDs are 10859, 10860, etc) spawned by NFS somehow, but I'm not sure how to confirm that. Searchi...
[21:12:58] valhallasw`cloud: thank you for researching this! I’m working on other things but am happy to assist if there’s anything useful I can do (e.g. rootwise)
[21:14:13] Labs, Labs-Infrastructure, Patch-For-Review: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486301 (Krenair) Open>Resolved
[21:16:06] andrewbogott: maybe a 'man' on the nfs host, to see if there's any obvious processes that jump out there? I don't think manage-nfs-volumes-daemon could have anything to do with what we see, though
[21:17:11] root@labstore1002:~# man
[21:17:11] man mandb
[21:17:11] manage-keys-nfs manpath
[21:17:11] manage-nfs-volumes-daemon
[21:17:28] …not very helpful
[21:18:18] one thing that also really confuses me is that kernel threads typically start with their module name
[21:19:24] oh! found it
[21:19:25] it's https://github.com/torvalds/linux/blob/master/fs/nfs/nfs4state.c#L1160
[21:20:22] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1486308 (valhallasw) > neither of those resemble the format we see. The latter *does* resemble the format we see: ``` snprintf(buf, sizeof(buf), "%s-manager", rpc_...
[21:20:42] that’s it! But why...
[21:22:37] so it checks if the manager is running just before, and returns if it is
[21:23:46] I mean, why is it landing in the account log?
[21:23:48] but the task does get created, and then crashes immediately
[21:23:59] Ah, that would do it.
[21:24:10] So the messages reflect a real problem
[21:24:15] I think so, yes
[21:25:21] anyway, I'm off to bed
[21:25:53] one thing to consider is remounting nfs, but that probably has a fairly large effect
[21:26:18] but I'll disable pacct for now
[21:26:59] Sounds like I should’ve just rebooted it last week :/
[21:27:08] !log tools turned off process accounting on tools-login while we try to find the root cause of [[phab:T107052]]: accton off
[21:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:27:39] anyway, good night
[21:28:51] thanks for your hard work!
[21:32:18] gack
[21:32:20] back
[21:36:12] Coren: valhallasw`cloud andrewbogott so lzia is helping run a tool labs survey, I just chatted to her. She's drafting some questions, will email you'all (and Tim) to collaborate / finalize later on.
[21:36:26] ok
[21:36:51] Great!
[21:59:54] (CR) BryanDavis: [C: 1] "I needed this on 2 different deploy servers I setup in labs." [labs/private] - https://gerrit.wikimedia.org/r/225251 (owner: BryanDavis)
[22:00:50] (CR) Yuvipanda: [C: 2 V: 2] Add empty releases/id_rsa.upload [labs/private] - https://gerrit.wikimedia.org/r/225251 (owner: BryanDavis)
[22:01:16] ty YuviPanda
[22:02:01] there is another hack I needed as well but I need to look at beta cluster and see if merging it will make things melt thete
[22:02:05] *there
[22:22:56] hi andrewbogott, are you busy?
[22:23:42] GEOFBOT: somewhat; what’s up?
[22:25:15] andrewbogott: I was looking for a labs project that I could use to test mediawiki-api patches (especially since I am going to be away from a decent machine for a little bit soon)
[22:25:39] noticed you were an admin on the mediawiki-api project
[22:25:42] YuviPanda: Cool beans
[22:25:49] could you add me :3
[22:26:57] GEOFBOT: I don’t have a whole lot to do with that project; I’m an admin of most projects because I created them :) Who are the other admins? Maybe we can find one.
[22:27:49] andrewbogott: yurik and dr0ptp4kt
[22:27:51] I think
[22:28:26] YuviPanda: hm, you think that project is defunct? I guess they’re here; we’ll see.
[22:28:39] andrewbogott: has three instances yurik seems to be using to test thing?
[22:28:40] *things?
[22:28:57] andrewbogott, dr0ptp4kt and I were testing zero stuff in that proj
[22:29:09] still do on ocassion
[22:29:17] (despite the name mismatch)
[22:29:25] Um… weird.
[22:29:44] OK, so do you want to connect with GEOFBOT about testing api stuff, or should we kill that project and start with two new ones? [22:30:06] andrewbogott, well, is it possible to rename a project? [22:30:13] no [22:30:17] i would rather keep the existing instances [22:31:03] they could live under a different project - not a big deal [22:32:58] what’s on those instances that can’t be recreated? Aren’t they just one-offs? [22:33:15] It’s inconvenient for you to be squatting on that name now that someone else wants to actually use it for api testing :) [22:33:21] thanks for helping; apologies if i'm causing a lot of trouble or anything :P [22:34:42] GEOFBOT: no trouble; since that project seems to not be what you need it’s probably best for you to file a new project request with a confusingly-similar name :) [22:35:03] https://phabricator.wikimedia.org/T76375 [22:36:49] GEOFBOT: ping me when you have a task created and we’ll have a look. [22:38:52] 6Labs, 7Tracking: Mediawiki API testing labs project - https://phabricator.wikimedia.org/T106086#1486524 (10Sn1per) [22:39:15] andrewbogott: ^ (adaptation of my own request which was too personal) [22:40:40] Ah, so I see [22:40:44] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1486539 (10yuvipanda) [22:41:20] 6Labs, 7Tracking: Mediawiki API testing labs project - https://phabricator.wikimedia.org/T106086#1486540 (10Andrew) Weirdly the mediawiki-api project turns out to be used for testing Zero. I've chided the users as needed, but we'll need to create a new project with a new name for SN1per. [22:41:36] GEOFBOT: I need to step away but I’ll try to catch up with this later today. Or YuviPanda may want to follow up. 
[22:41:57] thanks :) [22:42:19] 6Labs, 7Tracking: Mediawiki API testing labs project - https://phabricator.wikimedia.org/T106086#1486541 (10yuvipanda) Chiding intensifies..... /me stares at @yurikk intensely. BAD YURIK, BAD. Anyway, I can create a project named mw-api-testing if you'd like. [22:42:49] GEOFBOT: i can create mw-api-testing if you'd like now? [22:43:00] YuviPanda: that would be great! thanks :D [22:43:10] 6Labs, 7Tracking: Mediawiki API testing labs project - https://phabricator.wikimedia.org/T106086#1486542 (10Sn1per) [22:43:14] * yurik hides [22:46:39] 6Labs, 10Tool-Labs: Rewrite the meta_p table populating code to python and have it run on a cron - https://phabricator.wikimedia.org/T107094#1486552 (10yuvipanda) 3NEW [22:47:31] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1486562 (10yuvipanda) [22:47:32] 6Labs, 7Tracking: Mediawiki API testing labs project - https://phabricator.wikimedia.org/T106086#1486559 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done! [22:47:35] GEOFBOT: done [22:47:36] ! [22:47:36] hello :) [22:48:46] YuviPanda: thanks :D [22:48:50] GEOFBOT: yw [22:49:08] didn't know wm-bot could say hi [22:49:15] !brain [22:49:57] bah, !brain doesn't work for this channel [23:00:42] andrewbogott: can I consider https://phabricator.wikimedia.org/T105721 checked off from you? [23:00:43] Coren: ^ [23:01:11] YuviPanda: I can't think of anything else. 
[23:09:08] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1486706 (10Ricordisamoa) [23:14:01] 6Labs, 10Labs-Infrastructure: create database view for azbwiki and add domain azbwiki.labsdb to nameserver - https://phabricator.wikimedia.org/T107081#1486732 (10Ricordisamoa) [23:14:28] 6Labs, 6Design Research Backlog: Public IP and Wildcard DNS for REFLEX project - https://phabricator.wikimedia.org/T92273#1486736 (10ggellerman) [23:25:58] ssh: connect to host sn1per-api.eqiad.wmflabs port 22: Connection timed out -- doing something wrong? [23:29:36] "Could not request certificate: Connection refused" [23:30:00] andrewbogott ^ looks like new instance creation failure... [23:30:41] my first instance had the timed out error so i deleted and made another one and it is giving the ssl error thingy [23:30:48] let me try making an instance with a different name [23:30:49] hmm [23:31:01] GEOFBOT: are these debian or ubuntu? [23:31:08] ubuntu [23:31:10] 14 [23:33:02] YuviPanda: I made another instance with a different name and it works fine (so far); i think the issue was that I already had an instance with that name that I deleted and re-made another instance with the same name immediately afterward [23:34:53] GEOFBOT: that might be it, yeah [23:35:12] am I supposed to wait for the next 30 min interval before ssh'ing in? [23:35:59] no, about 5mins usually [23:36:38] because I read somewhere in the docs about a 30 minute timeout and I thought that may be related to the ssh timing out [23:38:50] 6Labs, 7Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1486889 (10yuvipanda) Wonderful! Quotas are probably not going to work, so that'll be the last resort. However, I think just identifying and informing users of intense queries will help a lot. We... 
[23:38:57] GEOFBOT: nope, that isn't true [23:39:15] my instance isn't returning pings :P [23:39:31] probably did something wrong [23:39:31] sigh [23:39:35] GEOFBOT: what's the instance name? [23:39:42] sn1per-tests [23:44:28] YuviPanda: what project name? [23:44:41] andrewbogott: mw-api-tests [23:47:19] mw-api-testing you mean? [23:49:08] GEOFBOT: you need to open port 22 in your default security group. That’s ssh. [23:49:19] Did you remove it when adding 80, or was it empty when you started? [23:49:44] It was empty. I think I added 22, but it wasn't making much of a difference, so I removed it [23:49:46] I'll add it again [23:51:56] works for me, now. [23:52:57] huh, turns out I had screwed up the CIDR, thanks andrewbogott and YuviPanda :) [23:53:16] silly me
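Note on the CIDR mix-up above: the log does not say which CIDR value was wrong, so the following is only a minimal sketch of how a mistyped prefix on a port 22 rule can silently exclude the client. The helper `rule_allows` and the address `10.68.16.5` are hypothetical and used for illustration only; this is not how the Labs security groups are actually evaluated, just the underlying "is this address inside that range" check done with Python's standard `ipaddress` module.

```python
import ipaddress


def rule_allows(source_ip: str, rule_cidr: str, rule_port: int, port: int = 22) -> bool:
    """Return True if a rule (one CIDR, one port) would admit source_ip on port.

    Hypothetical helper for illustration; real security groups also carry
    protocol and port-range fields that are ignored here.
    """
    network = ipaddress.ip_network(rule_cidr, strict=False)
    return port == rule_port and ipaddress.ip_address(source_ip) in network


# Hypothetical client address, not taken from the log.
client = "10.68.16.5"

print(rule_allows(client, "0.0.0.0/0", 22))    # rule open to every IPv4 source
print(rule_allows(client, "10.0.0.0/8", 22))   # rule limited to the 10.x range
print(rule_allows(client, "10.0.0.0/32", 22))  # mistyped prefix: matches 10.0.0.0 only
print(rule_allows(client, "0.0.0.0/0", 80))    # a port 80 rule does not cover ssh
```

A rule with the right port but the wrong prefix length looks fine at a glance, which may be why adding 22 the first time "wasn't making much of a difference".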