[00:56:20] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:24:40] Hi, today I checked my cronjobs and found all of them commented out, with a comment that read "jobs DISABLED by bd808 on 2017-04-20". Have I missed anything? [01:27:20] You've only noticed 3 and a half months later? [01:27:49] Usually cronjobs are disabled because they're doing something wrong - running things in the wrong place, causing errors, using excessive amounts of resources [01:32:54] yeah, I was absent for a bit <_< [01:33:15] I also noticed an extra "-l hostname=tools-exec-1422" on the jsub commands that I didn't add [01:33:16] What's the tool name? [01:34:35] cobot [01:35:17] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cobot/SAL [01:35:26] 15:04 bd808: Disabled cron jobs [01:35:27] 13:11 chasemp: set crons to schedule on tools-exec-1422 only for testing. This tool is launching jobs that write and then read hundreds of megs in a few minutes. I caught it tripping up puppet on tools-exec-1437 [01:37:44] uh, thanks [01:38:06] is there any way to get email notifications about these things? I left the job that might be causing that disabled [01:38:21] You can add that page to your watchlist on wikitech [01:38:22] Then enable.. [01:39:04] https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal [01:39:10] "Email me when a page or a file on my watchlist has changed" [01:39:46] Thanks [01:40:59] PROBLEM - Puppet errors on tools-exec-1418 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:42:06] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1427 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:43:20] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [01:44:09] PROBLEM - Puppet errors on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:04:54] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:16:08] Reedy: thanks for helping Polsaker find the logs [02:16:22] Polsaker: I don't remember any more than what we wrote there [02:17:04] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1427 is OK: OK: Less than 1.00% above the threshold [0.0] [02:17:54] bd808: I just re-enabled two of the jobs that use the least amount of resources [02:18:11] And I thought "bd808" was a server's name :p [02:18:15] heh [02:18:22] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [02:18:53] if you can figure out how to slow down the i/o needs of the other job it would probably be fine to turn back on too [02:19:10] RECOVERY - Puppet errors on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [02:19:48] or if you need faster io and can do it in /tmp that works too [02:20:44] basically our NFS servers get overloaded and that slows everything down. 
when it happens we go looking for things that are causing load [02:20:54] not a very nice system, but its what we have right now [02:21:03] RECOVERY - Puppet errors on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [02:22:11] (03PS1) 10BryanDavis: Change ToolInfo authors to list of strings [labs/striker] - 10https://gerrit.wikimedia.org/r/370134 (https://phabricator.wikimedia.org/T149458) [02:22:12] (03PS1) 10BryanDavis: Encode toolinfo.json as utf8 [labs/striker] - 10https://gerrit.wikimedia.org/r/370135 [02:32:11] (03CR) 10BryanDavis: Add toolinfo.json style data (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/353909 (https://phabricator.wikimedia.org/T149458) (owner: 10BryanDavis) [02:36:17] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [02:45:47] (03PS1) 10BryanDavis: [WIP] DDL changes for tool account management [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/370139 (https://phabricator.wikimedia.org/T149458) [03:09:57] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0] [03:14:34] (03PS2) 10BryanDavis: [WIP] DDL changes for tool account management [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/370139 (https://phabricator.wikimedia.org/T149458) [03:26:45] (03PS1) 10BryanDavis: Change default mysql collation to utf8mb4_unicode_ci [labs/striker] - 10https://gerrit.wikimedia.org/r/370143 [03:33:13] bd808: about T172478 I might have found the cause of the issue, but I need to understand better how it's configured that way, and nobody from the task is replying to pings [03:33:14] T172478: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478 [03:33:57] XioNoX: nice. I'm just going to send a note to the list so when someone comes yelling we can point to it :) [03:34:22] Its pretty late for mukunda now [03:35:53] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [03:37:54] the people involved with the migration i think have gone to bed [03:38:21] I think you're better off waiting til the morning [03:41:52] bd808: fixed [03:42:50] XioNoX: yeah? sweet. Let me check it out [03:43:05] commented on the task as well [03:43:19] "ssh: connect to host git-ssh.wikimedia.org port 22: Connection refused" [03:43:47] bd808: what are you trying to do? [03:43:56] git pull a repo [03:44:29] I'll paste a -vvv dump from ssh [03:44:53] XioNoX: https://phabricator.wikimedia.org/P5853 [03:45:13] ah [03:45:19] seems like 2 unrelated issues [03:45:32] bd808: what server is behind git-ssh ? [03:45:47] that is a fine question :) [03:46:29] phab1001-vcs.eqiad.wmnet? [03:47:04] so Connection refused usually means routing OK but nothing listening on that port (fyi) [03:47:37] oh, it's an LVS [03:48:40] The lvs part may be brand new from the switch [03:48:47] I'm unsure about that [03:50:12] yeah, I'm trying to follow the clues [03:55:04] XioNoX: from backscroll in -operations it looks like maybe the backend is phab1001.eqiad.wmnet [03:55:14] yeah it is [03:55:31] the IPs match [03:55:55] *nod* it looks like they fought with pybal for a while too. [03:56:26] y'all will probably figure it out as soon as Daniel or Mukunda are around to fill in the blanks [03:56:37] yeah, I need to figure out which LVS is hosting the public IP [03:57:03] ah lvs1002.wikimedia.org. 
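A minimal sketch of what the advice above might look like in practice for a tool like cobot: the "-l hostname=" restriction is the one chasemp added to the crontab for testing, and the /tmp scratch directory follows bd808's suggestion to keep heavy intermediate I/O off the shared NFS home. The job script name and paths here are made up for illustration.

```
# Hypothetical crontab entry pinning a grid job to one exec node for testing,
# as chasemp did for this tool (the script name is an example):
#   0 * * * * jsub -l hostname=tools-exec-1422 $HOME/update.sh

# Inside the job, keep the "write and then read hundreds of megs" working set
# on node-local /tmp instead of the NFS-backed tool home:
SCRATCH=$(mktemp -d /tmp/cobot-scratch.XXXXXX)
trap 'rm -rf "$SCRATCH"' EXIT        # always clean up the scratch space on exit
# ... generate and consume the large temporary files under "$SCRATCH" ...
# Only copy the small, final results back to the NFS home at the end.
```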
[04:08:08] back, what's up [04:08:26] why did we get paged about phd now [04:08:53] ah cool [04:09:46] first fixing the paging issue [04:15:54] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0] [04:25:34] 10VPS-project-Wikistats: List of largest MediaWikis does not update Wikias - https://phabricator.wikimedia.org/T172481#3499774 (10LWChris) [05:27:18] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [06:02:18] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [07:42:50] 10Tools: Tool "commons-coverage" loads map tiles from OSM - https://phabricator.wikimedia.org/T172395#3499954 (10Emijrp) @Ricordisamoa Thanks. Then I think I don't have to use it in my code anymore. [07:55:31] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3496182 (10TheDJ) ``` 2017-08-01 13:05:33: (server.c.1444) [note] sockets disabled, connection limit reached ``` I have restarted the web service. I also fixed a problem with it's ErrorHandler that had been spamming the error lo... [07:55:44] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3499987 (10TheDJ) 05Open>03Resolved a:03TheDJ [08:02:47] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500000 (10Fnielsen) Thanks for reminding me of this. I will change it to a toolforge serving. [08:43:10] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3500140 (10Supernino) Thank you :) [08:45:59] 10Tools: Tool "comidentgen" loads bootstrap and jquery from ajax.googleapis.com and bootstrapcdn - https://phabricator.wikimedia.org/T172391#3500149 (10Samtar) p:05Triage>03Normal Thanks for the task, I'll get this one done today and check the list for any other of my tools [09:01:53] 10Cloud-Services, 10DBA: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3500190 (10Marostegui) a:03Marostegui I have sanitized all the hosts and ran a check_private data there. I have also registered myself and checked that on the labs hosts my user has been... [09:06:53] 10cloud-services-team: Add socket parameter to maintain-views script - https://phabricator.wikimedia.org/T172496#3500217 (10Marostegui) [09:06:59] 10cloud-services-team: Add socket parameter to maintain-views script - https://phabricator.wikimedia.org/T172496#3500229 (10Marostegui) p:05Triage>03Normal [09:48:12] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500351 (10Fnielsen) This is now fixed with https://github.com/fnielsen/cvrminer/commit/49975504a590a1ae53e2e8cc81aadea277cc5600 I have checked at https://tools.wmflabs.org/cvrminer/ [09:51:44] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500388 (10Fnielsen) Can you confirm that it no longer loads from third-party or should I just close the task myself? 
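For the string of "Tool X loads assets from ..." tasks above, one rough way to answer Fnielsen's "can you confirm it no longer loads from third parties" question is to fetch the tool's page and list any asset URLs that do not point back at Wikimedia-run hosts. This is only a sketch: the tool URL and the allowed-domain pattern are examples, and it only catches resources referenced directly in the HTML; anything pulled in by JavaScript still needs a look in the browser's network console (where zhuyifei1999 also spotted the 404s).

```
# Rough check for third-party resource loads on a tool's page (illustrative;
# adjust the tool URL and the allowed-domain pattern as needed).
curl -s https://tools.wmflabs.org/cvrminer/ \
  | grep -Eo '(src|href)="https?://[^"]+"' \
  | grep -Ev 'wmflabs\.org|wikimedia\.org|wikipedia\.org|mediawiki\.org' \
  | sort -u
# Empty output means the page's HTML itself references no external hosts.
```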
[09:52:26] PROBLEM - Puppet errors on tools-bastion-03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [10:27:25] RECOVERY - Puppet errors on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [10:38:19] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:03:19] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [12:30:36] 10Tools, 10Toolforge-standards-committee, 10Privacy: Hunt for Toolforge tools that loads resources from third party sites - https://phabricator.wikimedia.org/T172065#3500698 (10zhuyifei1999) [12:30:36] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500696 (10zhuyifei1999) 05Open>03Resolved LGTM, though in the console I see a lot of 404s. [12:48:27] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500728 (10Fnielsen) Yes, the 404s is a known problem, that I should fix [13:38:13] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500901 (10zhuyifei1999) @Huji Your query https://quarry.wmflabs.org/query/20697 is affected by this bug. [13:38:26] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500905 (10zhuyifei1999) p:05Low>03Triage [13:44:25] 10Tools: Tool "dimensioner" loads jquery and bootstrap from bootstrapcdn and ajax.googleapis.com - https://phabricator.wikimedia.org/T172516#3500910 (10zhuyifei1999) [13:48:19] 10Tools: Tool "editathonstat" loads assets from google - https://phabricator.wikimedia.org/T172517#3500927 (10zhuyifei1999) [13:51:40] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500951 (10Huji) Noted, thanks! [13:52:01] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3500955 (10zhuyifei1999) [13:52:45] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3500970 (10zhuyifei1999) The tool maintainer is Kentaur. I'm unable to find their Phabricator username. [13:59:22] 10Tools: Tool "etytree" loads jquery, jquery-mobile, and jquery-ui from code.jquery.com, mixed http/https - https://phabricator.wikimedia.org/T172519#3500978 (10zhuyifei1999) [14:04:58] 10Tools: Tool "everythingisconnected" loads assets from google - https://phabricator.wikimedia.org/T172522#3501026 (10zhuyifei1999) [14:09:13] 10Tools: Tool "fastilybot" loads assets from google and bootstrapcdn - https://phabricator.wikimedia.org/T172524#3501052 (10zhuyifei1999) [14:25:19] 10Tools: Tool "gdk-artists-research" loads assets from freegeoip.net - https://phabricator.wikimedia.org/T172527#3501131 (10zhuyifei1999) [14:35:51] 10Tools: Tool "gendergapdashboard" loads assets from many sites - https://phabricator.wikimedia.org/T172530#3501180 (10zhuyifei1999) [14:36:45] 10Tools: Tool "gendergapdashboard" loads assets from many sites - https://phabricator.wikimedia.org/T172530#3501194 (10zhuyifei1999) [14:38:47] zhuyifei1999_: you are a phabricator machine, thanks so much for doing all that, are you going to be in Montreal for wikimania by chance? [14:39:13] no :/ [14:39:25] I went there years ago though [14:39:45] (I mean Montreal) [14:40:39] ah bummer! 
Well hopefully I'll see you sometime this year but either way I appreciate what you're doing [14:40:48] btw, I see a tool by WMDE that simply redirects to a WMDE site. [14:41:00] haven't created a ticket yet though. shall I? [14:41:49] I'll be in illinois starting from mid-september [14:42:43] *mid-august [14:45:48] zhuyifei1999_: I think the official thought has been that redirecting from Toolforge to prod is technically ok as it's to an in theory more trusted space, but I don't know about WMDE's privacy policy and all that [14:46:41] well yeah. if they redirect to google I would yell at them :P [14:48:39] heh [14:51:24] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501239 (10Andrew) [14:52:41] 10Tools: Tool "geoplotter" loads assets from google, bootstrapcdn, cloudflare, and cartocdn; mixed http/https - https://phabricator.wikimedia.org/T172533#3501275 (10zhuyifei1999) [14:57:28] 10Tool-Gerrit-Patch-Uploader: Tool "gerrit-patch-uploader" loads fork-me-on-github ribbon from Amazon AWS - https://phabricator.wikimedia.org/T172535#3501310 (10zhuyifei1999) [15:22:35] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login, OAuth error: API error mwoauth-invalid-authorization - https://phabricator.wikimedia.org/T136114#3501435 (10MarcoAurelio) >>! In T136114#3493304, @yuvipanda wrote: > @MarcoAurelio - is that showing up at paws.tools.wmflabs.org and... [15:24:44] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3501439 (10zhuyifei1999) a:03WikedKentaur [15:32:05] ☁️ Wikimedia Cloud Services (wikitech.wikimedia.org) | Use !help to find assistance | Status: Move to new puppetmaster in progress | missing data? see T169774 | Create a phab task if you don’t get your problem resolved in chat! | Channel logs: https://wm-bot.wmflabs.org/logs/%23wikimedia-cloud/ | Code of Conduct: https://www.mediawiki.org/wiki/CoC | on call: madhuvishy or !help [15:32:05] T169774: Cleanup: 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774 [15:32:22] For people following along… I'm in the process of moving all existing hosts to a new puppetmaster. [15:32:42] step one: disable puppet on all hosts so I can avoid collisions with existing puppet runs when I switch things over... [15:34:24] Cyberpower678: will reach your IABot tomorrow [15:34:43] zhuyifei1999_: ? [15:34:59] * paladox wonders how you got the full puppet message to show "PROBLEM - puppet on phabricator.phabricator.eqiad.wmflabs is WARNING: WARNING: Puppet is currently disabled, message: disabled during transition to new puppet master 2017-08-04, last run 10 minutes ago with 0 failures" [15:35:11] when i do it it only cuts half of [15:35:14] https://phabricator.wikimedia.org/T172065 [15:35:24] Cyberpower678: ^ [15:36:54] zhuyifei1999_: oh right. I forgot about that, again... [15:37:18] I'm not pushing any updates, until I'm able to fix that other issue. [15:37:47] paladox: where do you see that message? 
I only know about `puppet agent -tv` and `puppet apply` [15:37:51] Cyberpower678: k [15:38:07] paladox: I'm not sure I understand the question, but… I did "puppet agent --disable 'reason I am disabling'" [15:38:47] ah [15:38:51] you did it in strings [15:39:01] i just did puppet agent --disable [15:39:42] zhuyifei1999_ i see it in #wikimedia-bots-testing and #wikimedia-ai :) [15:40:17] hm okay [15:42:15] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501452 (10Andrew) [15:42:58] Now I'm switching the puppetmaster on all hosts that don't use a project-local puppetmaster, with this mouthful: [15:42:59] grep 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf && sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf && rm -rf /var/lib/puppet/ssl && puppet agent --enable && puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=`hostname -f` --server=labs-puppetmaster.wikimedia.org [15:43:16] (that will leave puppet disabled on hosts with project-local puppetmasters; I'll fix that next) [15:48:32] 10Cloud-Services, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3501476 (10RobH) a:05RobH>03Cmjohnson This does indeed need hardware raid setup, please setup in a large raid10 of all disks via crash cart, then I can install. Al... [16:14:45] now re-enabling puppet everywhere (including locally hosted things) [16:17:49] chasemp: Heyas, got in the systems on https://phabricator.wikimedia.org/T162486 [16:17:52] and im making a racking task [16:17:59] What vlan did you guys need these two hosts in? [16:18:09] (or any other cloud opsen would likely know ;) [16:18:31] robh: same as normal labvirt, two interfaces. eth0 is in labs-hosts, eth1 is a trunk that allows labs-instances-b [16:19:02] chasemp: ok, cuz the systems they are replacing are in labs-support1-c-eqiad [16:19:07] not in labvirt vlans [16:19:19] but i know a lot of stuff is shifting around [16:19:43] robh: yep, these are big virts that are going to replace the functionality of labsdb1004-1007 w/ instances [16:19:45] but now these need to be in labs-hosts1-b-eqiad? [16:19:47] big as in storage [16:19:49] cool [16:19:49] yes [16:19:56] ok, so row b restricted for now [16:20:01] it is [16:20:08] glad i checked, thanks! [16:20:15] ditto [16:20:20] on thanks :) [16:22:32] so these have both 10 and 1 Gbit [16:22:41] but pretty sure all of labs is doing 1gbit for now since we dont have the networking support [16:22:48] correct me if thats wrong ;D [16:24:05] robh: both sides being 1G for now is ok, we'll deal w/ that when it comes to it [16:24:45] I think the 10G inclusion is part of a historical mirroring of old labvirt specs and it carried over [16:24:53] nice to have onboard already but not a big concern atm [16:25:16] 10Cloud-Services, 10Operations, 10procurement: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:25:21] 10Cloud-Services, 10Operations, 10ops-eqiad: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:25:28] ok, feel free to double check that [16:25:43] (task) for accuracy [16:25:49] oh, i forgot, what os? [16:26:25] robh: all virts are still ubuntu trusty [16:26:43] OK, that's all the easy cases. 
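For readability, here is the same sequence andrewbogott pastes above, written out as a commented script. It is only a restatement of the quoted one-liner using the standard puppet agent command-line options, not an official migration tool.

```
#!/bin/bash
# Switch an instance from the old central puppetmaster to the new one (run as root).
# Only act on hosts still pointed at the old master:
grep -q 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf || exit 0

# Rewrite the master name in puppet.conf:
sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf

# Drop the certificates signed by the old master so a fresh one is requested:
rm -rf /var/lib/puppet/ssl

# Re-enable the agent (it was disabled earlier with `puppet agent --disable '<reason>'`)
# and do a single verbose run against the new master:
puppet agent --enable
puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff \
    --waitforcert=10 --certname="$(hostname -f)" --server=labs-puppetmaster.wikimedia.org
```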
Now, picking up stragglers that didn't switch over (probably because puppet can't actually run at all.) Doing this with "grep labs-puppetmaster-eqiad.wikimedia.org /etc/puppet/puppet.conf.d/10-main.conf && hostname -f" [16:26:50] 10Cloud-Services, 10Operations, 10ops-eqiad: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:26:54] duly noted, i assumed but wanted to check ;] [16:27:32] robh: cool man thanks for handling [16:28:07] welcome =] [16:28:34] huh, never seen this before [16:28:37] https://www.irccloud.com/pastebin/HUjt6BH2/ [16:29:40] https://www.irccloud.com/pastebin/C2kjXQ4o/ [16:29:47] oops, double paste [16:30:22] andrewbogott i thought puppet was uninstalled on there? [16:30:23] there's a task for huggle [16:30:39] https://phabricator.wikimedia.org/T166588 [16:30:41] maybe so! If there's no puppet then I should probably just delete the instance [16:30:59] ok, thanks for the link [16:31:09] I'll ignore for now [16:31:15] your welcome :) [16:31:34] yeah, that huggle instance is a problem child :/ [16:39:38] There are 34 VMs which should be managed by the central puppetmaster but which have broken puppet. [16:39:39] https://etherpad.wikimedia.org/p/labpuppetmaster1001-stragglers [16:39:51] So, this is how I'll be spending the rest of my day :) [16:46:03] andrewbogott: thanks man! did you log the time anywhere that you started the cut over? [16:46:32] chasemp: not really, I couldn't decide if syslog or project log so I just narrated here instead [16:46:37] for short-term records at least [16:46:42] kk got it [16:54:22] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501682 (10bd808) p:05Triage>03High [16:58:08] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501686 (10bd808) When @andrew went through all #cloud-vps vms today to switch them to the new project-wide puppetmaster, this instance was the only vm that is not directly or indirectl... [17:05:03] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501694 (10bd808) [17:18:24] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501730 (10bd808) [17:25:55] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501753 (10bd808) `::profile::puppetmaster::common` seems to be applied somehow via [[https://tools.wmflabs.org/openstack-browser/puppetclass/role::puppet_compiler... [17:29:55] !log grantreview rebooting grantreview-dev because its dns/dhcp is messed up [17:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Grantreview/SAL [17:37:44] addshore, harej, librarybase-reston-01.librarybase.eqiad.wmflabs has a full drive and is causing me some trouble, could one of you have a look? [17:37:54] tarrow: ^^ [17:37:56] the usage is all under www [17:38:03] so I'm reluctant to just delete things as I would under /var/log [17:38:20] yeah, good thinking. I'll look [17:38:24] :) [17:40:20] addshore: can you log in? I get Permission denied public key (perhaps because of a full disk?) [17:40:45] nope, I cant login either [17:41:04] andrewbogott: any chance you could try and delete something that looks useless? 
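Before anyone could even log in to librarybase-reston-01, the underlying question is simply what is filling the disk. A generic sketch for that first step, run as root; the /var/www path is an assumption based on "the usage is all under www":

```
df -h /                                                # confirm the root filesystem really is full
du -xh /var/www 2>/dev/null | sort -rh | head -n 20    # largest directories under the web root
find /var/www -xdev -type f -size +500M -exec ls -lh {} + 2>/dev/null   # individual huge files
```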
:D [17:41:06] andrewbogott: can you delete something to make enough space for us to log in? [17:41:14] heh [17:41:24] 10Tools: templatecount tool inaccessible due to 502 Bad Gateway - https://phabricator.wikimedia.org/T172549#3501794 (10Jeff_G) [17:41:40] I already deleted a bunch of things, it just filled up again instantly [17:41:43] but yes, I'll have another go [17:42:41] try now, quickly [17:42:45] tarrow: ^ [17:43:12] trying [17:43:46] ok, I deleted today's apache logs too :/ [17:43:50] could you bring the webservice down and then delete some stuff? [17:44:06] no joy yet for me [17:44:48] PROBLEM - Puppet errors on tools-puppetmaster-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:45:05] 'some stuff'? [17:47:33] Still no joy for me D: [17:48:00] without being able to log in I can't give you the exact path but... there should be a subdirectory of the webserver root which has dump in it [17:48:07] I imagine these aren't being rotated [17:48:12] or the rotation has failed somehow [17:48:36] probably called rdf [17:48:42] or rdf_dumps or something [17:49:15] html/rdf/librarybase-rdf-20170713.rdf ? [17:49:25] yeah, that's fine to delete [17:49:39] is there just one dump? [17:49:42] lots! [17:49:45] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:49:51] 11G worth [17:49:56] feel free to trash all by the last few days [17:50:00] but* [17:50:36] say leave me the most recent 3 days? [17:50:46] ok, I left 5 [17:50:57] great! I'll then figure out why they aren't rotating [17:51:05] puppet seems happier now so I'll log off and leave this to you. Thanks [17:51:25] great! I still can't login. I guess because puppet wasn't able to run [17:51:40] I imagine it will sort out in the next 5 mins [17:52:53] well, puppet isn't exactly 'happy', you should probably check warnings and such [17:54:32] tarrow: I'm in now :) [17:54:53] me too [17:55:07] cool, ill leave you to sort it :) [17:58:37] We (I) should find out what harej is up to with it. I'm not sure if much is actively changing there right now. We might be able to scale it back a bit [17:58:52] I'm not really doing anything with it at the moment [18:09:54] PROBLEM - Puppet errors on tools-puppetmaster-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [18:14:45] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [18:24:04] andrewbogott: I guess with unreachable instances you mean instances shutdown etc? [18:24:48] RECOVERY - Puppet errors on tools-puppetmaster-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:24:54] RECOVERY - Puppet errors on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:51] Sagan: being shutdown is an example of a way to be unreachable [18:34:47] andrewbogott: I've got an instance which had that status. Do I need to run puppet first, or do you need to take some steps first? 
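Returning to the librarybase dump cleanup above: once the old RDF dumps are identified as the culprit, the "figure out why they aren't rotating" follow-up could be as simple as a nightly pruning job. A hypothetical sketch keeping the five newest dumps; the path and filename pattern are guesses based on the single example mentioned in the log, so verify them (and dry-run with echo) before trusting it.

```
# Hypothetical crontab entry: every night at 03:00, delete all but the five newest dumps.
# Safe with these date-stamped names (no spaces or newlines in the filenames).
0 3 * * * cd /var/www/html/rdf && ls -1t librarybase-rdf-*.rdf | tail -n +6 | xargs -r rm --
```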
[18:35:14] it's cr1.codereview.eqiad.wmflabs [18:35:50] (I've started it some minutes ago) [18:35:57] Sagan, I pasted the command above, let's see… [18:35:58] grep 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf && sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf && rm -rf /var/lib/puppet/ssl && puppet agent --enable && puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=`hostname -f` --server=labs-puppetmaster.wikimedia.org [18:36:01] that should do it [18:36:28] andrewbogott: as root I guess? [18:36:31] yeah [18:36:42] and if it goes well, mark it on the etherpad so it saves me a visit :) [18:37:23] andrewbogott: where do I find the etherpad? :) [18:37:38] Oh, I thought that was what you were asking about in the first place [18:37:43] https://etherpad.wikimedia.org/p/labpuppetmaster1001-stragglers [18:38:07] no, I just read the mail which says that unreachable instances need extra work, that's why I asked if that instance is affected :) [18:38:13] ok, thanks [18:38:13] yeah, it works :) [18:38:49] I'm out for a bit, back in maybe 45 minutes [18:39:18] I've moved it to that fixed section :) [18:40:19] let's see if other projects are affected where I'm admin as well [18:41:10] andrewbogott what if we had an instance shut off when you did the migration? [18:41:18] What do we put in puppet.conf? [18:41:34] oh sorry for ping just realised it is connected to its own puppet master [18:45:45] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [19:06:08] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502028 (10bd808) Caused by a refactoring in progress for `::role::puppet_compiler` by @Joe: https://gerrit.wikimedia.org/r/#/c/370205/1 [19:07:30] andrewbogott: for the record: I added notes to the instances I tried but failed [19:07:39] (at the etherpad) [19:09:31] andrewbogott phab-01 uses its own puppetmaster puppet-phabricator [19:24:43] phab-test.contributors.eqiad.wmflabs (i have no access to it but the labs class was removed) [19:24:51] needs moving to phabricators prod role [19:24:54] which works. [19:25:06] though will need tweaking from the labs one [19:25:15] (i mean configs in hiera) [19:26:16] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502061 (10bd808) a:03Joe [19:32:10] Sagan: did you shut it back down again? [19:32:47] andrewbogott: I think so. or should I keep it running? [19:33:16] the instance is not permanently used. or is it better to keep them running anyway? [19:34:17] better that they are running or deleted [19:35:01] a halted instance won't get config changes and may fail to be reachable after booting [19:36:09] Sagan: I definitely can't fix it if it's off :) Probably better to keep it running unless it's actually useless in which case you can delete [19:36:38] andrewbogott: ok. I started it now again :) [19:38:31] Sagan: looks fine to me, but let's leave it up [19:38:41] ok :) [19:45:47] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:29] are there any known problems with packet loss? 
I'm monitoring my labs services from an external server too, and every day I get at least one check which returns a packet loss [19:47:48] 16% today, 16% yesterday, 28% on the second [19:47:59] not sure if that is me or labs, but other checks are ok [19:52:08] Sagan: we'd need a lot more information to debug that at all [19:52:27] where is the check from and to and at what time etc [19:52:40] packet loss in the internet is normal as long as its not persistent [19:53:29] bd808: hm, ok, then I'd say it's nothing important. It's just one check that showed up, and the check 30 sec later showed packet loss = 0% [19:53:45] as long as I don't get the 80% I had once, again ;) [19:59:13] (03PS1) 10Andrew Bogott: Add profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/370247 [20:01:09] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/370247 (owner: 10Andrew Bogott) [20:13:16] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3502143 (10Andrew) [20:13:20] Sagan: 80% for a single check is probably not itself a big deal. 80% for *all* tcp/icmp comms is :) [20:13:37] ah, yeah :D [20:14:03] I guess one reason why icinga only sends an alert after the 5th failed check [20:15:40] Sagan you can increase it if you want :) [20:15:51] though i've never experienced problems with the ping :) [20:15:56] paladox: heh, only one check failed [20:16:02] yep [20:16:37] paladox: my icinga is located at my own server, so the ping needs a bit longer, that's why I already increased the ping wtime [20:16:48] oh i see [20:18:36] paladox: 101 ms response time is probably bad between servers in one DC, but not between germany and usa [20:18:41] oh [20:19:08] I thought we get connected through esams then into the us? [20:21:26] if we use labs too? [20:21:35] I guess that's only for prod [20:22:01] anyway, the normal ping is between 98 and 101 [20:22:12] *the average [20:47:52] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502296 (10Dzahn) ``` MariaDB [wikistats]> select id,prefix,description,statsurl from wmspecials where prefix like "wikimania%"; +-----+---------------+----------------+--------------------------------------------... [20:49:32] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502297 (10Dzahn) http://wikistats.wmflabs.org/display.php?t=wx and click on description column to sort alpha [20:51:54] (03Draft1) 10Paladox: Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 [20:51:56] (03PS2) 10Paladox: Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 [20:51:59] (03CR) 10Paladox: [V: 032 C: 032] Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 (owner: 10Paladox) [20:53:14] (03Draft3) 10Zppix: rm dupe checks on gerrit-mysql [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370255 [20:53:39] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502299 (10Dzahn) 05Open>03Resolved ``` MariaDB [wikistats]> insert into wmspecials (prefix,description,statsurl) values ("wikimania2017", "Wikimania 2017", "https://wikimania2017.wikimedia.org/w/api.php?actio... 
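Back on the packet-loss question earlier in the hour: the point made above is that a single bad sample means little, so the useful check is a longer probe run against the monitored endpoint. A sketch with standard tools (the target host is just an example):

```
# Summary loss percentage and round-trip times over 100 probes:
ping -c 100 -q tools.wmflabs.org

# Per-hop loss, which helps tell a problem near the monitoring server
# from one at the far end:
mtr --report --report-cycles 100 tools.wmflabs.org
```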
[20:53:41] (03CR) 10Paladox: [V: 032 C: 032] rm dupe checks on gerrit-mysql [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370255 (owner: 10Zppix) [21:01:07] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502311 (10Reedy) [21:03:29] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502330 (10ShakespeareFan00) As originator of the prior query can confirm this. Currently the query in 20709 is pared down to a minimal example. I was seeing a similar issue with a number of rows i... [21:07:39] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502350 (10chasemp) p:05Triage>03Normal @Reedy, any difference if you hit the in-progress new labsdb cluster @ labsdb-web.eqiad.wmnet? [21:08:18] 10VPS-project-Wikistats: Add hi.wikiversity to wikistats - https://phabricator.wikimedia.org/T171831#3502353 (10Dzahn) 05Open>03Resolved ``` MariaDB [wikistats]> insert into wikiversity (prefix,method) values ("hi","8"); MariaDB [wikistats]> update wikiversity set lang="Hindi",loclang="हिन&... [21:13:41] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502360 (10Reedy) Yup, that replica seems good... ``` reedy@tools-bastion-03:~$ mysql --defaults-file=$HOME/replica.my.cnf --host labsdb-web.eqiad.wmnet Welcome to the MariaDB monitor. Commands end... [21:19:11] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502380 (10Reedy) Seems this is another case of T138967 [21:25:25] !log contributors removed missing role role::phabricator::labs from puppet config on phab-test [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Contributors/SAL [21:29:42] !log otrs removing missing class role::otrs::webserver from otrs-memoryleak [21:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Otrs/SAL [21:31:43] !log packaging removing broken class role::builder from packager02 instance [21:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Packaging/SAL [21:33:56] !log swift removing broken package role::swift::storage from prefix swift-stretch-ms-me prefix. It was throwing "Error 400 on SERVER: Could not find data item swift::proxy::memcached_servers in any Hiera data file and no default supplied at /etc/puppet/modules/role/manifests/swift/storage.pp" which broke puppet runs entirely [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Swift/SAL