[00:56:20] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:24:40] Hi, today I checked my cronjobs and found all of them commented out, with a comment that read "jobs DISABLED by bd808 on 2017-04-20". Have I missed anything? [01:27:20] You've only noticed 3 and a half months later? [01:27:49] Usually cronjobs are disabled because they're doing something wrong - running things in the wrong place, causing errors, using excessive amounts of resources [01:32:54] yeah, I was absent for a bit <_< [01:33:15] I also noticed an extra "-l hostname=tools-exec-1422" on the jsub commands that I didn't add [01:33:16] What's the tool name? [01:34:35] cobot [01:35:17] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cobot/SAL [01:35:26] 15:04 bd808: Disabled cron jobs [01:35:27] 13:11 chasemp: set crons to schedule on tools-exec-1422 only for testing. This tool is launching jobs that write and then read hundreds of megs in a few minutes. I caught it tripping up puppet on tools-exec-1437 [01:37:44] uh, thanks [01:38:06] is there any way to get email notifications about these things? I left the job that might be causing that disabled [01:38:21] You can add that page to your watchlist on wikitech [01:38:22] Then enable.. [01:39:04] https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal [01:39:10] "Email me when a page or a file on my watchlist has changed" [01:39:46] Thanks [01:40:59] PROBLEM - Puppet errors on tools-exec-1418 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:42:06] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1427 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:43:20] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [01:44:09] PROBLEM - Puppet errors on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:04:54] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:16:08] Reedy: thanks for helping Polsaker find the logs [02:16:22] Polsaker: I don't remember any more than what we wrote there [02:17:04] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1427 is OK: OK: Less than 1.00% above the threshold [0.0] [02:17:54] bd808: I just re-enabled two of the jobs that use the least amount of resources [02:18:11] And I thought "bd808" was a server's name :p [02:18:15] heh [02:18:22] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [02:18:53] if you can figure out how to slow down the i/o needs of the other job it would probably be fine to turn back on too [02:19:10] RECOVERY - Puppet errors on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [02:19:48] or if you need faster io and can do it in /tmp that works too [02:20:44] basically our NFS servers get overloaded and that slows everything down. 
when it happens we go looking for things that are causing load [02:20:54] not a very nice system, but its what we have right now [02:21:03] RECOVERY - Puppet errors on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [02:22:11] (03PS1) 10BryanDavis: Change ToolInfo authors to list of strings [labs/striker] - 10https://gerrit.wikimedia.org/r/370134 (https://phabricator.wikimedia.org/T149458) [02:22:12] (03PS1) 10BryanDavis: Encode toolinfo.json as utf8 [labs/striker] - 10https://gerrit.wikimedia.org/r/370135 [02:32:11] (03CR) 10BryanDavis: Add toolinfo.json style data (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/353909 (https://phabricator.wikimedia.org/T149458) (owner: 10BryanDavis) [02:36:17] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [02:45:47] (03PS1) 10BryanDavis: [WIP] DDL changes for tool account management [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/370139 (https://phabricator.wikimedia.org/T149458) [03:09:57] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0] [03:14:34] (03PS2) 10BryanDavis: [WIP] DDL changes for tool account management [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/370139 (https://phabricator.wikimedia.org/T149458) [03:26:45] (03PS1) 10BryanDavis: Change default mysql collation to utf8mb4_unicode_ci [labs/striker] - 10https://gerrit.wikimedia.org/r/370143 [03:33:13] bd808: about T172478 I might have found the cause of the issue, but I need to understand better how it's configured that way, and nobody from the task is replying to pings [03:33:14] T172478: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478 [03:33:57] XioNoX: nice. I'm just going to send a note to the list so when someone comes yelling we can point to it :) [03:34:22] Its pretty late for mukunda now [03:35:53] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [03:37:54] the people involved with the migration i think have gone to bed [03:38:21] I think you're better off waiting til the morning [03:41:52] bd808: fixed [03:42:50] XioNoX: yeah? sweet. Let me check it out [03:43:05] commented on the task as well [03:43:19] "ssh: connect to host git-ssh.wikimedia.org port 22: Connection refused" [03:43:47] bd808: what are you trying to do? [03:43:56] git pull a repo [03:44:29] I'll paste a -vvv dump from ssh [03:44:53] XioNoX: https://phabricator.wikimedia.org/P5853 [03:45:13] ah [03:45:19] seems like 2 unrelated issues [03:45:32] bd808: what server is behind git-ssh ? [03:45:47] that is a fine question :) [03:46:29] phab1001-vcs.eqiad.wmnet? [03:47:04] so Connection refused usually means routing OK but nothing listening on that port (fyi) [03:47:37] oh, it's an LVS [03:48:40] The lvs part may be brand new from the switch [03:48:47] I'm unsure about that [03:50:12] yeah, I'm trying to follow the clues [03:55:04] XioNoX: from backscroll in -operations it looks like maybe the backend is phab1001.eqiad.wmnet [03:55:14] yeah it is [03:55:31] the IPs match [03:55:55] *nod* it looks like they fought with pybal for a while too. [03:56:26] y'all will probably figure it out as soon as Daniel or Mukunda are around to fill in the blanks [03:56:37] yeah, I need to figure out which LVS is hosting the public IP [03:57:03] ah lvs1002.wikimedia.org. 
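A minimal sketch of what the advice above might look like in practice for a tool like cobot: the "-l hostname=" restriction is the one chasemp added to the crontab for testing, and the /tmp scratch directory follows bd808's suggestion to keep heavy intermediate I/O off the shared NFS home. The job script name and paths here are made up for illustration.

```
# Hypothetical crontab entry pinning a grid job to one exec node for testing,
# as chasemp did for this tool (the script name is an example):
#   0 * * * * jsub -l hostname=tools-exec-1422 $HOME/update.sh

# Inside the job, keep the "write and then read hundreds of megs" working set
# on node-local /tmp instead of the NFS-backed tool home:
SCRATCH=$(mktemp -d /tmp/cobot-scratch.XXXXXX)
trap 'rm -rf "$SCRATCH"' EXIT        # always clean up the scratch space on exit
# ... generate and consume the large temporary files under "$SCRATCH" ...
# Only copy the small, final results back to the NFS home at the end.
```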
[04:08:08] back, what's up [04:08:26] why did we get paged about phd now [04:08:53] ah cool [04:09:46] first fixing the paging issue [04:15:54] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0] [04:25:34] 10VPS-project-Wikistats: List of largest MediaWikis does not update Wikias - https://phabricator.wikimedia.org/T172481#3499774 (10LWChris) [05:27:18] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [06:02:18] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [07:42:50] 10Tools: Tool "commons-coverage" loads map tiles from OSM - https://phabricator.wikimedia.org/T172395#3499954 (10Emijrp) @Ricordisamoa Thanks. Then I think I don't have to use it in my code anymore. [07:55:31] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3496182 (10TheDJ) ``` 2017-08-01 13:05:33: (server.c.1444) [note] sockets disabled, connection limit reached ``` I have restarted the web service. I also fixed a problem with it's ErrorHandler that had been spamming the error lo... [07:55:44] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3499987 (10TheDJ) 05Open>03Resolved a:03TheDJ [08:02:47] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500000 (10Fnielsen) Thanks for reminding me of this. I will change it to a toolforge serving. [08:43:10] 10Tool-Erwin's-tools: 502 Bad Gateway - https://phabricator.wikimedia.org/T172360#3500140 (10Supernino) Thank you :) [08:45:59] 10Tools: Tool "comidentgen" loads bootstrap and jquery from ajax.googleapis.com and bootstrapcdn - https://phabricator.wikimedia.org/T172391#3500149 (10Samtar) p:05Triage>03Normal Thanks for the task, I'll get this one done today and check the list for any other of my tools [09:01:53] 10Cloud-Services, 10DBA: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3500190 (10Marostegui) a:03Marostegui I have sanitized all the hosts and ran a check_private data there. I have also registered myself and checked that on the labs hosts my user has been... [09:06:53] 10cloud-services-team: Add socket parameter to maintain-views script - https://phabricator.wikimedia.org/T172496#3500217 (10Marostegui) [09:06:59] 10cloud-services-team: Add socket parameter to maintain-views script - https://phabricator.wikimedia.org/T172496#3500229 (10Marostegui) p:05Triage>03Normal [09:48:12] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500351 (10Fnielsen) This is now fixed with https://github.com/fnielsen/cvrminer/commit/49975504a590a1ae53e2e8cc81aadea277cc5600 I have checked at https://tools.wmflabs.org/cvrminer/ [09:51:44] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500388 (10Fnielsen) Can you confirm that it no longer loads from third-party or should I just close the task myself? 
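For the string of "Tool X loads assets from ..." tasks above, one rough way to answer Fnielsen's "can you confirm it no longer loads from third parties" question is to fetch the tool's page and list any asset URLs that do not point back at Wikimedia-run hosts. This is only a sketch: the tool URL and the allowed-domain pattern are examples, and it only catches resources referenced directly in the HTML; anything pulled in by JavaScript still needs a look in the browser's network console (where zhuyifei1999 also spotted the 404s).

```
# Rough check for third-party resource loads on a tool's page (illustrative;
# adjust the tool URL and the allowed-domain pattern as needed).
curl -s https://tools.wmflabs.org/cvrminer/ \
  | grep -Eo '(src|href)="https?://[^"]+"' \
  | grep -Ev 'wmflabs\.org|wikimedia\.org|wikipedia\.org|mediawiki\.org' \
  | sort -u
# Empty output means the page's HTML itself references no external hosts.
```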
[09:52:26] PROBLEM - Puppet errors on tools-bastion-03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [10:27:25] RECOVERY - Puppet errors on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [10:38:19] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [11:03:19] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [12:30:36] 10Tools, 10Toolforge-standards-committee, 10Privacy: Hunt for Toolforge tools that loads resources from third party sites - https://phabricator.wikimedia.org/T172065#3500698 (10zhuyifei1999) [12:30:36] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500696 (10zhuyifei1999) 05Open>03Resolved LGTM, though in the console I see a lot of 404s. [12:48:27] 10Tools: Tool "cvrminer" loads bootstrap and jquery from cloudflare - https://phabricator.wikimedia.org/T172400#3500728 (10Fnielsen) Yes, the 404s is a known problem, that I should fix [13:38:13] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500901 (10zhuyifei1999) @Huji Your query https://quarry.wmflabs.org/query/20697 is affected by this bug. [13:38:26] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500905 (10zhuyifei1999) p:05Low>03Triage [13:44:25] 10Tools: Tool "dimensioner" loads jquery and bootstrap from bootstrapcdn and ajax.googleapis.com - https://phabricator.wikimedia.org/T172516#3500910 (10zhuyifei1999) [13:48:19] 10Tools: Tool "editathonstat" loads assets from google - https://phabricator.wikimedia.org/T172517#3500927 (10zhuyifei1999) [13:51:40] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#3500951 (10Huji) Noted, thanks! [13:52:01] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3500955 (10zhuyifei1999) [13:52:45] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3500970 (10zhuyifei1999) The tool maintainer is Kentaur. I'm unable to find their Phabricator username. [13:59:22] 10Tools: Tool "etytree" loads jquery, jquery-mobile, and jquery-ui from code.jquery.com, mixed http/https - https://phabricator.wikimedia.org/T172519#3500978 (10zhuyifei1999) [14:04:58] 10Tools: Tool "everythingisconnected" loads assets from google - https://phabricator.wikimedia.org/T172522#3501026 (10zhuyifei1999) [14:09:13] 10Tools: Tool "fastilybot" loads assets from google and bootstrapcdn - https://phabricator.wikimedia.org/T172524#3501052 (10zhuyifei1999) [14:25:19] 10Tools: Tool "gdk-artists-research" loads assets from freegeoip.net - https://phabricator.wikimedia.org/T172527#3501131 (10zhuyifei1999) [14:35:51] 10Tools: Tool "gendergapdashboard" loads assets from many sites - https://phabricator.wikimedia.org/T172530#3501180 (10zhuyifei1999) [14:36:45] 10Tools: Tool "gendergapdashboard" loads assets from many sites - https://phabricator.wikimedia.org/T172530#3501194 (10zhuyifei1999) [14:38:47] zhuyifei1999_: you are a phabricator machine, thanks so much for doing all that, are you going to be in Montreal for wikimania by chance? [14:39:13] no :/ [14:39:25] I went there years ago though [14:39:45] (I mean Montreal) [14:40:39] ah bummer! 
Well hopefully I'll see you sometime this year but either way I appreciate what you're doing [14:40:48] btw, I see a tool by WMDE that simply redirects to a WMDE site. [14:41:00] haven't created a ticket yet though. shall I? [14:41:49] I'll be in illinois starting from mid-september [14:42:43] *mid-august [14:45:48] zhuyifei1999_: I think the official thought has been that redirecting from Toolforge to prod is technically ok as it's to an in theory more trusted space, but I don't know about WMDE's privacy policy and all that [14:46:41] well yeah. if they redirect to google I would yell at them :P [14:48:39] heh [14:51:24] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501239 (10Andrew) [14:52:41] 10Tools: Tool "geoplotter" loads assets from google, bootstrapcdn, cloudflare, and cartocdn; mixed http/https - https://phabricator.wikimedia.org/T172533#3501275 (10zhuyifei1999) [14:57:28] 10Tool-Gerrit-Patch-Uploader: Tool "gerrit-patch-uploader" loads fork-me-on-github ribbon from Amazon AWS - https://phabricator.wikimedia.org/T172535#3501310 (10zhuyifei1999) [15:22:35] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login, OAuth error: API error mwoauth-invalid-authorization - https://phabricator.wikimedia.org/T136114#3501435 (10MarcoAurelio) >>! In T136114#3493304, @yuvipanda wrote: > @MarcoAurelio - is that showing up at paws.tools.wmflabs.org and... [15:24:44] 10Tools: Tool "etwikibots" loads fork-me-on-github ribbon from github - https://phabricator.wikimedia.org/T172518#3501439 (10zhuyifei1999) a:03WikedKentaur [15:32:05] ☁️ Wikimedia Cloud Services (wikitech.wikimedia.org) | Use !help to find assistance | Status: Move to new puppetmaster in progress | missing data? see T169774 | Create a phab task if you don’t get your problem resolved in chat! | Channel logs: https://wm-bot.wmflabs.org/logs/%23wikimedia-cloud/ | Code of Conduct: https://www.mediawiki.org/wiki/CoC | on call: madhuvishy or !help [15:32:05] T169774: Cleanup: 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774 [15:32:22] For people following along… I'm in the process of moving all existing hosts to a new puppetmaster. [15:32:42] step one: disable puppet on all hosts so I can avoid collisions with existing puppet runs when I switch things over... [15:34:24] Cyberpower678: will reach your IABot tomorrow [15:34:43] zhuyifei1999_: ? [15:34:59] * paladox wonders how you got the full puppet message to show "PROBLEM - puppet on phabricator.phabricator.eqiad.wmflabs is WARNING: WARNING: Puppet is currently disabled, message: disabled during transition to new puppet master 2017-08-04, last run 10 minutes ago with 0 failures" [15:35:11] when i do it it only cuts half of [15:35:14] https://phabricator.wikimedia.org/T172065 [15:35:24] Cyberpower678: ^ [15:36:54] zhuyifei1999_: oh right. I forgot about that, again... [15:37:18] I'm not pushing any updates, until I'm able to fix that other issue. [15:37:47] paladox: where do you see that message? 
I only know about `puppet agent -tv` and `puppet apply` [15:37:51] Cyberpower678: k [15:38:07] paladox: I'm not sure I understand the question, but… I did "puppet agent --disable 'reason I am disabling'" [15:38:47] ah [15:38:51] you did it in strings [15:39:01] i just did puppet agent --disable [15:39:42] zhuyifei1999_ i see it in #wikimedia-bots-testing and #wikimedia-ai :) [15:40:17] hm okay [15:42:15] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501452 (10Andrew) [15:42:58] Now I'm switching the puppetmaster on all hosts that don't use a project-local puppetmaster, with this mouthful: [15:42:59] grep 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf && sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf && rm -rf /var/lib/puppet/ssl && puppet agent --enable && puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=`hostname -f` --server=labs-puppetmaster.wikimedia.org [15:43:16] (that will leave puppet disabled on hosts with project-local puppetmasters; I'll fix that next) [15:48:32] 10Cloud-Services, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3501476 (10RobH) a:05RobH>03Cmjohnson This does indeed need hardware raid setup, please setup in a large raid10 of all disks via crash cart, then I can install. Al... [16:14:45] now re-enabling puppet everywhere (including locally hosted things) [16:17:49] chasemp: Heyas, got in the systems on https://phabricator.wikimedia.org/T162486 [16:17:52] and im making a racking task [16:17:59] What vlan did you guys need these two hosts in? [16:18:09] (or any other cloud opsen would likely know ;) [16:18:31] robh: same as normal labvirt, two interfaces. eth0 is in labs-hosts, eth1 is a trunk that allows labs-instances-b [16:19:02] chasemp: ok, cuz the systems they are replacing are in labs-support1-c-eqiad [16:19:07] not in labvirt vlans [16:19:19] but i know a lot of stuff is shifting around [16:19:43] robh: yep, these are big virts that are going to replace the functionality of labsdb1004-1007 w/ instances [16:19:45] but now these need to be in labs-hosts1-b-eqiad? [16:19:47] big as in storage [16:19:49] cool [16:19:49] yes [16:19:56] ok, so row b restricted for now [16:20:01] it is [16:20:08] glad i checked, thanks! [16:20:15] ditto [16:20:20] on thanks :) [16:22:32] so these have both 10 and 1 Gbit [16:22:41] but pretty sure all of labs is doing 1gbit for now since we dont have the networking support [16:22:48] correct me if thats wrong ;D [16:24:05] robh: both sides being 1G for now is ok, we'll deal w/ that when it comes to it [16:24:45] I think the 10G inclusion is part of a historical mirroring of old labvirt specs and it carried over [16:24:53] nice to have onboard already but not a big concern atm [16:25:16] 10Cloud-Services, 10Operations, 10procurement: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:25:21] 10Cloud-Services, 10Operations, 10ops-eqiad: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:25:28] ok, feel free to double check that [16:25:43] (task) for accuracy [16:25:49] oh, i forgot, what os? [16:26:25] robh: all virts are still ubuntu trusty [16:26:43] OK, that's all the easy cases. 
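For readability, here is the same sequence andrewbogott pastes above, written out as a commented script. It is only a restatement of the quoted one-liner using the standard puppet agent command-line options, not an official migration tool.

```
#!/bin/bash
# Switch an instance from the old central puppetmaster to the new one (run as root).
# Only act on hosts still pointed at the old master:
grep -q 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf || exit 0

# Rewrite the master name in puppet.conf:
sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf

# Drop the certificates signed by the old master so a fresh one is requested:
rm -rf /var/lib/puppet/ssl

# Re-enable the agent (it was disabled earlier with `puppet agent --disable '<reason>'`)
# and do a single verbose run against the new master:
puppet agent --enable
puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff \
    --waitforcert=10 --certname="$(hostname -f)" --server=labs-puppetmaster.wikimedia.org
```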
Now, picking up stragglers that didn't switch over (probably because puppet can't actually run at all.) Doing this with "grep labs-puppetmaster-eqiad.wikimedia.org /etc/puppet/puppet.conf.d/10-main.conf && hostname -f" [16:26:50] 10Cloud-Services, 10Operations, 10ops-eqiad: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:26:54] duly noted, i assumed but wanted to check ;] [16:27:32] robh: cool man thanks for handling [16:28:07] welcome =] [16:28:34] huh, never seen this before [16:28:37] https://www.irccloud.com/pastebin/HUjt6BH2/ [16:29:40] https://www.irccloud.com/pastebin/C2kjXQ4o/ [16:29:47] oops, double paste [16:30:22] andrewbogott i thought puppet was uninstalled on there? [16:30:23] there's a task for huggle [16:30:39] https://phabricator.wikimedia.org/T166588 [16:30:41] maybe so! If there's no puppet then I should probably just delete the instance [16:30:59] ok, thanks for the link [16:31:09] I'll ignore for now [16:31:15] your welcome :) [16:31:34] yeah, that huggle instance is a problem child :/ [16:39:38] There are 34 VMs which should be managed by the central puppetmaster but which have broken puppet. [16:39:39] https://etherpad.wikimedia.org/p/labpuppetmaster1001-stragglers [16:39:51] So, this is how I'll be spending the rest of my day :) [16:46:03] andrewbogott: thanks man! did you log the time anywhere that you started the cut over? [16:46:32] chasemp: not really, I couldn't decide if syslog or project log so I just narrated here instead [16:46:37] for short-term records at least [16:46:42] kk got it [16:54:22] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501682 (10bd808) p:05Triage>03High [16:58:08] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501686 (10bd808) When @andrew went through all #cloud-vps vms today to switch them to the new project-wide puppetmaster, this instance was the only vm that is not directly or indirectl... [17:05:03] 10Cloud-VPS, 10Huggle: huggle.huggle.eqiad.wmflabs does not have puppet installed - https://phabricator.wikimedia.org/T166588#3501694 (10bd808) [17:18:24] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501730 (10bd808) [17:25:55] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501753 (10bd808) `::profile::puppetmaster::common` seems to be applied somehow via [[https://tools.wmflabs.org/openstack-browser/puppetclass/role::puppet_compiler... [17:29:55] !log grantreview rebooting grantreview-dev because its dns/dhcp is messed up [17:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Grantreview/SAL [17:37:44] addshore, harej, librarybase-reston-01.librarybase.eqiad.wmflabs has a full drive and is causing me some trouble, could one of you have a look? [17:37:54] tarrow: ^^ [17:37:56] the usage is all under www [17:38:03] so I'm reluctant to just delete things as I would under /var/log [17:38:20] yeah, good thinking. I'll look [17:38:24] :) [17:40:20] addshore: can you log in? I get Permission denied public key (perhaps because of a full disk?) [17:40:45] nope, I cant login either [17:41:04] andrewbogott: any chance you could try and delete something that looks useless? 
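Before anyone could even log in to librarybase-reston-01, the underlying question is simply what is filling the disk. A generic sketch for that first step, run as root; the /var/www path is an assumption based on "the usage is all under www":

```
df -h /                                                # confirm the root filesystem really is full
du -xh /var/www 2>/dev/null | sort -rh | head -n 20    # largest directories under the web root
find /var/www -xdev -type f -size +500M -exec ls -lh {} + 2>/dev/null   # individual huge files
```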
:D [17:41:06] andrewbogott: can you delete something to make enough space for us to log in? [17:41:14] heh [17:41:24] 10Tools: templatecount tool inaccessible due to 502 Bad Gateway - https://phabricator.wikimedia.org/T172549#3501794 (10Jeff_G) [17:41:40] I already deleted a bunch of things, it just filled up again instantly [17:41:43] but yes, I'll have another go [17:42:41] try now, quickly [17:42:45] tarrow: ^ [17:43:12] trying [17:43:46] ok, I deleted today's apache logs too :/ [17:43:50] could you bring the webservice down and then delete some stuff? [17:44:06] no joy yet for me [17:44:48] PROBLEM - Puppet errors on tools-puppetmaster-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:45:05] 'some stuff'? [17:47:33] Still no joy for me D: [17:48:00] without being able to log in I can't give you the exact path but... there should be a subdirectory of the webserver root which has dump in it [17:48:07] I imagine these aren't being rotated [17:48:12] or the rotation has failed somehow [17:48:36] probably called rdf [17:48:42] or rdf_dumps or something [17:49:15] html/rdf/librarybase-rdf-20170713.rdf ? [17:49:25] yeah, that's fine to delete [17:49:39] is there just one dump? [17:49:42] lots! [17:49:45] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:49:51] 11G worth [17:49:56] feel free to trash all by the last few days [17:50:00] but* [17:50:36] say leave me the most recent 3 days? [17:50:46] ok, I left 5 [17:50:57] great! I'll then figure out why they aren't rotating [17:51:05] puppet seems happier now so I'll log off and leave this to you. Thanks [17:51:25] great! I still can't login. I guess because puppet wasn't able to run [17:51:40] I imagine it will sort out in the next 5 mins [17:52:53] well, puppet isn't exactly 'happy', you should probably check warnings and such [17:54:32] tarrow: I'm in now :) [17:54:53] me too [17:55:07] cool, ill leave you to sort it :) [17:58:37] We (I) should find out what harej is up to with it. I'm not sure if much is actively changing there right now. We might be able to scale it back a bit [17:58:52] I'm not really doing anything with it at the moment [18:09:54] PROBLEM - Puppet errors on tools-puppetmaster-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [18:14:45] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [18:24:04] andrewbogott: I guess with unreachable instances you mean instances shutdown etc? [18:24:48] RECOVERY - Puppet errors on tools-puppetmaster-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:24:54] RECOVERY - Puppet errors on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:32:51] Sagan: being shutdown is an example of a way to be unreachable [18:34:47] andrewbogott: I've got an instance which had that status. Do I need to run puppet first, or do you need to take some steps first? 
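Returning to the librarybase dump cleanup above: once the old RDF dumps are identified as the culprit, the "figure out why they aren't rotating" follow-up could be as simple as a nightly pruning job. A hypothetical sketch keeping the five newest dumps; the path and filename pattern are guesses based on the single example mentioned in the log, so verify them (and dry-run with echo) before trusting it.

```
# Hypothetical crontab entry: every night at 03:00, delete all but the five newest dumps.
# Safe with these date-stamped names (no spaces or newlines in the filenames).
0 3 * * * cd /var/www/html/rdf && ls -1t librarybase-rdf-*.rdf | tail -n +6 | xargs -r rm --
```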
[18:35:14] it's cr1.codereview.eqiad.wmflabs [18:35:50] (I've started it some minutes ago) [18:35:57] Sagan, I pasted the command above, let's see… [18:35:58] grep 'server = labs-puppetmaster-eqiad.wikimedia.org' /etc/puppet/puppet.conf && sed -i 's/labs-puppetmaster-eqiad.wikimedia.org/labs-puppetmaster.wikimedia.org/g' /etc/puppet/puppet.conf && rm -rf /var/lib/puppet/ssl && puppet agent --enable && puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=10 --certname=`hostname -f` --server=labs-puppetmaster.wikimedia.org [18:36:01] that should do it [18:36:28] andrewbogott: as root I guess? [18:36:31] yeah [18:36:42] and if it goes well, mark it on the etherpad so it saves me a visit :) [18:37:23] andrewbogott: where do I find the etherpad? :) [18:37:38] Oh, I thought that was what you were asking about in the first place [18:37:43] https://etherpad.wikimedia.org/p/labpuppetmaster1001-stragglers [18:38:07] no, I just read the mail which says that unreachable instances need extra work, that's why I asked if that instance is affected :) [18:38:13] ok, thanks [18:38:13] yeah, it works :) [18:38:49] I'm out for a bit, back in maybe 45 minutes [18:39:18] I've moved it to that fixed section :) [18:40:19] let's see if other projects are affected where I'm admin as well [18:41:10] andrewbogott what if we had an instance shut off when you did the migration? [18:41:18] What do we put in puppet.conf? [18:41:34] oh sorry for ping just realised it is connected to its own puppet master [18:45:45] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [19:06:08] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502028 (10bd808) Caused by a refactoring in progress for `::role::puppet_compiler` by @Joe: https://gerrit.wikimedia.org/r/#/c/370205/1 [19:07:30] andrewbogott: for the record: I added notes to the instances I tried but failed [19:07:39] (at the etherpad) [19:09:31] andrewbogott phab-01 uses its own puppetmaster puppet-phabricator [19:24:43] phab-test.contributors.eqiad.wmflabs (i have no access to it but the labs class was removed) [19:24:51] needs moving to phabricators prod role [19:24:54] which works. [19:25:06] though will need tweaking from the labs one [19:25:15] (i mean configs in hiera) [19:26:16] 10Cloud-VPS, 10Puppet: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502061 (10bd808) a:03Joe [19:32:10] Sagan: did you shut it back down again? [19:32:47] andrewbogott: I think so. or should I keep it running? [19:33:16] the instance is not permanently used. or is it better to keep them running anyway? [19:34:17] better that they are running or deleted [19:35:01] a halted instance won't get config changes and may fail to be reachable after booting [19:36:09] Sagan: I definitely can't fix it if it's off :) Probably better to keep it running unless it's actually useless in which case you can delete [19:36:38] andrewbogott: ok. I started it now again :) [19:38:31] Sagan: looks fine to me, but let's leave it up [19:38:41] ok :) [19:45:47] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:29] are there any known problems with packet loss? 
I'm monitoring my labs services from an external server too, and every day I get at least one check which returns a packet loss [19:47:48] 16% today, 16% yesterday, 28% on the second [19:47:59] not sure if that is me or labs, but other checks are ok [19:52:08] Sagan: we'd need a lot more information to debug that at all [19:52:27] where is the check from and to and at what time etc [19:52:40] packet loss in the internet is normal as long as its not persistent [19:53:29] bd808: hm, ok, then I'd say it's nothing important. It's just one check that showed up, and the check 30 sec later showed packet loss = 0% [19:53:45] as long as I don't get the 80% I had once, again ;) [19:59:13] (03PS1) 10Andrew Bogott: Add profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/370247 [20:01:09] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/370247 (owner: 10Andrew Bogott) [20:13:16] 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, 10Patch-For-Review: Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3502143 (10Andrew) [20:13:20] Sagan: 80% for a single check is probably not itself a big deal. 80% for *all* tcp/icmp comms is :) [20:13:37] ah, yeah :D [20:14:03] I guess one reason why icinga only sends an alert after the 5th failed check [20:15:40] Sagan you can increase it if you want :) [20:15:51] though i've never experienced problems with the ping :) [20:15:56] paladox: heh, only one check failed [20:16:02] yep [20:16:37] paladox: my icinga is located at my own server, so the ping needs a bit longer, that's why I already increased the ping wtime [20:16:48] oh i see [20:18:36] paladox: 101 ms response time is probably bad between servers in one DC, but not between germany and usa [20:18:41] oh [20:19:08] I thought we get connected through esams then into the us? [20:21:26] if we use labs too? [20:21:35] I guess that's only for prod [20:22:01] anyway, the normal ping is between 98 and 101 [20:22:12] *the average [20:47:52] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502296 (10Dzahn) ``` MariaDB [wikistats]> select id,prefix,description,statsurl from wmspecials where prefix like "wikimania%"; +-----+---------------+----------------+--------------------------------------------... [20:49:32] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502297 (10Dzahn) http://wikistats.wmflabs.org/display.php?t=wx and click on description column to sort alpha [20:51:54] (03Draft1) 10Paladox: Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 [20:51:56] (03PS2) 10Paladox: Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 [20:51:59] (03CR) 10Paladox: [V: 032 C: 032] Add apt.conf file [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370258 (owner: 10Paladox) [20:53:14] (03Draft3) 10Zppix: rm dupe checks on gerrit-mysql [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370255 [20:53:39] 10VPS-project-Wikistats: wikistats: add wikimania wikis - https://phabricator.wikimedia.org/T172342#3502299 (10Dzahn) 05Open>03Resolved ``` MariaDB [wikistats]> insert into wmspecials (prefix,description,statsurl) values ("wikimania2017", "Wikimania 2017", "https://wikimania2017.wikimedia.org/w/api.php?actio... 
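Back on the packet-loss question earlier in the hour: the point made above is that a single bad sample means little, so the useful check is a longer probe run against the monitored endpoint. A sketch with standard tools (the target host is just an example):

```
# Summary loss percentage and round-trip times over 100 probes:
ping -c 100 -q tools.wmflabs.org

# Per-hop loss, which helps tell a problem near the monitoring server
# from one at the far end:
mtr --report --report-cycles 100 tools.wmflabs.org
```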
[20:53:41] (03CR) 10Paladox: [V: 032 C: 032] rm dupe checks on gerrit-mysql [labs/icinga2] - 10https://gerrit.wikimedia.org/r/370255 (owner: 10Zppix) [21:01:07] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502311 (10Reedy) [21:03:29] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502330 (10ShakespeareFan00) As originator of the prior query can confirm this. Currently the query in 20709 is pared down to a minimal example. I was seeing a similar issue with a number of rows i... [21:07:39] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502350 (10chasemp) p:05Triage>03Normal @Reedy, any difference if you hit the in-progress new labsdb cluster @ labsdb-web.eqiad.wmnet? [21:08:18] 10VPS-project-Wikistats: Add hi.wikiversity to wikistats - https://phabricator.wikimedia.org/T171831#3502353 (10Dzahn) 05Open>03Resolved ``` MariaDB [wikistats]> insert into wikiversity (prefix,method) values ("hi","8"); MariaDB [wikistats]> update wikiversity set lang="Hindi",loclang="हिन&... [21:13:41] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502360 (10Reedy) Yup, that replica seems good... ``` reedy@tools-bastion-03:~$ mysql --defaults-file=$HOME/replica.my.cnf --host labsdb-web.eqiad.wmnet Welcome to the MariaDB monitor. Commands end... [21:19:11] 10Data-Services: Data missing from labs replica of enwiki.imagelinks - https://phabricator.wikimedia.org/T172567#3502380 (10Reedy) Seems this is another case of T138967 [21:25:25] !log contributors removed missing role role::phabricator::labs from puppet config on phab-test [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Contributors/SAL [21:29:42] !log otrs removing missing class role::otrs::webserver from otrs-memoryleak [21:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Otrs/SAL [21:31:43] !log packaging removing broken class role::builder from packager02 instance [21:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Packaging/SAL [21:33:56] !log swift removing broken package role::swift::storage from prefix swift-stretch-ms-me prefix. It was throwing "Error 400 on SERVER: Could not find data item swift::proxy::memcached_servers in any Hiera data file and no default supplied at /etc/puppet/modules/role/manifests/swift/storage.pp" which broke puppet runs entirely [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Swift/SAL