[00:03:59] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Internal Server Error from new articleinfo interface in XTools - https://phabricator.wikimedia.org/T169767#3414312 (kaldari) >defaults.yml was my idea. In fact, most of the configuration setup is mine - I'll document it better. Thanks Matthew!
[00:10:04] (Draft1) Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - https://gerrit.wikimedia.org/r/363755
[00:10:06] (PS2) Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - https://gerrit.wikimedia.org/r/363755
[00:22:43] (CR) Dzahn: "the ".pub" part goes in the public repo and the private part in labs/private" [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[00:24:43] (CR) Dzahn: [C: -1] "i would say:" [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[00:27:27] (PS3) Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - https://gerrit.wikimedia.org/r/363755
[00:29:21] Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3414370 (Fnielsen)
[00:33:19] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Address all the TODOs in the new XTools interface - https://phabricator.wikimedia.org/T169829#3414375 (kaldari) Having trouble testing due to XTools slowness.
[00:34:29] there seems to be slowness for toolforge as reported by users... is it possible someone could take a quick glance?
[00:44:22] Zppix: reports where?
[00:44:28] phab
[00:44:54] could you link please?
[00:47:21] Zppix: I am not sure what toolforge slowness means, please link to the phab tasks so we can look. Thanks!
[00:47:37] https://phabricator.wikimedia.org/T169954#3414370
[00:48:41] Zppix: is this the only one?
[00:48:49] (that you know of) [00:48:57] That i still have open yes [00:49:07] Zppix: okay, thanks, I'm looking [00:49:12] no problem [01:11:13] PROBLEM - Puppet errors on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:46:12] RECOVERY - Puppet errors on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:04:41] 10Toolforge, 10Tools: Someone deleted folder with my bots on Tool Labs - https://phabricator.wikimedia.org/T169736#3414400 (10chasemp) >>! In T169736#3410854, @MaxBioHazard wrote: > Not restored yet. I executed `cp -R -p /data/scratch/T169774/mbh/bots /data/project/mbh/bots/` which should restore your files o... [02:05:14] 10Cloud-Services: Files (Not directories) in Tools /data/project/sigma/cherry/cherryhtml reported missing - https://phabricator.wikimedia.org/T169756#3414401 (10chasemp) @Sigma your files should available for restore on tools-bastion-03 at `/data/scratch/T169774/sigma/cherry/cherryhtml/` [02:06:31] 10Data-Services, 10Toolforge, 10cloud-services-team (Kanban): Toolforge data loss for permissive data July 2 2017 - https://phabricator.wikimedia.org/T169774#3414402 (10chasemp) This data should be available in the scratch share in a directory named `T169774`. Permissions, ownership, etc should be preserved... [02:08:48] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: EditCounter's "pages created" does not match Pages tool - https://phabricator.wikimedia.org/T169955#3414403 (10MusikAnimal) [02:28:29] 21:22 ⇢ Mutter (~Mutter@2600:380:7c34:1427:4928:be48:e98:e0ec) has joined the channel [02:52:57] hello :) In the xtools VPS what redis server should we be using? is redis://tools-redis-1001.tools.eqiad.wmflabs okay? 
[03:24:29] PROBLEM - Puppet errors on tools-exec-1426 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [03:54:30] RECOVERY - Puppet errors on tools-exec-1426 is OK: OK: Less than 1.00% above the threshold [0.0] [05:05:34] 10Cloud-Services: Files (Not directories) in Tools /data/project/sigma/cherry/cherryhtml reported missing - https://phabricator.wikimedia.org/T169756#3414453 (10Sigma) Thank you. [05:06:24] 10Data-Services, 10Toolforge, 10cloud-services-team (Kanban): Toolforge data loss for permissive data July 2 2017 - https://phabricator.wikimedia.org/T169774#3414455 (10Sigma) [05:06:27] 10Cloud-Services: Files (Not directories) in Tools /data/project/sigma/cherry/cherryhtml reported missing - https://phabricator.wikimedia.org/T169756#3414454 (10Sigma) 05Open>03Resolved [05:15:53] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:43:20] samwilson: if you want proper security, setup a redis server in your project [05:43:42] tool labs' redis server is not password-protected [05:43:56] *uh, toolforge's [05:44:49] besides, if it is within xtools project the security groups can effectively disable access from any outside projects [05:46:19] 10Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3414359 (10zhuyifei1999) signal 9 is SIGKILL. OOM killer? [05:48:20] zhuyifei1999_: ah, that makes sense, thanks! [05:50:45] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Set up load balancing for new XTools - https://phabricator.wikimedia.org/T169590#3414617 (10Samwilson) The other part of this is Redis. @zhuyifei1999 says that we shouldn't use tools' Redis server, but run our own. Can this just be on one of the two new prod s... 
[05:51:27] zhuyifei1999_: ^ don't worry, I'm not dobbing you in to configure anything :-)
[05:53:44] lol
[05:54:35] zhuyifei1999_: isn't using the public redis pretty secure, as long as you add a secret string to the key? IIRC the cloud services team disabled the key indexes
[05:55:59] harej: their method is https://github.com/wikimedia/puppet/blob/production/modules/toollabs/manifests/redis.pp#L43 disabling some affected commands
[05:56:59] who knows if someone extremely-knowledgeable-with-redis could exploit it with non-disabled commands
[05:57:17] also, redis pubsub is still public
[06:01:03] Data-Services, cloud-services-team (Kanban), DBA, MediaWiki-extensions-Babel, Patch-For-Review: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3108521 (Marostegui) Hi Andrew, labsdb1009 and labsdb1010 have puppet disabled because: ``` The last Puppet run was at...
[06:01:42] so if you subscribe to all channels in tools-redis pubsub you'll see a lot of public messages going on. if someone is not careful they might leak their prefix into it. (also it means anyone can fake messages in pubsub)
[06:05:48] see also https://redis.io/topics/security
[06:16:25] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Give visual feedback while Editcounter is thinking - https://phabricator.wikimedia.org/T169831#3414626 (Samwilson) I've added a ponderator to all submit buttons. It's nothing fancy: just stops the re-clicking of the submit button, changes the text to 'Loadin...
[06:18:49] zhuyifei1999_ harej : we're not caching anything that's not public info; does that make it more ok?
[06:19:07] uh I guess
[06:19:19] but still better to use our own?
[06:19:29] but if your prefix is leaked anyone may write to it
[06:19:48] ah yeah that's true. could be confusing if nothing else!
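Editor's note on the "secret string in the key" idea discussed above: a minimal sketch, assuming a tool generates one random prefix and prepends it to every key it stores on a shared Redis. The helper names `make_prefix` and `prefixed_key` and the `editcount:Kaldari` key are hypothetical, chosen only for illustration. The sketch builds key strings and does not talk to Redis. As noted in the discussion, a prefix like this only hides keys from guessing; it does nothing for pubsub, and anyone who learns the prefix can read or overwrite the keys.

```python
import secrets


def make_prefix(nbytes: int = 16) -> str:
    """Return a random hex string to use as a hard-to-guess key prefix.

    Hypothetical helper: generate it once and keep it in a private
    config file, never in logs or pubsub messages.
    """
    return secrets.token_hex(nbytes)


def prefixed_key(prefix: str, name: str) -> str:
    """Build the full key by joining the secret prefix and a readable name."""
    return f"{prefix}:{name}"


if __name__ == "__main__":
    prefix = make_prefix()
    # The printed key differs on every run because the prefix is random.
    print(prefixed_key(prefix, "editcount:Kaldari"))
```

Running a private Redis inside the project, as suggested in the channel, avoids depending on the prefix staying secret at all.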
[06:22:43] reading https://redis.io/commands there're like a few commands I can use to break tools-redis
[06:25:18] if we want to use it while in development, is redis://tools-redis-1001.tools.eqiad.wmflabs the right server to use?
[06:39:37] tools-redis
[06:40:03] I think there's some failover mechanism to 1001 and other hosts
[06:41:14] https://wikitech.wikimedia.org/wiki/Portal:Tool_Labs/Admin#Redis
[06:41:25] tools-redis.tools.eqiad.wmflabs
[06:41:48] anyways I think I just broke redis security
[06:42:54] hehe oh dear
[06:44:27] I'm getting Connection timed out [tcp://tools-redis.tools.eqiad.wmflabs:6379]
[06:46:50] harej chasemp bd808: https://phabricator.wikimedia.org/T169957
[06:47:00] samwilson: o.O
[06:47:26] huh, don't I usually have access to WMF-NDA tasks
[06:48:45] cc'ed you
[06:49:45] https://phabricator.wikimedia.org/P5696 the paste is visible to security-ers and nda-ers
[06:50:05] the task is visible to cc-ed and security-ers
[06:50:53] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:52:39] it's interesting how some 9 out of 10 keys found are unprefixed
[06:52:53] interesting.
[06:54:50] samwilson: I guess the security groups of tools-redis don't allow outside access
[06:55:11] so you gotta set up a redis server anyhow
[06:55:38] zhuyifei1999_: cool, will do
[06:55:43] tools-redis-1001.tools.eqiad.wmflabs and tools-redis.tools.eqiad.wmflabs resolve to the same address
[06:58:39] yuvi's fault https://github.com/wikimedia/puppet/commit/2eda5e37839bbf7be375dcdb6f848547757acd8c#diff-7f22e90a5d85eb6a6a9d908d1e51ffcdR22 :P just kidding
[07:38:24] PROBLEM - Free space - all mounts on tools-worker-1020 is CRITICAL: CRITICAL: tools.tools-worker-1020.diskspace.root.byte_percentfree (<100.00%)
[08:23:23] (PS1) Alexandros Kosiaris: Add role::ores::stresstest hieradata [labs/private] - https://gerrit.wikimedia.org/r/363782
[08:24:16] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add role::ores::stresstest hieradata [labs/private] - https://gerrit.wikimedia.org/r/363782 (owner: Alexandros Kosiaris)
[09:53:31] Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3415065 (Fnielsen) I see "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py" ``` 'python': { 'cls': PythonWebService, 'image': 'toollabs-python-web',...
[09:59:39] cloud-services-team, DBA, Operations, Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415085 (jcrespo) Someone announced 60 seconds of downtime, which I do not think is reasonable- rebooting fully a server and all its services takes around 3...
[10:27:16] PROBLEM - Puppet errors on tools-worker-1018 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:02:17] RECOVERY - Puppet errors on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [11:16:56] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [11:51:53] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [12:42:54] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [13:22:53] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:11] 10Cloud-Services: Please check long running nova queries and check those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415641 (10jcrespo) [14:34:32] 10Cloud-VPS, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3415654 (10Papaul) [14:34:40] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415655 (10jcrespo) [14:35:38] 10Cloud-VPS, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete , You can take over. Thanks. [14:52:25] 10Cloud-VPS, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl... 
[14:52:36] PROBLEM - Puppet errors on tools-prometheus-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:53:14] Cloud-VPS, Operations, ops-codfw, Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl...
[14:53:31] Cloud-VPS, Operations, ops-codfw, Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Ple...
[15:07:58] PROBLEM - Puppet errors on tools-prometheus-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:12:01] bd808: fyi, redis pubsub is still public and there's like no way to make it not public
[15:12:03] cloud-services-team, DBA, Operations, Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415739 (Halfak) Announcements have been updated. Thanks for the note. Shall we always announce a 1 hour maintenance window for DB maintenance?
[15:12:54] zhuyifei1999_: *nod* shared hosting is shared hosting at some point. We can close the easy doors
[15:13:13] sigh
[15:13:43] there were a lot of celery messages last time I got curious
[15:14:21] dunno if that's worth a security ticket, in case it can be used for an attack
[15:15:15] cloud-services-team, DBA, Operations, Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415746 (jcrespo) It varies from maintenance to maintenance, depending on the work to be done. Some take more some take less- the "normally" was meant as "N...
[15:16:28] 10Cloud-Services, 10Quarry: Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452#3415749 (10Halfak) My main concern with this kind of move would be preserving the basic functionality of Quarry in redash. E.g. permalinks to results, recent queries, user queries, pu... [15:20:32] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415757 (10Andrew) a:03Andrew [15:30:07] 10Tools, 10Commons: Zoomviewer is down - https://phabricator.wikimedia.org/T169864#3415787 (10chasemp) p:05Triage>03Normal [15:30:45] 10Tools, 10Commons: Zoomviewer is down - https://phabricator.wikimedia.org/T169864#3411054 (10chasemp) >>! In T169864#3411281, @zhuyifei1999 wrote: > Maintainer has been contacted [[https://commons.wikimedia.org/wiki/User_talk:Dschwen#Zoomviewer|via talk page]] on July 3rd. I priortized to `normal` since this... [15:30:53] 10Toolforge, 10Tools: Someone deleted folder with my bots on Tool Labs - https://phabricator.wikimedia.org/T169736#3415790 (10chasemp) p:05Triage>03Normal [15:31:08] 10Toolforge, 10Tools: Someone deleted folder with my bots on Tool Labs - https://phabricator.wikimedia.org/T169736#3407049 (10chasemp) @MaxBioHazard anything else needed here? [15:33:46] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Give visual feedback while Editcounter is thinking - https://phabricator.wikimedia.org/T169831#3409899 (10MusikAnimal) Left two comments on the code. Otherwise looks good to me! [15:35:39] 10Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3415799 (10chasemp) p:05Triage>03Normal Usually cgroup limits are per process but I'm not entirely sure how this shakes out here, can you run an experiment with fewer workers to see outcomes? What Tool is this? Wh... 
[15:39:46] 10cloud-services-team, 10DBA, 10Operations, 10Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415818 (10Halfak) [15:40:14] 10Data-Services, 10cloud-services-team (Kanban), 10DBA, 10MediaWiki-extensions-Babel, 10Patch-For-Review: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3108521 (10chasemp) Thanks @Marostegui you are a scholar and a gentlemen. Is this task done now? @Base can you confirm? [15:40:53] 10cloud-services-team, 10DBA, 10Operations, 10Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) [15:40:55] hey bd808 im unable to access any of the tools labs servers even though my public ssh key is set in my prefs is there a reason why? [15:41:03] 10cloud-services-team, 10DBA, 10Operations, 10Scoring-platform-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) Gotcha. Next time, we should add these details to the task description and I'll pick them up from there when making announcement. :) In... [15:42:24] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415824 (10Andrew) Those look to me like queries that are run by openstack-browser. Most of those queries are wrapped by upst... [15:43:00] 10Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3415826 (10Fnielsen) I do not know how to start it with few workers. Isn't the number 4 hardcoded into `/usr/lib/python2.7/dist-packages/toollabs/webservice/services/uwsgiwebservice.py`? The tool is this one: http://t... 
[15:43:32] nevermind im being stupid i forgot the shelluser@server [15:44:49] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415839 (10Andrew) btw, the openstack-browser tool first started running in late March. We have added on different stats grad... [15:47:11] 10Cloud-Services, 10DBA, 10Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3415841 (10Marostegui) Before starting any replication for s2 or s6 on labs servers, we need to either upgrade db1102 to 10.1 (which will require... [15:47:38] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415842 (10jcrespo) > 'TMAX' is the longest time the query ever took, in seconds? Yes. [15:51:46] 10Data-Services, 10cloud-services-team (Kanban), 10DBA, 10MediaWiki-extensions-Babel, 10Patch-For-Review: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3415846 (10Andrew) Thanks @Marostegui [15:57:45] PROBLEM - Puppet errors on tools-exec-1416 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:13:54] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:20:16] hi, im wondering is it possible to add an ssh key to a local user on an instance without it using ldap? [16:20:26] for example gerrit2?. 
[16:22:47] RECOVERY - Puppet errors on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [16:32:39] (03Draft1) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [16:32:42] (03PS2) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [16:41:14] PROBLEM - Puppet errors on tools-worker-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:43:18] 10Tools, 10Commons: Zoomviewer is down - https://phabricator.wikimedia.org/T169864#3415970 (10dschwen) Ok, noted. I'll investigate. [16:52:03] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3415991 (10Andrew) Ok -- so I would classify these queries as 'expected'. LMK if they're actively causing you trouble. [17:01:27] 10Cloud-Services: Please check long running nova queries and check if those should be running for 2 minutes, as it didn't use to happen before - https://phabricator.wikimedia.org/T169991#3416012 (10jcrespo) 05Open>03Resolved good enough for me. [17:05:58] 10Tools, 10Commons: Zoomviewer is down - https://phabricator.wikimedia.org/T169864#3416021 (10dschwen) Lame. The webservice was down. A simple webservice start brought it back up. Do I really have to put in a cronjob that kick the service once in a while? 
[17:16:12] RECOVERY - Puppet errors on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [17:25:34] 10Data-Services, 10DBA, 10Patch-For-Review: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#3416085 (10Umherirrender) a:03Umherirrender [17:25:52] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Address all the TODOs in the new XTools interface - https://phabricator.wikimedia.org/T169829#3416086 (10kaldari) 05Open>03Resolved a:03kaldari [17:38:10] (03PS4) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [17:40:39] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Optimize edit count queries in XTools - https://phabricator.wikimedia.org/T163284#3416116 (10kaldari) 05Open>03Resolved http://tools.wmflabs.org/xtools-ec/?user=Kaldari&project=en.wikipedia.org Old XTools: Executed in 34.51 second(s). · Taken 4.5 megabytes... [17:53:00] 10Cloud-Services, 10Cloud-VPS, 10cloud-services-team (Kanban), 10Operations, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3416130 (10chasemp) a:05Cmjohnson>03chasemp I'll try to take care of this in the am mon or tue [17:54:41] 10Data-Services, 10cloud-services-team (Kanban), 10DBA, 10MediaWiki-extensions-Babel, 10Patch-For-Review: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3108521 (10bd808) a:03Andrew [17:56:30] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416156 (10MusikAnimal) [18:00:09] 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416177 (10MusikAnimal) [18:01:20] 10Data-Services, 10cloud-services-team (Kanban), 10DBA, 10MediaWiki-extensions-Babel, 10Patch-For-Review: Replicate babel db table on 
Labs - https://phabricator.wikimedia.org/T160713#3416178 (10bd808) 05Open>03Resolved Checked with first alphabetic wiki from each shard: ``` $ sql enwiki (u3518@enwik... [18:03:54] 10Data-Services, 10cloud-services-team (Kanban), 10DBA, 10Patch-For-Review, and 2 others: Drop ukwikimedia from labsdb hosts (was: ukwikimedia still present on replicas dbs on labs hosts) - https://phabricator.wikimedia.org/T169488#3416186 (10bd808) 05Open>03Resolved The config change was synced out. `... [18:17:23] 10cloud-services-team (Kanban), 10Project-Admins, 10User-bd808: Rename and update Cloud Services Phabricator projects - https://phabricator.wikimedia.org/T167244#3416336 (10bd808) [18:20:44] 10Toolforge: Tool labs slow and kills application - https://phabricator.wikimedia.org/T169954#3416372 (10Fnielsen) On my Flask development server, Python3 is using 1.2 GB fairly constantly (somewhat more than I anticipated). I suppose that I run into memory problems. I wonder how I can get around this. [18:21:54] 10Cloud-Services, 10Toolforge, 10Documentation: Document labsdb replication set up - https://phabricator.wikimedia.org/T85868#956362 (10Bawolff) I've been writing some stuff at https://wikitech.wikimedia.org/wiki/Labsdb_redaction . Not sure if it should be merged with https://wikitech.wikimedia.org/wiki/Mari... [18:26:22] !log tools Forced puppet runs on tools-redis-* for security fix [18:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [19:17:20] !log wikilabels staging wikilabels-wmflabs-deploy:c386dac [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL [19:18:09] !log wikilabels deploying wikilabels-wmflabs-deploy:c386dac [19:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL [19:24:48] legoktm: you around? 
[19:53:54] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:56:12] Krenair: wikibugs isn't working, can you restart it please
[20:06:28] Zppix: hi
[20:07:08] Wikibugs has been down for approx 30 mins, can it be restarted
[20:07:13] yeah
[20:07:20] probably the tools-redis stuff killed it
[20:07:27] Ok
[20:07:27] Ty
[20:09:06] legoktm: crap. let me know if there is something that needs fixing on the redis side
[20:09:12] my trivial tests worked
[20:09:30] bd808: I assume you just restarted redis? I don't think the bot handles that well
[20:09:52] yeah. we patched the config which restarts the daemon
[20:10:00] Wikibugs: wikibugs test bug part II - https://phabricator.wikimedia.org/T90594#1062819 (Legoktm) test
[20:14:00] bd808: all looks fine now
[20:15:34] legoktm: awesome. I'll try to remember the wikibugs/redis entanglement in the future
[20:17:01] Cloud-Services, Tool-Labs-tools-Xtools, DBA, Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416711 (Matthewrbowker)
[20:17:35] Tool-Labs-tools-Xtools, Community-Tech: Add a slow query killer to new XTools - https://phabricator.wikimedia.org/T170013#3416681 (MusikAnimal) There's a query killer bash script for the old XTools that we might be able to use: ```lang=bash timeout=120 logfile="query-guard.log" dbhosts=(s1 s2 s3 s4 s5 s6...
[20:17:41] Cloud-Services, wikitech.wikimedia.org, Operations, RESTBase, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1358396 (Krinkle) >>! In T102178#3403100, @GWicke wrote: > @krinkle, your comment sounds like it might have been intended for {T1...
[20:18:19] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416156 (10Legoktm) uk.wikimedia.org is no longer hosted by Wikimedia and was removed from the database lists. Where are you getting the... [20:21:08] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416721 (10MusikAnimal) >>! In T170005#3416718, @Legoktm wrote: > uk.wikimedia.org is no longer hosted by Wikimedia and was removed from... [20:22:38] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416725 (10Matthewrbowker) meta_p.wiki: https://quarry.wmflabs.org/query/20104 [20:22:41] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p revision table is apparently private - https://phabricator.wikimedia.org/T170005#3416726 (10Legoktm) OK, it just needs to be removed from meta_p. 
[20:22:54] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416727 (10Legoktm) [20:23:15] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416731 (10Marostegui) This is probably related: T169488 @bd808 might be able to help here [20:29:31] 10Cloud-Services, 10Tool-Labs-tools-Xtools, 10DBA, 10Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416740 (10MusikAnimal) [20:34:07] 10Data-Services, 10Tool-Labs-tools-Xtools, 10cloud-services-team (Kanban), 10Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416766 (10bd808) p:05Triage>03Normal [20:34:46] 10Data-Services, 10Toolforge, 10cloud-services-team (Kanban): 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774#3416770 (10chasemp) [20:36:22] 10Data-Services, 10Toolforge, 10cloud-services-team (Kanban): 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774#3408084 (10chasemp) ```On 2017-07-02 some users experienced a loss of data in their projects. We estimate 126 Tools out of 1,766 saw at least one fil... [20:41:01] 10Data-Services, 10Tool-Labs-tools-Xtools, 10cloud-services-team (Kanban), 10Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416792 (10bd808) To clean this up, we should run `maintain-meta_p --all-databases --purge` on: * labsdb1001 *... [20:42:12] hmm, labs just had a blips in the dns. Checks were reporting packet loss when pings were done. 
[20:42:52] ping packet loss and DNS are unrelated
[20:43:26] I looked at the time it was reported and didn't have an issue, but I'm not sure what is being reported entirely
[20:43:31] oh
[20:43:52] we had these
[20:43:53] [21:38:20] PROBLEM - ping4 on ores-worker-09 is WARNING: PING WARNING - Packet loss = 0%, RTA = 104.37 ms
[20:43:56] started coming
[20:43:59] Data-Services, Tool-Labs-tools-Xtools, cloud-services-team (Kanban), Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416823 (chasemp) a:chasemp I'm on duty, I'll run this
[20:43:59] then they recovered
[20:44:02] same for -releng too
[20:44:25] that could mean any number of things, but mostly just means the shinken instance had trouble reaching instances in a few projects
[20:44:28] that says 0% packet loss. it is complaining about the time the response took to come back
[20:44:52] ah ok.
[20:44:55] thanks for explaining
[20:45:09] which could be caused by a large number of things including load on the source or target box
[20:45:30] thanks.
[20:46:27] if you are running icinga checks on a large number of hosts from one box you will soon discover that icinga's data collection model is fragile and requires a disproportionate amount of resources from the command and control box
[20:46:54] bd808 we're only running checks on our own project and ores' labs instances
[20:47:15] this is one of the reasons we are migrating to prometheus
[20:47:18] our source is not the issue, the load is almost nothing
[20:48:11] (CR) Paladox: "> still per my comment on PS2, the pub part goes in the public repo," [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[20:48:32] (CR) Paladox: "Following how it was done for phabricator." [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[20:48:53] Zppix: ok.
I assume you checked the point-in-time process count and network queue depth around 21:38:20 when the alert in question was seen
[20:50:14] I checked about 21:38:25 so yea
[20:50:35] using check_load
[20:55:32] Zppix 21:38:25 would have been yesterday your time.
[20:55:33] that's today my time
[20:55:34] also he was referring to graphite I think
[20:55:54] but there's no big deal with this. Just wanted to report it just in case something big happened.
[20:56:02] Anyways it's resolved.
[20:57:38] bd808 quick question: Wiki-AI wants to send celery logs and events to logstash for prod and labs, how would we implement this functionality?
[20:59:22] talk to someone who still works with logstash :) but generally you'd need to add a python log config that routes the messages to the proper logstash collector server and then add rules there via Puppet to process and store the messages
[20:59:55] Data-Services, Tool-Labs-tools-Xtools, cloud-services-team (Kanban), Community-Tech-Sprint: ukwikimedia_p needs to be removed from meta_p table - https://phabricator.wikimedia.org/T170005#3416853 (chasemp) Open>Resolved ```mysql:root@localhost [(none)]> use meta_p; Database changed mysql:...
[21:01:07] Zppix: there are some bread crumbs to follow in the config that striker uses to log to the prod logstash server. Let me see if I can point you to that...
[21:02:29] Zppix: this is one way to configure Python -- https://phabricator.wikimedia.org/source/labs-striker/browse/master/striker/settings.py;9ce253f7346183fc8296162337bbd2556a4de1c4$72-81
[21:02:47] (CR) Dzahn: ">What do you mean that the private key goes into the secret?" [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[21:03:43] (CR) Dzahn: ">Following how it was done for phabricator." [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox)
[21:03:47] (CR) Paladox: "> >What do you mean that the private key goes into the secret?"
[labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox) [21:04:42] Tools, Commons: Zoomviewer is down - https://phabricator.wikimedia.org/T169864#3416863 (chasemp) >>! In T169864#3416021, @dschwen wrote: > Lame. The webservice was down. A simple webservice start brought it back up. Do I really have to put in a cronjob that kicks the service once in a while? The `webser... [21:04:48] (CR) Dzahn: ">storing both keys under secret" [labs/private] - https://gerrit.wikimedia.org/r/363755 (owner: Paladox) [21:05:12] Data-Services, Toolforge, cloud-services-team (Kanban): 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774#3416869 (chasemp) [21:05:17] Toolforge, Tools: Someone deleted folder with my bots on Tool Labs - https://phabricator.wikimedia.org/T169736#3416866 (chasemp) Open>Resolved a:chasemp Closing but let me know if something doesn't work out here [21:08:46] thanks bd808 [21:11:17] cloud-services-team (Kanban), wikitech.wikimedia.org: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3416877 (chasemp) p:Triage>Normal [21:11:43] cloud-services-team (Kanban), wikitech.wikimedia.org: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3409654 (chasemp) I have greatly appreciated the ability to search this way, as we are reorganizing docs especially. [21:21:25] cloud-services-team (Kanban), wikitech.wikimedia.org, Patch-For-Review: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3416917 (Krinkle) [21:21:27] cloud-services-team (Kanban), wikitech.wikimedia.org, Patch-For-Review: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3416918 (Legoktm) >>! In T169820#3411660, @bd808 wrote: >>>! In T169820#3410852, @Legoktm wrote: >> Is Special:Search really not sufficient? > > In theory...
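Note on the logstash question from 20:57:38: the "log config that routes the messages to the proper logstash collector server" described at 20:59:22 could look roughly like the sketch below. This is not the Striker configuration linked at 21:02:29. The handler class name, the host `logstash.example.org`, the port, and the `type` tag are all hypothetical placeholders, and the sketch assumes the collector accepts one JSON document per UDP datagram. The Puppet-side rules that process and store the messages are not shown.

```python
import json
import logging
import logging.config
import logging.handlers


class JsonUdpHandler(logging.handlers.DatagramHandler):
    """Hypothetical handler: one JSON document per UDP datagram."""

    def makePickle(self, record):
        # DatagramHandler calls makePickle() to serialize each record;
        # overriding it replaces the default pickle payload with JSON.
        payload = {
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
            "type": "celery",  # assumed tag for collector rules to match on
        }
        return json.dumps(payload).encode("utf-8")


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "logstash": {
            "()": JsonUdpHandler,            # custom handler factory
            "host": "logstash.example.org",  # placeholder collector host
            "port": 11514,                   # placeholder UDP port
        },
    },
    "loggers": {
        # Route everything logged under the "celery" logger to the collector.
        "celery": {"handlers": ["logstash"], "level": "INFO"},
    },
}

# Installing the config does not open a socket; that happens on first emit.
logging.config.dictConfig(LOGGING)
```

UDP is used here only because it keeps the sketch within the standard library and never blocks the worker; whether the real collector listens for UDP, TCP, or something else would need to be confirmed with whoever maintains it.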
[21:26:34] cloud-services-team (Kanban), wikitech.wikimedia.org, Patch-For-Review: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3416942 (bd808) >>! In T169820#3416918, @Legoktm wrote: > A tool usable by all tool labs users that did wikitech-grep would be pretty welcome I think. Agr... [21:28:02] legoktm: ^ hope that helps explain my thinking here. A tool that tries to do this via the CirrusSearch API would be swell too, but I just don't have time to work on it [21:29:30] PROBLEM - High iowait on tools-webgrid-lighttpd-1421 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1421.cpu.total.iowait (>11.11%) [21:29:48] I suppose. I think we're likely to fall in the same trap that mwgrep created though :( [21:39:31] RECOVERY - High iowait on tools-webgrid-lighttpd-1421 is OK: OK: All targets OK [21:46:10] PROBLEM - Puppet errors on tools-worker-1005 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [21:54:53] cloud-services-team (Kanban), wikitech.wikimedia.org, Patch-For-Review: Add `wikitech-grep` to puppet - https://phabricator.wikimedia.org/T169820#3417004 (chasemp) @legoktm 's objections about this being an admin only tool made me stop and think for a while. I agree the further from the every-user ex...
[21:56:57] thanks legoktm, your input gave me great pause and I thought about it for a while; careful line I think [22:26:08] RECOVERY - Puppet errors on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [22:34:04] Toolforge: Tool Labs node on gridengine - https://phabricator.wikimedia.org/T166830#3417105 (Krinkle) [22:41:06] Toolforge: Node.js on gridengine - https://phabricator.wikimedia.org/T166830#3417130 (Luke081515) [22:42:15] Cloud-Services, DBA, Wikimedia-Site-requests: Reduce watchlist_count threshold - https://phabricator.wikimedia.org/T150548#3417152 (Dispenser) ```lang=sql /* Most watched unstarted RfAs */ SELECT SUBSTRING(wl_title, 24, 50) AS RfA_subpage, Watchers FROM watchlist_count LEFT JOIN page ON wl_namespace=... [22:44:53] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:47:09] PROBLEM - Puppet errors on tools-worker-1005 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:58:47] Tool-Labs-tools-Xtools, Community-Tech: Add a slow query killer to new XTools - https://phabricator.wikimedia.org/T170013#3417193 (bd808) > (since we're running on our own limited VM rather than on Tool Labs) You probably have more resources on your VM than you would have had in Toolforge, but +1 for bei... [23:10:28] (PS1) Krinkle: Use MediaWiki-like escaping for section link [labs/tools/guc] - https://gerrit.wikimedia.org/r/363979 (https://phabricator.wikimedia.org/T165549) [23:12:29] (Merged) jenkins-bot: Use MediaWiki-like escaping for section link [labs/tools/guc] - https://gerrit.wikimedia.org/r/363979 (https://phabricator.wikimedia.org/T165549) (owner: Krinkle) [23:19:54] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0] [23:27:09] RECOVERY - Puppet errors on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]