
array(['Ruby on Rails', 'Ruby', 'AWS DynamoDB', 'Python', 'MySQL', 'Swift', 'Android', 'iOS', 'JavaScript', 'React Native', 'ReactJS', 'TypeScript', 'Vue.js', 'Webpack', 'Amazon Web Services(AWS)', 'Kubernetes', 'PHP', 'CI/CD', 'Java', 'C#', 'C', 'Node.js', 'REST API', 'Go', 'Redux-Saga', 'Redux.js', 'Babel', 'GraphQL', 'Tensorflow', 'PyTorch', 'Jenkins', 'Spring', 'Django', 'Git', 'AWS EC2', 'CSS', 'HTML', 'MongoDB', 'Docker', 'Scala', 'SQL', 'Embedded System', 'NLP', 'Apache', 'Kotlin', 'Angular', 'jQuery', 'C++', 'RxJS', 'AngularJS', 'Redis', 'Next.js', 'NoSQL', 'GCP(Google Cloud Platform)', 'Elasticsearch', 'OpenStack', 'JPA(Java Persistent API)', 'TCP/IP', 'Objective-C', 'Realm', 'Firebase', 'Ajax', 'Linux', 'PostgreSQL', 'ES6', 'AWS Lambda', 'HTML5', 'AWS S3', 'GitHub', 'RxSwift', 'Terraform', 'AWS EKS', 'AWS RDS', 'Microsoft Azure', 'Sass(SCSS)', 'CodeIgniter', 'Flask', 'Nuxt.js', 'Ansible', 'Spring Boot', 'Linux kernel', 'Apache Kafka', 'Deep Learning', 'Nginx', 'ActionScript', 'OOP', 'Shell', 'gulp', 'Celery', 'SQLAlchemy', 'ExpressJS', 'RxJava', 'Apache Spark', 'WebGL', 'OpenGL', 'Machine Learning', 'MSSQL(Microsoft SQL Server)', 'Database', 'styled-components', 'MVC', 'Retrofit', 'Machine Vision', 'Oracle', 'web3.js', 'R', 'AWS ElasticBeanstalk', 'Elastic Stack', 'Laravel', 'ASP.NET', 'Aurora DB', 'Redux-Observable', '.NET', 'AWS Backup', 'AWS CloudWatch', 'Kibana', 'Fluentd', 'Logstash', 'JSP', 'Bootstrap', 'Datadog', 'Rust', 'Azure', 'Apache Hadoop', 'AWS X-Ray', 'Memcached', 'Jest', 'Mocha', 'DRF(Django REST framework)', 'Spring Cloud', 'Data Analysys', 'Big Data', 'GitLab', 'Gradle', 'SQLite', 'Microsoft IIS', 'Unity', 'Electron', 'MariaDB', 'mSQL', 'gensim', 'Scikit-Learn', 'AWS Simple Queue Service(AWS SQS)', 'gRPC', 'Naver Cloud Platform', 'Ubuntu', 'Microservice Architecture', 'Apache ActiveMQ', 'Oracle Database', 'Apache Subversion(SVN)', 'Apache Tomcat', 'Red Hat Ceph Storage', 'Puppeteer', 'OpenLayers', 'Vuex', 'Less.js', 'JIRA', 'Keras', 'NCP(Naver Cloud Platform)', 'NestJS', 'PKI(Public key infrastructure)', 'AWS ECS', 'Hibernate', 'UML', 'BitBucket', 'Arduino', 'Raspberry Pi', 'RabbitMQ', 'Capistrano', 'Bamboo', 'MVP', 'OkHttp', 'Cocos2d', 'Ethereum', 'Blockchain', 'DSP(Digital Signal Processing)', 'D3.js', 'Cocoa', 'Axios', 'Ionic', 'WPF', 'AWS IAM', 'Shell Script', 'Responsive Web', 'Canvas', 'ThreeJS', 'Apache ZooKeeper', 'Pandas', 'Spring Batch', 'JUnit', 'Spring Data JPA', 'ASP', 'Grunt', 'WordPress', 'MyBatis', 'AWS ElastiCache', 'Apache HTTP Server', 'AWS Security Hub', 'Google API', 'Qt', 'CAD', 'GatsbyJS', 'PostCSS', 'Socket.IO', 'Backbone.js', 'Azure Linux Virtual Machines', 'Heroku', 'CUDA', 'IOCP', 'Unix', 'CocoaPods', 'MVVM(Model-View-ViewModel)', 'Google Firebase Crashlytics', 'Google Cloud Platform', 'Windows kernel', 'OpenCV', 'Unreal Engine', 'Google Cloud SDK', 'RxAndroid', 'Windows Embedded', 'Entity Framework', 'Packer', 'Nexus', 'Consul', 'Selenium', 'Jekyll', 'XML', 'Dependency Lookup', 'RxKotlin', 'Expo', 'Sketch', 'InVision', 'Azure Text Analytics', 'Google Dialogflow', 'Google Cloud Natural Language'], dtype=object)

Say I have this array of technology names. If I want to group them by similarity, is there a pretrained language model that makes this job easier?

For example, PyTorch and TensorFlow should end up in one group, because most deep learning people use either PyTorch or TensorFlow.


3 Answers


There will not be any pre-trained models to cluster these words.

In fact, in order to build your own clustering model you will need more metadata about each observation/word in your array.

At the moment any model would only be able to "see" the name of the package/software in your array. So the best you could hope for is a model that clusters these words based on their spellings.

Now suppose you find a brief description of each piece of software; then you could do a bit more. With this longer text, you could use supervised or unsupervised methods to cluster the software into groups based on similar words in the descriptions (see topic models, k-means, etc.).
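For instance, here is a minimal sketch of that idea with scikit-learn, assuming you have gathered a one-line description for each technology yourself (the descriptions and the number of clusters below are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder descriptions; in practice you would write or scrape these.
descriptions = {
    "PyTorch": "open source deep learning framework with tensors and autograd",
    "Tensorflow": "deep learning library for training neural networks",
    "MySQL": "relational database management system using SQL",
    "PostgreSQL": "open source relational database with SQL support",
}

names = list(descriptions.keys())

# Turn each description into a TF-IDF vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions.values())

# Cluster the description vectors; the number of clusters is your choice.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)

for name, label in zip(names, labels):
    print(label, name)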

Long story short, there is not a pre-built model to do this, and to build one yourself you're going to need more information about each observation.


Using spaCy's pretrained XLNet model, I got some interesting similarities. I used this model because it has been trained on a large-scale corpus, which gives it a decent chance of having seen these domain-specific terms in the first place. But as @yohanes-alfredo points out, the similarities will only be meaningful if the data the model was trained on is specific to your domain, and given how specific that domain is, you are quite unlikely to find what you're looking for out of the box.
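Roughly, the computation looks like the following (the model package name is an assumption and depends on your spacy-transformers version; the pattern is simply to embed every term and compare the document vectors):

import spacy
from itertools import combinations

# Assumed spaCy XLNet package name; adjust to whatever model you have installed.
nlp = spacy.load("en_trf_xlnetbasecased_lg")

terms = ["RxSwift", "RxKotlin", "PyTorch", "Tensorflow", "MySQL", "HTML"]
docs = {term: nlp(term) for term in terms}

# Doc.similarity returns the cosine similarity of the two document vectors.
pairs = sorted(
    ((a, b, docs[a].similarity(docs[b])) for a, b in combinations(terms, 2)),
    key=lambda pair: pair[2],
    reverse=True,
)
for a, b, sim in pairs:
    print(f"{a} ~ {b} -> {sim}")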

Top 20 most similar:

RxSwift ~ RxKotlin -> 0.9946681187244761
RxJS ~ RxJava -> 0.994435118373502
GCP(Google Cloud Platform) ~ NCP(Naver Cloud Platform) -> 0.9940396059556209
AWS EC2 ~ AWS ECS -> 0.9938487260023671
PKI(Public key infrastructure) ~ AWS ElastiCache -> 0.9935761603450172
GCP(Google Cloud Platform) ~ DSP(Digital Signal Processing) -> 0.9933073878775022
Elasticsearch ~ Elastic Stack -> 0.9931518747103965
AWS EC2 ~ AWS S3 -> 0.993027042898852
PKI(Public key infrastructure) ~ DSP(Digital Signal Processing) -> 0.9929990959502415
Vue.js ~ Node.js -> 0.9929247105854434
AWS Lambda ~ AWS Security Hub -> 0.9929128319080022
NCP(Naver Cloud Platform) ~ DSP(Digital Signal Processing) -> 0.9929057627676084
SQLite ~ mSQL -> 0.9926240872662073
DRF(Django REST framework) ~ DSP(Digital Signal Processing) -> 0.9925413265716948
Vue.js ~ Nuxt.js -> 0.9924630128845782
Node.js ~ Nuxt.js -> 0.9921168820574993
jQuery ~ SQLAlchemy -> 0.9920563002593914
GCP(Google Cloud Platform) ~ PKI(Public key infrastructure) -> 0.9919737922646894
NCP(Naver Cloud Platform) ~ PKI(Public key infrastructure) -> 0.9918400179111244
AWS EKS ~ AWS RDS -> 0.9916245806804818

Top 20 least similar:

HTML ~ Redux-Observable -> 0.816905641379557
HTML ~ Apache ZooKeeper -> 0.8168242467737211
HTML ~ CocoaPods -> 0.8166225352469929
HTML ~ GitHub -> 0.8161220718380207
HTML ~ AngularJS -> 0.8158469544232246
Ruby ~ WordPress -> 0.8154201995919186
Node.js ~ HTML -> 0.813844641023879
HTML ~ GatsbyJS -> 0.8132731528363577
Vue.js ~ HTML -> 0.8119781923186943
Blockchain ~ WordPress -> 0.8116576909602091
HTML ~ Sass(SCSS) -> 0.811630090065908
HTML ~ Nuxt.js -> 0.8109416869322942
Firebase ~ WordPress -> 0.8097501937747014
Java ~ WordPress -> 0.8093956474318938
Linux ~ WordPress -> 0.8089260791066558
Laravel ~ WordPress -> 0.8082594491867635
Redux.js ~ HTML -> 0.8079252976547562
WordPress ~ InVision -> 0.8077375338404283
HTML ~ DRF(Django REST framework) -> 0.8070652034839725
MySQL ~ WordPress -> 0.8070553577302257

My take on it: first of all, all these terms have rather high similarities, probably because they all belong to the same generic domain. Second, the top similarities seem to be heavily influenced by string similarity. Indeed, if you look at the nearest neighbours, they don't seem to be that related to the concept itself. For instance, here are the top 5 neighbours of "CI/CD":

CI/CD ~ CI/CD -> 1.0000001314622171
CI/CD ~ CUDA -> 0.9898242832223353
CI/CD ~ gensim -> 0.9896900420172526
CI/CD ~ SQLite -> 0.9890342868864949
CI/CD ~ mSQL -> 0.9883379829169077

However, it's a start. Also, spaCy offers different models that you could experiment with.


Without any context about the language, that is nearly impossible to do. With only the names, about the best you could do is compute Levenshtein distances and build clusters from those; a minimal sketch follows below. Think of it like this: a person without any prior knowledge of software engineering would not be able to do the task you are asking for either. How could anyone know that PyTorch and TensorFlow are related without prior context?
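A minimal sketch of that name-only route, assuming the python-Levenshtein package is installed (the example names, the lowercasing, and the distance threshold are arbitrary choices for illustration):

import numpy as np
import Levenshtein
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = ["PyTorch", "Tensorflow", "RxSwift", "RxKotlin", "MySQL", "mSQL"]

# Pairwise edit distances between the raw names.
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = Levenshtein.distance(names[i].lower(), names[j].lower())

# Hierarchical clustering on the precomputed distance matrix;
# t is the cut-off distance that decides how many groups you get.
labels = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="distance")

for name, label in zip(names, labels):
    print(label, name)

As you can see, this will happily put MySQL and mSQL together, but it has no way of knowing that PyTorch and TensorFlow belong to the same group.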

Another way you could do this is to train embeddings in an unsupervised way on texts from Stack Overflow, then extract the embeddings for those words, compute similarities/distances, and use a clustering method to generate the groups, as sketched below.
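A rough sketch of that pipeline using gensim's Word2Vec (gensim >= 4.0 assumed; the tiny corpus below is a placeholder for tokenized Stack Overflow posts):

from gensim.models import Word2Vec

# In practice this would be millions of tokenized Stack Overflow sentences.
corpus = [
    ["we", "trained", "the", "model", "with", "pytorch", "instead", "of", "tensorflow"],
    ["tensorflow", "and", "pytorch", "are", "deep", "learning", "frameworks"],
    ["store", "the", "rows", "in", "mysql", "or", "postgresql"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, epochs=50)

# With a real corpus, related technologies should end up close together.
print(model.wv.most_similar("pytorch", topn=3))

The vectors in model.wv could then be fed to k-means or hierarchical clustering to form the final groups.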

  • They know they can't do it out of the box; that's why they're asking whether there are pre-trained models that can do it. Commented Dec 9, 2019 at 11:01
  • The answer from my side is that it needs at least training of embeddings on the specific topics (or at least related ones), otherwise it is impossible. Language needs context. Commented Dec 9, 2019 at 11:23
  • Indeed, your answer is fair. Commented Dec 9, 2019 at 11:45
