I am new to using the word2vec model, and I do not know how to prepare my dataset as input for word2vec. I have searched a lot, but the datasets in the tutorials were in CSV format or a single txt file, whereas my dataset has this structure: two folders, one for blood cancer and one for breast cancer, and each folder contains 1000 txt files of 40 sentences each. I have no idea how to create a vocabulary from this as input for a word2vec model in Keras with a TensorFlow backend. I am using Python 3.5 on Ubuntu 17.10. Any guidance will be appreciated.

1 Answer

I have already searched for a solution and found the following approach, which works for this kind of dataset: first, merge the two folders into a single folder, then apply the code below:

import os
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname  # path to the folder that holds all the txt files

    def __iter__(self):
        # yield one whitespace-tokenised sentence per line, file by file
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('./DatasetBinaryClassClassification/alldataset/')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences, size=200, window=10, min_count=2, workers=5)
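If you prefer not to physically merge the two class folders, a small variation of the same iterator can walk both directories and feed them to gensim in one pass. This is only a sketch; the paths blood_cancer and breast_cancer below are placeholders for your actual folder names:

import os
import gensim

class TwoFolderSentences(object):
    """Yield tokenised sentences from every txt file in a list of folders."""
    def __init__(self, dirnames):
        self.dirnames = dirnames  # e.g. the blood-cancer and breast-cancer folders

    def __iter__(self):
        for dirname in self.dirnames:
            for fname in os.listdir(dirname):
                with open(os.path.join(dirname, fname)) as f:
                    for line in f:
                        yield line.split()  # one sentence per line, whitespace-tokenised

# placeholder paths -- replace with your real folder names
sentences = TwoFolderSentences(['./Dataset/blood_cancer/', './Dataset/breast_cancer/'])
model = gensim.models.Word2Vec(sentences, size=200, window=10, min_count=2, workers=5)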

I found the MySentences class in a gensim word2vec tutorial.
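The code above only trains the gensim model; if you also want to use the result in Keras with the TensorFlow backend, as asked in the question, one possible approach is to copy the learned vocabulary and vectors into a frozen Embedding layer. This is only a sketch, assuming gensim 3.x and Keras 2.x, where model is the Word2Vec object trained above:

import numpy as np
from keras.layers import Embedding

# `model` is the gensim Word2Vec model trained above.
vocab = model.wv.index2word                      # list of words; list index = integer id
word2id = {word: i for i, word in enumerate(vocab)}

# Copy the learned vectors into a (vocab_size x vector_size) matrix.
embedding_matrix = np.zeros((len(vocab), model.vector_size))
for i, word in enumerate(vocab):
    embedding_matrix[i] = model.wv[word]

# Frozen Keras embedding layer mapping integer word ids to word2vec vectors.
embedding_layer = Embedding(input_dim=len(vocab),
                            output_dim=model.vector_size,
                            weights=[embedding_matrix],
                            trainable=False)

# Sentences then need to be converted to sequences of ids before being fed
# to a Keras model, e.g. [word2id[w] for w in sentence if w in word2id].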

I hope it is helpful.
