
I want to create a sample DataFrame, based on a JSON template, that looks as realistic as possible. That is why I am aiming for a normal distribution.

This is what I have tried:

    import json, random
    import pandas as pd

    sample_data = """{"product1":[
        {"category":"Fruits", "productlist":["Bell Peppers","Red Chillies", "Onions", "Tomatoes"]}
    ],
    "product2":[
        {"category":"Vegetables", "productlist":["Apple","Mango","Banana"]}
    ]}"""

    products = json.loads(sample_data)

    colHeaders = []
    for k, v in products.items():
        colHeaders.append(v[0]['category'])

    df = pd.DataFrame(columns=colHeaders)

    for i in range(1000):
        itemlist = []
        for k, v in products.items():
            itemlist.append(random.choice(v[0]['productlist']))
        #print(itemlist)
        df.loc[len(df)] = itemlist

    print(df)

I am not sure whether I am doing this correctly. If not, please help me with:

  • How to check if the data frame rows represent a normal distribution?
  • How to try other distributions in this case?

Other related Stack Overflow questions I have referred to are:

1 Answer


I think you should generate normally distributed integers and use them as indices into the list. In my opinion, graphing the numbers you generated is the best way to check whether they follow a normal distribution: the plot should resemble the bell shape. However, since 20 is such a small number, the plot may not match that shape exactly, which is something to keep in mind. I think the following link has all the information you need.

How to generate a random normal distribution of integers
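As a rough illustration of the idea above, here is a minimal sketch that draws list indices from a rounded, clipped normal distribution and prints a text histogram in place of a graph. It uses only the standard library rather than the approach in the linked question; the helper name `normal_index`, the `spread` parameter, and the seed are assumptions made for this example.

```python
import random
from collections import Counter

def normal_index(n, rng, spread=4.0):
    """Return an int in [0, n - 1] drawn from a rounded, clipped normal.

    The mean sits at the middle of the index range; `spread` is a
    hypothetical knob that controls how narrow the bell is.
    """
    mu = (n - 1) / 2
    sigma = n / spread
    value = round(rng.gauss(mu, sigma))
    # Clip so out-of-range draws land on the first or last index.
    return min(max(value, 0), n - 1)

rng = random.Random(42)  # fixed seed so the run is repeatable
productlist = ["Bell Peppers", "Red Chillies", "Onions", "Tomatoes"]

indices = [normal_index(len(productlist), rng) for _ in range(1000)]
counts = Counter(indices)

# Crude text histogram: one '#' per 10 occurrences.
for i, name in enumerate(productlist):
    print(f"{name:<13} {'#' * (counts[i] // 10)}")
```

With this setup the middle items should appear more often than the first and last ones. Note that the bell shape only reflects the order of the list, so it is meaningful only if that order means something.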


4 Comments

productlist = ["Apple","Mango","Banana"] can be treated as productlist = [0,1,2], but wouldn't the rest of the logic still remain the same?
Not sure what you mean here, but I think your way works just fine; I'm just not sure it would generate a normal distribution. If you do decide to use a different random generator, your code would change like this: for k,v in products.items(): itemlist.append(v[0]['productlist'][random_integer]). The only problem is that it generates random integers for the two lists separately, meaning you have two rounded distributions for the ranges (0,3) and (0,4), if that is indeed what you wanted.
From what you are suggesting, random_integer will be normally distributed, but not the values in the product list. The idea is for the product lists appended to the DataFrame (as rows) to look like real occurrences.
Well, since your product list contains unordered items, there is no real way to have a normal distribution over them. What I'm describing is only useful if the lists are ordered. From what I can tell, any random function will do what you want, especially since the example lists are so small anyway.
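One way to act on the last comment is to skip the normal distribution and give each item an explicit relative frequency, which also covers the "other distributions" part of the question. The sketch below is only an assumption about what "looks real" might mean: the weights are invented for illustration, the `catalog` layout and the `sample_rows` helper are hypothetical, and the categories are paired with the items in the usual way, which differs from the sample JSON in the question.

```python
import random
from collections import Counter

# Hypothetical weights: how often each item should appear relative
# to the others in its category (not taken from the question).
catalog = {
    "Fruits": {"Apple": 5, "Mango": 3, "Banana": 2},
    "Vegetables": {"Bell Peppers": 1, "Red Chillies": 1, "Onions": 4, "Tomatoes": 4},
}

def sample_rows(catalog, n_rows, rng):
    """Build n_rows dicts, one weighted random item per category."""
    columns = {}
    for category, weighted_items in catalog.items():
        items = list(weighted_items)
        weights = list(weighted_items.values())
        # random.choices draws with replacement according to the weights.
        columns[category] = rng.choices(items, weights=weights, k=n_rows)
    # Transpose the per-category columns into row dictionaries.
    return [dict(zip(columns, values)) for values in zip(*columns.values())]

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = sample_rows(catalog, 1000, rng)

# Compare observed frequencies with the intended weights.
fruit_counts = Counter(row["Fruits"] for row in rows)
print(fruit_counts.most_common())
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`, which also avoids appending to the DataFrame one row at a time.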
