How is training data generated

Question

I am very new to machine learning.

I am curious as to how researchers, companies, and academics get training data for their research.

Do they pay users to label the data? If so, how do they guarantee its accuracy?

Do graduate students (on the PhD track) do it for their professors? (This is meant to be a joke.)

Yes, grad students do it for their professors... or more accurately, grad students get their own undergrad research assistants to do it for them ;-)
– Brandon Loudermilk, Commented Mar 24, 2016 at 18:52

dmb · Accepted Answer · 2016-03-24 15:45:59Z

'Training' data usually comes from splitting data you have already collected into training and test sets. For example, if you want to build a classifier for handwritten numbers, you collect thousands of samples of handwritten numbers, like those in the MNIST database. When you think you have enough data to build a model, you split it into train and test sets (usually by randomly assigning individual samples to one group or the other at a specific ratio).

I think your confusion lies in the idea of collecting a 'training' set first, as if it were truly independent of the test set. When collecting handwritten numbers, the researchers did not say, "Well, we have 10,000 samples, so let's build a model with all 10,000 and then run it on future data sets that we have not collected yet." In fact, that strategy is risky: with no held-out data, you cannot tell whether the model has overfit.

What you would do is take those 10,000 samples and split them, say 7,000 for training a model and 3,000 for testing it. You might also build many models on different random 7,000/3,000 splits and average their test accuracy. Then you can say, "Our model predicts our test set with an accuracy of 97%, so we think it will work well on data we have not yet collected."
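The repeated random split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a toy threshold classifier, and the helper names (`split`, `fit_threshold`, `accuracy`) are hypothetical. It averages test accuracy across splits, which is one common reading of the idea; it does not average model parameters.

```python
import random
from statistics import mean

def split(samples, train_fraction, rng):
    """Shuffle a copy of the samples and cut it into train and test lists."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Toy model: the midpoint between the mean feature value of each class."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, test):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)

rng = random.Random(0)
# Synthetic labeled data (an assumption): class 0 centered at 0.0, class 1 at 3.0.
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(5000)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(5000)]

scores = []
for _ in range(20):
    # Each round fits on a fresh random 7,000 and evaluates on the other 3,000.
    train, test = split(data, 0.7, rng)
    scores.append(accuracy(fit_threshold(train), test))

mean_accuracy = mean(scores)
print(f"mean test accuracy over {len(scores)} splits: {mean_accuracy:.3f}")
```

The important property is that the model never sees the 3,000 test samples while it is being fit, so the reported accuracy is an estimate of how it behaves on data it has not seen.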

How you collect that initial data set is specific to the process you are trying to understand. Maybe it's clicks on a website, images from a satellite, or electrical recordings from an ensemble of neurons. Sometimes you pay money to collect data, for example through a census or survey, or even by buying another company that has collected user data you want. Typically, though, data collection is inherent to what you are already doing, and you use statistical methods to build models and make inferences about your population of interest.

I meant: how is the data in the training set classified? Is it done manually?
– user462455, Commented Mar 24, 2016 at 18:26
Ah, I see. That wasn't clear. Sometimes it is done manually: you would have to label Spam/Not Spam yourself if you were the first person trying to build that classifier. Other times the labels come with the collection. You could have the number of hours of Netflix watched per month and which customers canceled, so your supervised learning algorithm already has labels. Or you might use something unsupervised, such as k-means clustering or a related method, which does not need labels.
– dmb, Commented Mar 24, 2016 at 18:35

Nitesh · Accepted Answer · 2016-03-25 00:36:45Z

Data typically exists. What typically does not exist is ground truth (in the case of classification). Such ground truth is usually collected manually, and crowdsourcing plays an important role.

For example, think about the face recognition that Facebook does. Before automatic tagging was available, Facebook allowed users to add tags manually, which created a set of labeled data.

A more general way of doing this is through Amazon Mechanical Turk (Amazon's crowdsourcing marketplace). See the tasks listed there; some of them are clearly about manually generating labels that will later form the basis of a learning system.

Most research in academia creates methods and demonstrates how well they work on existing datasets. However, when a new company launches, for example, a fraud detection platform, it has to label transactions manually as fraudulent or not. Sometimes this is done when a report comes in from a customer, and sometimes by analysts inspecting transactions by hand.

As you would imagine, there has been a lot of academic interest in understanding the quality of results obtained through crowdsourcing, and it continues to be an active area of research.

But you should be careful when using crowdsourced labeling such as from Mechanical Turk, as some workers will try to game the system by completing the assigned task as fast as possible rather than carefully following the task instructions. You can help prevent this by interspersing easy tasks with known answers and then rejecting data from workers who do not answer these correctly.
– Brandon Loudermilk, Commented Mar 24, 2016 at 19:08
Precisely! And that's an active area of research in machine learning as well... the paper I link alludes to that.
– Nitesh, Commented Mar 24, 2016 at 20:01
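
The known-answer check suggested in the comments above could look roughly like the following. This is a minimal sketch, not how Mechanical Turk itself works: the task IDs, worker names, and the 0.8 threshold are all made-up assumptions, and `gold_accuracy` and `filter_workers` are hypothetical helpers.

```python
def gold_accuracy(answers, gold):
    """Share of known-answer (gold) tasks that this worker answered correctly."""
    checked = [task for task in gold if task in answers]
    if not checked:
        return 0.0  # a worker who saw no gold tasks cannot be verified
    return sum(answers[task] == gold[task] for task in checked) / len(checked)

def filter_workers(submissions, gold, min_accuracy=0.8):
    """Keep labels only from workers who pass the gold check.

    Gold tasks are removed from the output because their answers are already known.
    """
    kept = {}
    for worker, answers in submissions.items():
        if gold_accuracy(answers, gold) >= min_accuracy:
            kept[worker] = {t: a for t, a in answers.items() if t not in gold}
    return kept

# Hypothetical data: g1 and g2 are easy tasks with known answers.
gold = {"g1": "spam", "g2": "not spam"}
submissions = {
    "careful_worker": {"g1": "spam", "g2": "not spam", "t1": "spam", "t2": "not spam"},
    "rushing_worker": {"g1": "not spam", "g2": "not spam", "t1": "not spam", "t2": "not spam"},
}

accepted = filter_workers(submissions, gold)
print(accepted)
```

Here the worker who answers "not spam" to everything gets only half of the gold tasks right and is dropped, while the careful worker's labels for t1 and t2 are kept.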

SmallChess · Accepted Answer · 2016-03-26 06:20:01Z

Generating a training set requires domain expertise. It can be very hard or very easy.

Example 1: Web document classification

If you want to classify web documents, there are billions of web pages on the Internet for you to download. The problem is not the amount of data (you just need a web crawler, so it is cheap), but how you process the pages into a more manageable representation.

Example 2: Disease classification

Collecting disease data can be very expensive. Not only may there be legal issues, but you would also need a team of PhD-level specialists to analyze the data (very expensive). The experiment must also be statistically sound; for example, you would have to account for covariates.

Joonatan Samuel · Accepted Answer · 2016-03-27 12:33:17Z

I think I can make it a bit clearer and collect the previous answers into one. You can think of four types of data:

Type 1: Inherently labeled data

An example would be a dataset for guessing the next word in a sentence. As soon as you have the text corpus, you also have the target words for your model.
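
To make this concrete, here is a small sketch of how (context, next word) training pairs fall out of raw text with no human labeling. The function name `next_word_pairs`, the whitespace tokenization, and the two-word context are illustrative assumptions only.

```python
def next_word_pairs(text, context_size=2):
    """Turn raw text into (context words, target word) training examples."""
    words = text.lower().split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))  # the label is simply the word that follows
    return pairs

corpus = "the cat sat on the mat"
pairs = next_word_pairs(corpus)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the corpus after the first `context_size` words yields one labeled example, so the amount of training data grows with the amount of text you can collect.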

Type 2: Generated data

If I am not mistaken, the MNIST database was generated by asking people to write digits. If you have a row of handwritten digits that are all supposed to be fives, you do not have to pay another person afterwards to label them. This brings us to...

Type 3: Labeled data

A human expert has gone through and labeled the data by hand. An example would be doctors classifying the disease type.

Type 4: Unlabeled data

Anything that doesn't have a label.
