
I'm trying to apply scikit-learn's decision tree to the following dataset, with the goal of classifying the data:

Sensor data:

  • multiple .csv files
  • every .csv file has multiple sensors (see here)
  • each .csv file has one label (0 or 1)

So far I've tried to train my model with pandas Series. It worked, but the decision tree couldn't differentiate between the features/sensors. Is a pandas Series the right approach for analysing data like this, or does anyone have another solution for this problem?

  • Try to include part of your code (related to the question), giving the community a chance to help you. Commented Apr 3, 2019 at 16:15
  • Hi, I think code isn't that relevant at this point. My question is more general, like "how to handle this kind of data". Commented Apr 3, 2019 at 16:28
  • You need to show the code so we can show where things went wrong, or possibly provide updated code to illustrate a better approach. Commented Apr 6, 2019 at 11:46
  • "Train my model with pandas Series" does not make any sense; there is no training functionality in pandas. Commented Apr 6, 2019 at 11:47
  • Please also provide a CSV with your example data. It is much easier to read and to show an example from. Commented Apr 6, 2019 at 11:49

1 Answer


For use with scikit-learn you need to flatten the 2D raw sensor data into 1D feature vectors. The code below demonstrates the basics.

What kind of feature engineering to apply for the best predictive effect depends entirely on the nature of your sensors and your problem; there are no details about this in the question or the data provided.

Feature Engineering

The overall process is:

  1. Look for patterns in the data (Exploratory Data Analysis)
  2. Attempt to create a new feature which describes this pattern
  3. Evaluate the new set of features using cross-validation (a minimal sketch follows after this list)
  4. Analyze the samples that your classifier got wrong (Error Analysis)
  5. Repeat from 1) until performance is good enough
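
To make step 3 concrete, here is a minimal sketch of scoring a candidate feature set with cross-validation. The `features` and `labels` names are assumptions matching the fitting example further below, not something given in the question.

```python
# A minimal sketch of step 3 (cross-validation), assuming a feature matrix
# `features` with shape (n_samples, n_features) and a 1D `labels` array,
# as produced by the example under "Fitting a classifier" below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_features(features, labels, folds=5):
    est = RandomForestClassifier(n_estimators=100, min_samples_leaf=0.01)
    # accuracy averaged over stratified folds; compare this number
    # each time you add or change a feature
    scores = cross_val_score(est, features, labels, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()
```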

Here are some things you should try:

  • Plot the raw sensor data from a few samples of the positive class and a few of the negative class.

  • Plot the distribution (histogram) of each raw sensor's values for each class, across the entire dataset.

  • Try standardizing the data: for each time-series of sensor data in a sample, remove the mean and divide by the standard deviation.

  • Try some standard statistical summaries of each time-series: max, min, mean, std, skew, kurtosis. These are unlikely to beat features tailored to the patterns you see, but they sometimes perform OK (a sketch follows after this list).

Focus first on univariate features per sensor; the decision tree will be good at combining these.
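
As a rough sketch of the standardization and summary-statistics suggestions above (an illustration, not the only way to do it), the `flatten()` step in `to_features` further below could be replaced with per-sensor summaries; `scipy.stats` is assumed to be available for skew and kurtosis.

```python
# Sketch: univariate summary-statistic features per sensor, instead of
# flattening the raw values. Assumes each sample is a DataFrame with a
# 'time' column plus one column per sensor, as in the example below.
import numpy
from scipy.stats import skew, kurtosis

def summarize_sensors(data):
    feature_columns = [c for c in data.columns if c != 'time']
    features = []
    for col in feature_columns:
        values = data[col].values.astype(float)
        std = values.std()
        # standardize the time-series (guarding against constant sensors)
        standardized = (values - values.mean()) / std if std > 0 else values - values.mean()
        features.extend([
            values.min(), values.max(), values.mean(), std,
            skew(standardized), kurtosis(standardized),
        ])
    return numpy.array(features)
```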

Fitting a classifier

```python
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier


def get_sensor_data():
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df


samples = [get_sensor_data() for _ in range(100)]
labels = [int(numpy.random.random() > 0.5) for _ in range(100)]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')


def to_features(data):
    # remove time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: the features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features


features = numpy.stack([to_features(d) for d in samples])
assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
```

Example output

```
sample from CSV file:
    time   sensor1   sensor2   sensor3   sensor4
0    0.1  0.820667  0.346542  0.625512  0.774050
1    0.2  0.821934  0.241652  0.485608  0.188131
2    0.3  0.264697  0.780841  0.137018  0.117096
3    0.4  0.464143  0.457126  0.972894  0.600710
4    0.5  0.530302  0.027401  0.876191  0.563788
5    0.6  0.598231  0.291814  0.588032  0.143753
6    0.7  0.627435  0.036549  0.276131  0.311099
7    0.8  0.527908  0.197046  0.580293  0.123796
8    0.9  0.068682  0.880533  0.956394  0.787993
9    1.0  0.244478  0.306716  0.586049  0.373013
label 1

Features: (100, 40)
 [0.82066682 0.62551234 0.77405    0.34654243 0.82193414 0.48560828
 0.18813108 0.24165186 0.26469686 0.1370181  0.11709553 0.78084136
 0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
 0.5637877  0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
 0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
 0.1237963  0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
 0.24447754 0.5860489  0.37301339 0.30671624]
```
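
To fill in the `# XXX: do train/test splits etc` placeholder, here is a minimal sketch; it reuses the `features` and `labels` arrays defined in the code block above.

```python
# Sketch: hold out a test set and report accuracy, continuing from the
# `features` and `labels` arrays defined in the code block above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, est.predict(X_test)))
```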
  • That's a similar approach to the one I had using pandas Series; I've converted my DataFrames to 1D too. But is the decision tree able to differentiate the features? As I mentioned in my question, that's the problem I'm not able to solve right now: "...but the decision tree couldn't differentiate the features/sensors...". Commented Apr 8, 2019 at 5:55
  • Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon, 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either), 3) plots of the test and training scores. Commented Apr 8, 2019 at 9:43
  • PS: always use RandomForest instead of DecisionTree; it performs much better. Commented Apr 8, 2019 at 9:44
  • RandomForest isn't that good for my sort of problem because of its "black box" nature. I want to understand and reproduce the decisions of the decision tree. My question was, and still is, how to classify multivariate time series on the basis of the sensors. Commented Apr 8, 2019 at 10:37
  • You need to do some feature engineering. The general techniques are enough to fill several books... Giving a reasonably specific answer requires the information in my comment before the one on RandomForest. Commented Apr 8, 2019 at 11:56
