from skmultiflow.data import SEAGenerator
import pandas as pd
import numpy as np
X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1,1))),
columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv')
RandomForest
is the batch version based on Decision Trees. AdaptiveRandomForest
is the stream version based on Hoeffding Trees. AdaptiveRandomForest
can be used with or without the drift detection. If you want to use AdaptiveRandomForest
without drift detection you must initialize it as AdaptiveRandomForest(drift_detection_method=None)
Thank you so much @jacobmontiel# Imports from skmultiflow.anomaly_detection import HalfSpaceTrees import numpy as np import pandas as pd df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z']) # Access raw numpy array inside the dataframe X_array = df.values # Setup Half-Space Trees estimator half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5) #, n_features=2) # Pre-train the model with one sample # the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]]) half_space_trees.partial_fit(np.asarray([X_array[0]]), [0]) anomaly_cnt = 0 # Train the estimator(s) with the samples provided by the data stream for X in X_array[1:]: y_pred = half_space_trees.predict([X]) if y_pred[0] == 1: anomaly_cnt += 1 half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0]) # Display results print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree
stream = DataStream(X_train, y = y_train)
stream.prepare_for_use()
ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
pretrain_size=5000,
max_samples=20000,
metrics = ['accuracy', 'running_time','model_size'],
output_file='results.csv')
evaluator.evaluate(stream=stream, model=ht);
show_plot=False
your code runs normally (is it correct?). It seems that your problem is related to the matplotlib backend used in jupyter. Probably the solution is to set a proper backend for your interactive plot
@jacobmontiel .. I think I figured it out, by taking your advise to set one of the Adaptive Random forest AdaptiveRandomForest(drift_detection_method=None). thank you
Glad to help.
@jacobmontiel Hi Jacob, is there a way that I can get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's Statistical Significance formula which requires me to know which labels classifier A got incorrect vs classifier B.. etc. etc. As such I need to record the actual values predicted by each classifier.
If you are using an evaluator you can add true_vs_predicted
to metrics
to get predicted values. In this case you also need to set n_wait=1
. As a suggestion, in this case deactivate the plot as n_wait=1
implies a high refresh rate in the plot which is a lot of overhead.
@automater0 I'm guessing the Kappa T stands for temporal. Bifet refers to it as Kper. see pg. 91 Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.
That is correct.
BatchIncremental
model. This is a simple class to show how you can do batch-incremental learning using batch methods from scikit-learn. But you are not restricted to models from that library.
DDM
and EDDM
expect input data (error) encoded in the oposite way to ADWIN
$ pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow