reader
A module for dataset readers. Currently supported datasets are:
- C-MAPSS Turbofan Degradation Dataset
- FEMTO (PRONOSTIA) Bearing Dataset
- XJTU-SY Bearing Dataset
- N-C-MAPSS New Turbofan Degradation Dataset
Readers are the foundation of the RUL Datasets library. They provide access to the data on disk and convert them into a common format so that other parts of the library can interact with it. The common format is as follows:
- Each dataset consists of multiple sub-datasets. The indices of these sub-datasets are called `FD`, following the C-MAPSS convention.
- Each sub-dataset contains a development (`dev`), a validation (`val`), and a test (`test`) split.
- Each split contains one or multiple time series of features and RUL targets that represent run-to-failure experiments.
- At each time step of a time series, we have a window of features and a target RUL value. The target is the RUL value of the window's last time step.
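To make this format concrete, here is a minimal numpy sketch of how windowed features and last-step RUL targets relate. This is only an illustration of the common format, not the library's code; the run length, window size, and channel count are made up:

```python
import numpy as np

# Hypothetical raw run: 50 time steps, 14 feature channels.
raw = np.random.randn(50, 14)
window_size = 30

# One window per time step that has enough history before it.
num_windows = raw.shape[0] - window_size + 1
features = np.stack([raw[i : i + window_size] for i in range(num_windows)])

# The RUL counts down to 1 at the point of failure; each window's target
# is the RUL of the window's last time step.
rul = np.arange(raw.shape[0], 0, -1)  # 50, 49, ..., 1
targets = rul[window_size - 1 :]

print(features.shape)           # (21, 30, 14)
print(targets[0], targets[-1])  # 21 1
```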
A reader class, e.g. the `CmapssReader`, represents a dataset and can manipulate it to your liking. A reader object has access to one sub-dataset of the dataset:

>>> reader = CmapssReader(fd=1)

The reader object can load the features and targets of each split into memory:
>>> dev_features, dev_targets = reader.load_split("dev")
>>> val_features, val_targets = reader.load_split("val")
>>> test_features, test_targets = reader.load_split("test")
The features are a list of numpy arrays, where each array has a shape of `[num_windows, window_size, num_channels]`:

>>> dev_features[0].shape
(163, 30, 14)

The targets are a list of numpy arrays, too, where each array has a shape of `[num_windows]`:

>>> dev_targets[0].shape
(163,)
Each reader defines a default window size for its data. This can be overridden by the `window_size` argument:
>>> fd1 = CmapssReader(fd=1, window_size=15)
>>> features, _ = fd1.load_split("dev")
>>> features[0].shape
(163, 15, 14)
Some datasets, e.g. C-MAPSS, use a piece-wise linear RUL function where a maximum RUL value is defined. The maximum RUL value for a reader can be set via the `max_rul` argument:
>>> fd1 = CmapssReader(fd=1, max_rul=100)
>>> _, targets = fd1.load_split("dev")
>>> max(np.max(t) for t in targets)
100.0
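The piece-wise linear RUL function can be sketched by hand: a linear countdown capped at `max_rul`. The run length of 192 cycles here is hypothetical, and the capping is illustrative rather than the library's implementation:

```python
import numpy as np

max_rul = 100
linear_rul = np.arange(192, 0, -1)            # 192, 191, ..., 1
piecewise_rul = np.minimum(linear_rul, max_rul)  # constant until degradation

print(piecewise_rul.max())  # 100
print(piecewise_rul[:3])    # [100 100 100]
```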
If you want to use a sub-dataset as unlabeled data, e.g. for unsupervised domain adaptation, it should not contain features from the point of failure. If the data contained these features, there would be no reason for it to be unlabeled. The `percent_broken` argument controls how much data near failure is available. A `percent_broken` of `0.8`, for example, means that only the first 80% of each time series is available:
>>> fd1 = CmapssReader(fd=1, percent_broken=0.8)
>>> features, targets = fd1.load_split("dev")
>>> features[0].shape
(130, 30, 14)
>>> np.min(targets[0])
34.0
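The numbers above can be reproduced by hand with a simple slice. This is only an illustration of the idea, assuming a run with 163 windows whose RUL counts down to 1; the library's own truncation may round differently:

```python
import numpy as np

percent_broken = 0.8
features = np.random.randn(163, 30, 14)  # windowed features of one run
targets = np.arange(163, 0, -1)          # RUL per window: 163, ..., 1

# Keep only the first 80% of the windows.
cutoff = int(len(features) * percent_broken)  # 130
print(features[:cutoff].shape)  # (130, 30, 14)
print(targets[:cutoff].min())   # 34
```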
If you have set a `max_rul`, you may want to truncate only data that is considered degraded, i.e. data with a RUL value smaller than `max_rul`. You can use the `truncate_degraded_only` option to do that. This way, only the data where the RUL value is smaller than or equal to `(1 - percent_broken) * max_rul` is cut off.
>>> fd1 = CmapssReader(fd=1, percent_broken=0.8, truncate_degraded_only=True)
>>> features, targets = fd1.load_split("dev")
>>> features[0].shape
(138, 30, 14)
>>> np.min(targets[0])
26.0
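This cutoff rule can also be checked by hand. The sketch below assumes a run with 163 windows and a `max_rul` of 125 (a common choice for C-MAPSS); it mirrors the example numbers but is not the library's implementation:

```python
import numpy as np

percent_broken, max_rul = 0.8, 125
targets = np.minimum(np.arange(163, 0, -1), max_rul)  # capped RUL per window

# Cut off windows whose RUL is at most (1 - percent_broken) * max_rul.
threshold = max_rul - percent_broken * max_rul  # 25.0
keep = targets > threshold

print(int(keep.sum()))          # 138
print(int(targets[keep].min())) # 26
```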
You may want to apply the same `percent_broken` from your training data to your validation data. This is sensible if you do not expect your algorithm to have access to labeled validation data in real life. You can achieve this by setting `truncate_val` to `True`:
>>> fd1 = CmapssReader(fd=1, percent_broken=0.8, truncate_val=True)
>>> features, targets = fd1.load_split("val")
>>> np.min(targets[0])
44.0
Data-driven RUL estimation algorithms are often sensitive to the overall amount of training data: the more data is available, the more of its variance is covered. If you want to investigate how an algorithm performs in a low-data setting, you can use `percent_fail_runs`. This argument controls how many runs are used for training. A `percent_fail_runs` of `0.8` means that 80% of the available training runs are used. If you need more control over which runs are used, you can pass a list of indices to use only those runs. This is useful for conducting semi-supervised learning, where you consider one part of a sub-dataset labeled and the other part unlabeled:
>>> fd1 = CmapssReader(fd=1, percent_fail_runs=0.8)
>>> features, targets = fd1.load_split("dev")
>>> len(features)
64
>>> fd1 = CmapssReader(fd=1, percent_fail_runs=[0, 5, 40])
>>> features, targets = fd1.load_split("dev")
>>> len(features)
3
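Both selection modes boil down to plain list slicing and indexing, which the following sketch illustrates. The 80 runs mirror C-MAPSS FD1; the run contents are made up:

```python
import numpy as np

runs = [np.zeros((10, 30, 14)) for _ in range(80)]  # 80 hypothetical runs

# A fraction selects the first percent_fail_runs share of runs.
by_fraction = runs[: int(len(runs) * 0.8)]
print(len(by_fraction))  # 64

# An index list selects exactly those runs.
indices = [0, 5, 40]
by_index = [runs[i] for i in indices]
print(len(by_index))     # 3
```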
If you have constructed a reader with a certain `percent_fail_runs`, you can get a reader containing all other runs by using the `get_complement` function:
>>> fd1 = CmapssReader(fd=1, percent_fail_runs=0.8)
>>> fd1_complement = fd1.get_complement()
>>> features, targets = fd1_complement.load_split("dev")
>>> len(features)
16
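Conceptually, the complement is just a set difference over run indices, shown here by hand for the 80 runs of the example above (an illustration, not the library's code):

```python
all_runs = set(range(80))             # all 80 runs of the sub-dataset
selected = set(range(int(80 * 0.8)))  # the 64 runs used by percent_fail_runs=0.8
complement = sorted(all_runs - selected)

print(len(complement))  # 16
```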
The effects of `percent_broken` and `percent_fail_runs` are summarized under the term truncation, as they effectively truncate the dataset in two dimensions.
The readers for the FEMTO and XJTU-SY datasets have two additional constructor arguments. The `first_time_to_predict` argument lets you set an individual maximum RUL value per run in the dataset. As both are bearing datasets, the first-time-to-predict is defined as the time step where the degradation of the bearing first becomes noticeable. The RUL value before this time step is assumed to be constant. Setting `norm_rul` scales the RUL to [0, 1] per run, which is best practice when using first-time-to-predict.
>>> fttp = [10, 20, 30, 40, 50]
>>> fd1 = rul_datasets.reader.XjtuSyReader(
... fd=1, first_time_to_predict=fttp, norm_rul=True
... )
>>> fd1.prepare_data()
>>> features, labels = fd1.load_split("dev")
>>> labels[0][:15]
array([1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 0.99115044, 0.98230088, 0.97345133, 0.96460177])
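The shape of these labels can be reproduced by hand: constant RUL before the first-time-to-predict, a countdown afterwards, and a division by the maximum to land in [0, 1]. The run length of 123 steps is an assumption chosen to match the pattern above; `fttp=10` matches the first entry of the list:

```python
import numpy as np

num_steps, fttp = 123, 10
# RUL is capped at its value at the first-time-to-predict step...
rul = np.minimum(np.arange(num_steps, 0, -1), num_steps - fttp)
# ...then normalized per run to [0, 1].
norm = rul / rul.max()

print(norm[9:13])  # [1.         1.         0.99115044 0.98230088]
```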
Readers can be used as-is if you just want access to the dataset. If you plan to use them with PyTorch or PyTorch Lightning, it is recommended to combine them with a `RulDataModule`:

>>> dm = rul_datasets.RulDataModule(reader, batch_size=32)

For more information, see the core module page or the Libraries page.