Data preparation
To create your own customized model, the quality and quantity of your data are key factors that affect model performance. Therefore, it is important to follow a standard data preparation process.
Data preparation steps
These are the basic steps for preparing a dataset:
Data collection: before developing a model, consider its purpose, then think about the data needed to achieve that purpose. Gather data from your existing sources and from external sources as needed.
Data exploration: after the data is collected, review its characteristics to fully understand it, including data types, distributions, relationships between variables, trends, etc.
Data cleansing: this is the step for dealing with inconsistencies, outliers, anomalies, missing data, and other issues. Cleansing allows the model to focus on meaningful patterns in the data (see the sketch after these steps).
Data structuring: ensure that your data is formatted in a form that fits the selected algorithm.
*Please read the data preparation recommendations in each product’s “Overview” section.
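As a minimal illustration of the exploration and cleansing steps above, here is a sketch using pandas; the file name data.csv and the amount column are hypothetical placeholders, not part of the platform.

```python
# A minimal sketch of the exploration and cleansing steps using pandas.
# "data.csv" and the "amount" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("data.csv")

# Exploration: review types, summary statistics, and missing values.
print(df.dtypes)
print(df.describe())
print(df.isna().sum())

# Cleansing: remove exact duplicates and drop extreme outliers
# (here, values more than 3 standard deviations from the mean).
df = df.drop_duplicates()
mean, std = df["amount"].mean(), df["amount"].std()
df = df[(df["amount"] - mean).abs() <= 3 * std]
```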
General guideline for data preparation
Data preparation procedures depend on the type of data, and different machine learning algorithms have different requirements. Please follow the guidelines below:
The data used during development should be representative of the future, real-use data in production. Representativeness covers attributes, distribution, and domain.
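One way to check representativeness is to compare feature distributions between the development data and a sample of production data. The sketch below uses a two-sample Kolmogorov-Smirnov test; the file and column names are hypothetical.

```python
# A sketch of a representativeness check: compare the distribution of a
# feature in the development data against a sample of production data.
# File and column names ("dev_data.csv", "amount", ...) are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

dev = pd.read_csv("dev_data.csv")
prod = pd.read_csv("production_sample.csv")

# A two-sample Kolmogorov-Smirnov test flags distribution drift.
result = ks_2samp(dev["amount"], prod["amount"])
if result.pvalue < 0.05:
    print("Warning: 'amount' is distributed differently in production.")
```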
Data leakage is the situation where the training data contains information about the target you are trying to predict, which can produce misleadingly optimistic model performance.
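A subtle and common form of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before splitting, so that test-set statistics influence training. The sketch below shows the safe pattern on synthetic data.

```python
# A subtle form of leakage: fitting a scaler on the full dataset before
# splitting lets test-set statistics influence training. The safe pattern
# below fits preprocessing on the training split only. Data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 4)             # placeholder features
y = np.random.randint(0, 2, size=100)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # transform, never refit, the test set
```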
Missing or NaN data can also affect model performance. There are many ways to handle them, so you should select the approach that fits your data.
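The sketch below shows a few common options for handling missing values with pandas; which one is appropriate depends on your data, and the column names are hypothetical.

```python
# Common options for handling missing values; the right choice depends
# on the data. The columns "age" and "city" are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["A", "B", None]})

dropped = df.dropna()  # drop any row containing a missing value
imputed = df.fillna({
    "age": df["age"].median(),  # impute numeric columns with the median
    "city": "unknown",          # impute categorical columns with a constant
})
```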
The quality of labeled data is measured by its consistency and accuracy. Consistency means that similar data points receive the same label. Accuracy means that labels closely reflect the true conditions. Labelers should follow a single standard guideline across the dataset, and labels should be monitored continuously over time.
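One simple consistency check is to look for identical inputs that received different labels. The sketch below assumes a hypothetical labeled_data.csv with text and label columns; it is an illustration, not an ACP feature.

```python
# A simple consistency check: find identical inputs that received
# different labels. "labeled_data.csv" with "text" and "label" columns
# is a hypothetical example, not an ACP artifact.
import pandas as pd

labels = pd.read_csv("labeled_data.csv")

distinct_labels = labels.groupby("text")["label"].nunique()
conflicts = distinct_labels[distinct_labels > 1]
print(f"{len(conflicts)} items were labeled inconsistently")
```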
Data split
The data split procedure is used to evaluate a predictive or machine learning model. Generally, the data is separated into two sets: training and testing data.
The purpose of each set is as follows:
Training – for building the machine learning model.
Testing – for estimating the performance of the model after training, using an unseen dataset, to confirm that the model was trained effectively.
The simplest method for splitting data is to shuffle the dataset and split it at an arbitrary ratio, for example 80% training and 20% testing. Shuffling is carried out to make sure that each split has similar characteristics. Although this is generally fine for most datasets, some datasets need more care when splitting. The training and test sets should each be large and diverse enough to remain representative of the original dataset.
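A minimal sketch of such a shuffled 80/20 split using scikit-learn’s train_test_split on synthetic data; stratifying on the label keeps class proportions similar across the two splits.

```python
# A minimal shuffled 80/20 split with scikit-learn on synthetic data.
# Stratifying on the label keeps class proportions similar in both splits.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 800 200
```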
Data Split on ACP
For sentiment analysis: currently, there is one option for splitting the dataset:
Use default data split: by default, the platform automatically performs a random split of the data into 80% training and 20% testing.
For other options and products: Coming soon