Regression
A dataset for regression is a table in which each row represents one sample and each column holds one piece of information about that sample. The structure will be familiar to anyone who has worked with tables or spreadsheets, and it is described below.
STRUCTURE
Rows
Each row is an independent sample and represents a specific case for which a value will be predicted.
For example, if you want to predict house prices, each row in the dataset would correspond to a different house.
Property Columns (X)
Also known as input variables or features, or simply "X", these columns are what the model uses as the basis for its predictions.
Each column represents a characteristic or attribute that influences the outcome.
Example: for each house listed, the columns might contain information such as:
House size (in square meters)
Number of bedrooms
Age of the house (in years)
Location (converted to a numeric value)
Outcome Column (Y)
This is the output variable or target, also called "Y". It holds the value the model should learn to predict from the input data.
In the case of house prices, this column would contain the price (in local currency, for example).
EXAMPLE
Imagine you're creating a dataset to predict house prices; the file might look something like this:
1360    2    1    1981    0    262382.85
4272    3    3    2016    1    985260.85
3592    1    2    2016    0    777977.39
966     1    2    1977    1    229698.91
4926    2    1    1993    0    1041740.85
X Columns (Properties): Num_Bedrooms, Square_Footage, etc.
Y Column (Outcome): House price.
Each row contains all the information about a specific house and the final price of that house, which we want the model to learn how to predict.
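The split between property columns and the outcome column can be sketched in a few lines. The snippet below is an illustration in plain Python, not DelphAI code. It assumes that the outcome (the price) is the last value of each sample, as the example values suggest, and the helper name `split_features_target` is hypothetical.

```python
# Sample values copied from the example above.
# Assumption: the last value of each row is the price (Y).
rows = [
    [1360, 2, 1, 1981, 0, 262382.85],
    [4272, 3, 3, 2016, 1, 985260.85],
    [3592, 1, 2, 2016, 0, 777977.39],
    [966, 1, 2, 1977, 1, 229698.91],
    [4926, 2, 1, 1993, 0, 1041740.85],
]


def split_features_target(table):
    """Return (X, y): every column except the last, and the last column."""
    X = [row[:-1] for row in table]
    y = [row[-1] for row in table]
    return X, y


X, y = split_features_target(rows)
print(X[0])  # properties of the first house
print(y[0])  # price of the first house
```

Each entry of `X` keeps the properties of one house, and the entry at the same position in `y` is the price the model should learn to predict for it.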
DATA IMPORT
All DelphAI objects allow you to import data either from a CSV file or a TDataset.
CSV:
The CSV file must follow the same format as the table above: one sample per line and one value per column.
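As a rough illustration of that format, the sketch below parses CSV text with Python's standard `csv` module and converts every field to a number. It is not DelphAI code. The comma delimiter and the absence of a header row are assumptions, and `parse_numeric_csv` is a hypothetical helper.

```python
import csv
import io

# Two of the example samples written as comma-separated text.
# Assumptions: comma delimiter, no header row.
csv_text = """1360,2,1,1981,0,262382.85
4272,3,3,2016,1,985260.85
"""


def parse_numeric_csv(text):
    """Parse CSV text into rows of floats; raises ValueError on a non-numeric field."""
    reader = csv.reader(io.StringIO(text))
    # Blank lines come back as empty records and are skipped.
    return [[float(field) for field in record] for record in reader if record]


data = parse_numeric_csv(csv_text)
print(data[0])
```

Because every field goes through `float`, an empty or non-numeric cell stops the parse with an error instead of slipping into the dataset unnoticed.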
TDataset/Query:
The dataset can also be stored in a relational database. In that case, use an SQL SELECT query on the table to retrieve the data.
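What such a query might look like is sketched below, using Python's built-in `sqlite3` module and an in-memory database. Nothing here is DelphAI API. The table name `houses` and its column names are hypothetical, and the pairing of the sample values with those columns is assumed for illustration only. The query lists the property columns first and the outcome column last, and it filters out rows with missing values.

```python
import sqlite3

# In-memory database with a hypothetical "houses" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE houses ("
    "square_footage REAL, num_bedrooms REAL, house_price REAL)"
)
# Values borrowed from the example; their mapping to columns is an assumption.
conn.executemany(
    "INSERT INTO houses VALUES (?, ?, ?)",
    [(1360, 2, 262382.85), (4272, 3, 985260.85), (3592, None, 777977.39)],
)

# Property columns (X) first, outcome column (Y) last; incomplete rows excluded.
query = (
    "SELECT square_footage, num_bedrooms, house_price "
    "FROM houses "
    "WHERE square_footage IS NOT NULL "
    "AND num_bedrooms IS NOT NULL "
    "AND house_price IS NOT NULL "
    "ORDER BY square_footage"
)
result = conn.execute(query).fetchall()
conn.close()
print(result)
```

The third inserted row has no bedroom count, so the `WHERE` clause leaves it out and only complete samples reach the model.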
RULES AND TIPS FOR CREATING THE DATASET
Data Consistency:
All rows must have the same number of columns.
All values in the columns must be in numeric format.
No Missing Values:
Every cell must have a value (no "gaps" are allowed).
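These rules are easy to check before importing a file. The following is a minimal sketch in plain Python, not part of DelphAI; the function name `validate_dataset` is hypothetical, and it assumes the rows have already been read as lists of text cells.

```python
def validate_dataset(rows):
    """Return a list of problems found; an empty list means every check passed."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    expected = len(rows[0])  # the first row defines the expected column count
    for index, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(
                f"row {index}: expected {expected} columns, found {len(row)}"
            )
        for col, cell in enumerate(row, start=1):
            text = str(cell).strip()
            if text == "":
                problems.append(f"row {index}, column {col}: missing value")
                continue
            try:
                float(text)
            except ValueError:
                problems.append(
                    f"row {index}, column {col}: not numeric ({text!r})"
                )
    return problems


good = [["1360", "2", "262382.85"], ["4272", "3", "985260.85"]]
bad = [["1360", "2", "262382.85"], ["4272", "", "abc"], ["966", "1"]]

print(validate_dataset(good))
for problem in validate_dataset(bad):
    print(problem)
```

For the `bad` rows the function reports one missing value, one non-numeric cell, and one row with too few columns, which correspond to the rules listed above.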
EXAMPLE DATASET
You can find an example of the CSV file in the official repository.