Regression

A regression dataset is a table in which each row represents one sample and each column holds one piece of information about that sample. Anyone who has worked with tables or spreadsheets will find the structure familiar; it is described below.

STRUCTURE

Rows
- Each row is an independent sample and represents a specific case for which a value will be predicted.
- For example, if you want to predict house prices, each row in the dataset would correspond to a different house.
Property Columns (X)
- Also known as input variables or features (or simply "X"), these columns are what the model bases its predictions on.
- Each column represents a characteristic or attribute that influences the outcome.
- Example: for each house listed, the columns might contain information such as:
  - House size (in square meters)
  - Number of bedrooms
  - Age of the house (in years)
  - Location (converted to a numeric value)
Outcome Column (Y)
- This is the output variable or target, representing what the model should learn to predict. It is also called "Y".
- It is the value that will be predicted using the input data.
- In the case of house prices, this column would contain the price (in local currency, for example).
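
The "Location (converted to a numeric value)" item above can be illustrated with a small sketch. It is written in Python purely for illustration and does not use the DelphAI API; the field names, the sample records, and the choice of plain integer codes are all assumptions, not something the library prescribes.

```python
# Hypothetical raw records: field names and values are illustrative only.
houses = [
    {"size_m2": 120, "bedrooms": 3, "age_years": 10, "location": "downtown"},
    {"size_m2": 85, "bedrooms": 2, "age_years": 15, "location": "suburb"},
    {"size_m2": 200, "bedrooms": 4, "age_years": 5, "location": "downtown"},
]

# Give each distinct location an integer code, in order of first appearance.
location_codes = {}
for house in houses:
    location_codes.setdefault(house["location"], len(location_codes))

# Build purely numeric property rows (X) from the raw records.
X = [
    [h["size_m2"], h["bedrooms"], h["age_years"], location_codes[h["location"]]]
    for h in houses
]

print(location_codes)
print(X)
```

After this step every property column holds a number, which is the form the dataset rules later on this page ask for.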

EXAMPLE

Imagine you're creating a dataset to predict house prices; the file might look something like this:

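Square_Footage | Num_Bedrooms | Num_Bathrooms | Year_Built | Garage_Size | House_Price
---------------|--------------|---------------|------------|-------------|------------
1360           |              |               | 1981       |             | 262382.85
4272           |              |               | 2016       |             | 985260.85
3592           |              |               | 2016       |             | 777977.39
966            |              |               | 1977       |             | 229698.91
4926           |              |               | 1993       |             | 1041740.85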

X Columns (Properties): Square_Footage, Num_Bedrooms, Num_Bathrooms, etc.
Y Column (Outcome): House_Price.

Each row contains all the information about a specific house and the final price of that house, which we want the model to learn how to predict.
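
As a rough illustration of the X/Y split, the sketch below separates the outcome column from the property columns. It is plain Python, not DelphAI code, and it assumes that the three values listed per house in the example above belong to Square_Footage, Year_Built, and House_Price; the other columns are left out because their values are not shown.

```python
# Assumption: the listed values map to these three columns, in this order.
columns = ["Square_Footage", "Year_Built", "House_Price"]
rows = [
    [1360, 1981, 262382.85],
    [4272, 2016, 985260.85],
    [3592, 2016, 777977.39],
    [966, 1977, 229698.91],
    [4926, 1993, 1041740.85],
]

target = "House_Price"
target_index = columns.index(target)

# X: every column except the target. y: the target column only.
X = [[v for i, v in enumerate(row) if i != target_index] for row in rows]
y = [row[target_index] for row in rows]

print(X[0], y[0])
```

Each entry of `X` lines up with the entry of `y` at the same position, so row order must be preserved when splitting.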

DATA IMPORT

All DelphAI objects allow you to import data either from a CSV file or a TDataset.

CSV:

The CSV file must follow the same format as the table above.

Example of a CSV file:

ParamA,ParamB,ParamC,Result
120,3,10,450000
85,2,15,320000
200,4,5,750000
150,3,20,500000
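
To make the layout concrete, here is a minimal sketch that reads the CSV above with Python's standard `csv` module. It does not show how DelphAI itself loads a file. Treating the last column as the outcome (Y) is an assumption made for this sketch.

```python
import csv
import io

# The CSV text from the example above; in practice this would be a file on disk.
csv_text = """ParamA,ParamB,ParamC,Result
120,3,10,450000
85,2,15,320000
200,4,5,750000
150,3,20,500000
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # first line holds the column names
rows = [[float(value) for value in row] for row in reader]

# Assumption: the last column is the outcome (Y); the rest are properties (X).
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(header)
print(X[0], y[0])
```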

TDataset/Query:

The dataset can also be stored in a table in a relational database.

Example of the rows returned by a SELECT query on that table:

ParamA | ParamB | ParamC | Result
-------|--------|--------|--------
120    | 3      | 10     | 450000
85     | 2      | 15     | 320000
200    | 4      | 5      | 750000
150    | 3      | 20     | 500000

Use an SQL query to select the data:
```
SELECT * FROM HousingPrice;
```
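
The query can be tried without a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; it stands in for whatever database and TDataset component you actually use and is not DelphAI code. The table name `HousingPrice` comes from the query above, while the column types are an assumption.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HousingPrice (ParamA REAL, ParamB REAL, ParamC REAL, Result REAL)"
)
conn.executemany(
    "INSERT INTO HousingPrice VALUES (?, ?, ?, ?)",
    [
        (120, 3, 10, 450000),
        (85, 2, 15, 320000),
        (200, 4, 5, 750000),
        (150, 3, 20, 500000),
    ],
)

# The same query shown above: every column of every row.
cursor = conn.execute("SELECT * FROM HousingPrice")
column_names = [description[0] for description in cursor.description]
records = cursor.fetchall()
conn.close()

print(column_names)
for record in records:
    print(record)
```

The result has the same shape as the CSV file: one record per sample, with the property columns and the outcome column side by side.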

RULES AND TIPS FOR CREATING THE DATASET

Data Consistency:
- All rows must have the same number of columns.
- All values in the columns must be in numeric format.
No Missing Values:
- Every cell must have a value (no "gaps" are allowed).
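
These rules are easy to check before importing. The following is a small sketch in Python, independent of DelphAI; the function name `validate_dataset` and the wording of its messages are hypothetical.

```python
import math


def validate_dataset(rows):
    """Return a list of problems found; an empty list means the rows pass."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    expected = len(rows[0])  # every row must match the first row's width
    for n, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(f"row {n}: expected {expected} columns, found {len(row)}")
        for c, cell in enumerate(row, start=1):
            text = str(cell).strip()
            if text == "":
                problems.append(f"row {n}, column {c}: missing value")
                continue
            try:
                value = float(text)
            except ValueError:
                problems.append(f"row {n}, column {c}: {text!r} is not numeric")
                continue
            if not math.isfinite(value):
                problems.append(f"row {n}, column {c}: {text!r} is not a finite number")
    return problems


good = [["120", "3", "10", "450000"], ["85", "2", "15", "320000"]]
bad = [["120", "3", "10", "450000"], ["85", "", "15"], ["200", "four", "5", "750000"]]

print(validate_dataset(good))
for problem in validate_dataset(bad):
    print(problem)
```

Here `good` passes all three checks, while `bad` has a short row, an empty cell, and a non-numeric value.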

EXAMPLE DATASET

You can find an example of the CSV file in the official repository.

