Regression

A regression dataset is a table in which each row represents one sample and each column holds one piece of information about that sample. Anyone who has worked with tables or spreadsheets will find this structure familiar; the sections below describe it in detail.

STRUCTURE

  1. Rows

    • Each row is an independent sample and represents a specific case for which a value will be predicted.

    • For example, if you want to predict house prices, each row in the dataset would correspond to a different house.

  2. Properties Columns (X)

    • Also known as input variables, features, or "X", these columns are what the model bases its predictions on.

    • Each column represents a characteristic or attribute that influences the outcome.

    • Example: for each house listed, the columns might contain information such as:

      • House size (in square meters)

      • Number of bedrooms

      • Age of the house (in years)

      • Location (converted to a numeric value)

  3. Outcome Column (Y)

    • This is the output variable or target, representing what the model should learn to predict. It is also called "Y".

    • It is the value that will be predicted using the input data.

    • In the case of house prices, this column would contain the price (in local currency, for example).

EXAMPLE

Imagine you're creating a dataset to predict house prices; the file might look something like this:

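Square_Footage | Num_Bedrooms | Num_Bathrooms | Year_Built | Garage_Size | House_Price
---------------|--------------|---------------|------------|-------------|------------
1360           | 2            | 1             | 1981       | 0           | 262382.85
4272           | 3            | 3             | 2016       | 1           | 985260.85
3592           | 1            | 2             | 2016       | 0           | 777977.39
966            | 1            | 2             | 1977       | 1           | 229698.91
4926           | 2            | 1             | 1993       | 0           | 1041740.85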

  • X Columns (Properties): Square_Footage, Num_Bedrooms, Num_Bathrooms, Year_Built, Garage_Size.

  • Y Column (Outcome): House_Price.

Each row contains all the information about a specific house and the final price of that house, which we want the model to learn how to predict.
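
The split between X and Y columns can be sketched in a few lines. The snippet below is a minimal illustration in Python, not DelphAI code: the helper name `split_features_target` is hypothetical, and it assumes the outcome column is identified by its header name (`House_Price`). The data is the example table above written as CSV text.

```python
import csv
import io

# The example table above, written as CSV text.
CSV_TEXT = """Square_Footage,Num_Bedrooms,Num_Bathrooms,Year_Built,Garage_Size,House_Price
1360,2,1,1981,0,262382.85
4272,3,3,2016,1,985260.85
3592,1,2,2016,0,777977.39
966,1,2,1977,1,229698.91
4926,2,1,1993,0,1041740.85
"""


def split_features_target(csv_text, target="House_Price"):
    """Hypothetical helper: return (feature_names, X, y).

    X holds the property columns of every row, y holds the outcome column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column except the target is treated as a property (X) column.
    feature_names = [name for name in reader.fieldnames if name != target]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(float(row[target]))
    return feature_names, X, y


feature_names, X, y = split_features_target(CSV_TEXT)
print(feature_names)      # names of the X columns
print(X[0], "->", y[0])   # first house: its properties and its price
```

Each entry of `X` lines up with the entry of `y` at the same index, which mirrors the "one row, one house" rule described above.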

DATA IMPORT

All DelphAI objects allow you to import data either from a CSV file or a TDataset.

  1. CSV:

    • The CSV file must follow the same format as the table above.

    • Example of a CSV file:

      ParamA,ParamB,ParamC,Result
      120,3,10,450000
      85,2,15,320000
      200,4,5,750000
      150,3,20,500000

  2. TDataset/Query:

    • The dataset can be stored in a relational database.

    • Example of the rows returned by a SELECT query on the table in the database:

      ParamA | ParamB | ParamC | Result
      -------|--------|--------|--------
      120    | 3      | 10     | 450000
      85     | 2      | 15     | 320000
      200    | 4      | 5      | 750000
      150    | 3      | 20     | 500000

    • Use an SQL query to select the data:

      SELECT * FROM HousingPrice;
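
To show that a query result has the same shape as the CSV file (a header plus numeric rows), here is a small sketch in Python. It is not DelphAI code and does not use a TDataset: an in-memory SQLite database stands in for the relational database, and the column types (`REAL`) and the helper name `fetch_housing_rows` are assumptions made for illustration. Only the table name `HousingPrice` and the sample values come from the text above.

```python
import sqlite3

# Sample rows taken from the example table above.
SAMPLE_ROWS = [
    (120, 3, 10, 450000),
    (85, 2, 15, 320000),
    (200, 4, 5, 750000),
    (150, 3, 20, 500000),
]


def fetch_housing_rows():
    """Hypothetical helper: build the table in memory and run the SELECT."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real database
    try:
        # Column types are an assumption; the source only names the columns.
        conn.execute(
            "CREATE TABLE HousingPrice "
            "(ParamA REAL, ParamB REAL, ParamC REAL, Result REAL)"
        )
        conn.executemany(
            "INSERT INTO HousingPrice VALUES (?, ?, ?, ?)", SAMPLE_ROWS
        )
        cursor = conn.execute("SELECT * FROM HousingPrice;")
        # cursor.description carries the column names of the result set.
        columns = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    return columns, rows


columns, rows = fetch_housing_rows()
print(columns)
for row in rows:
    print(row)
```

The column names and rows returned here correspond one to one with the header line and data lines of the CSV example, so the same dataset rules apply to both import paths.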

RULES AND TIPS FOR CREATING THE DATASET

  1. Data Consistency:

    • All rows must have the same number of columns.

    • All values in the columns must be in numeric format.

  2. No Missing Values:

    • Every cell must have a value (no "gaps" are allowed).
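
These rules can be checked before importing a file. The following is a minimal sketch in Python, not part of DelphAI: the function name `validate_dataset` and the wording of the messages are hypothetical, and it assumes the first line of the CSV is a header. It reports rows with the wrong number of columns, empty cells, and values that cannot be read as numbers.

```python
import csv
import io


def validate_dataset(csv_text):
    """Hypothetical checker: return a list of problems (empty list = OK)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(data, start=2):
        # Rule 1: every row has the same number of columns as the header.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                # Rule 2: no missing values.
                problems.append(f"line {line_no}: missing value in column {name}")
            else:
                try:
                    float(value)  # Rule 1: values must be numeric.
                except ValueError:
                    problems.append(
                        f"line {line_no}: non-numeric value {value!r} in column {name}"
                    )
    return problems


good = "ParamA,ParamB,ParamC,Result\n120,3,10,450000\n85,2,15,320000\n"
bad = "ParamA,ParamB,ParamC,Result\n120,,10,450000\n85,two,15,320000\n200,4,5\n"

print(validate_dataset(good))
for problem in validate_dataset(bad):
    print(problem)
```

Note that `float()` also accepts strings such as `nan` and `inf`; a stricter check may be needed if those should be rejected.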

EXAMPLE DATASET

You can find an example of the CSV file in the official repository.
