Classification

Like the regression dataset, the classification dataset is a table in which each row represents one sample and each column holds one piece of information about that sample. Its purpose is to let an AI model learn to assign each sample to a specific group or class.

STRUCTURE

  1. Rows

    • Each row is an independent sample and represents something that will be classified.

    • For example, if you want to create a model that identifies different types of fruit, each row in the dataset would represent a specific fruit.

  2. Property Columns (X)

    • These are the input variables, also referred to as "X", that the model uses as the basis for its predictions.

    • These columns contain the information that helps the model make decisions.

    • Each column is a feature or attribute that describes the sample.

    • Examples of features:

      • Weight of a fruit (in grams).

      • Size of the fruit (in cm).

      • Whether the skin is smooth (1 = yes, 0 = no).

  3. Class Column (Y)

    • This is the output variable that indicates the class or category of each sample.

    • The class is what we want the model to learn to predict. In the case of fruits, the class will be the name of the type of fruit, such as "apple," "banana," or "orange."

    • This column can contain:

      • Strings: For example, "apple," "banana," "orange."

      • Numbers: For example, 1 = "apple," 2 = "banana," 3 = "orange."
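
The two ways of writing the class column can be related with a small sketch. The Python snippet below is only an illustration and is not part of DelphAI; the function name `encode_labels` and the choice to number classes from 1 in first-seen order are assumptions made for the example.

```python
def encode_labels(labels):
    """Map each distinct class label to an integer code, starting at 1."""
    codes = {}
    for label in labels:
        if label not in codes:
            codes[label] = len(codes) + 1  # first-seen order: 1, 2, 3, ...
    return [codes[label] for label in labels], codes


labels = ["apple", "banana", "orange", "apple"]
encoded, mapping = encode_labels(labels)
print(encoded)  # [1, 2, 3, 1]
print(mapping)  # {'apple': 1, 'banana': 2, 'orange': 3}
```

Either form identifies the same classes; the numeric form simply needs the mapping kept alongside it so that predictions can be read back as names.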

EXAMPLE

Here is an example of a dataset for classifying fruits based on their features:

Weight | Size | Smooth skin | Fruit Type
-------|------|-------------|-----------
150    | 6.5  | 1           | Apple
120    | 12   | 0           | Banana
200    | 4.9  | 0           | Orange
180    | 43   | 1           | Watermelon

  • The feature columns (X) are: Weight, Size, Smooth skin.

  • The class column (Y) is: Fruit Type.

Each row represents a specific fruit, with its characteristics (weight, size, etc.) and the type (class) it belongs to.
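
As a rough illustration of the X/Y split described above, the following Python sketch separates the example rows into feature values and class labels. It is independent of DelphAI; the variable names `rows`, `X` and `y` are chosen for the example only.

```python
# Rows copied from the example table:
# weight (g), size (cm), smooth skin (1/0), fruit type.
rows = [
    [150, 6.5, 1, "Apple"],
    [120, 12, 0, "Banana"],
    [200, 4.9, 0, "Orange"],
    [180, 43, 1, "Watermelon"],
]

# Every column except the last is a feature (X); the last is the class (Y).
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(X[0])  # [150, 6.5, 1]
print(y)     # ['Apple', 'Banana', 'Orange', 'Watermelon']
```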

DATA IMPORT

Every DelphAI object can import the dataset from a CSV file or from a TDataSet.

  1. CSV:

    • The CSV file must follow the same format as the table above.

    • Example of a CSV file:

      ParamA,ParamB,ParamC,Result
      150,6.5,1,Apple
      120,12,0,Banana
      200,4.9,0,Orange
      180,29,0,Watermelon

  2. TDataSet/Query:

    • The dataset can be stored in a relational database.

    • Example of the table as stored in the database:

      ParamA | ParamB | ParamC | Result
      -------|--------|--------|--------
      150    | 6.5    | 1      | Apple
      120    | 12     | 0      | Banana
      200    | 4.9    | 0      | Orange
      180    | 29     | 0      | Watermelon

    • Use an SQL query to select the data:

      SELECT * FROM Fruits;
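
The query step can be tried out with a small standalone sketch. The Python snippet below uses an in-memory SQLite database in place of a real server and does not involve DelphAI or a TDataSet; the column types are assumptions. It names the columns explicitly instead of using `SELECT *`, so the class column is returned last regardless of how the table was created. Whether DelphAI relies on that ordering for query results is an assumption based on the rule on this page that the class column is the last column.

```python
import sqlite3

# In-memory database standing in for the real one (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Fruits (ParamA REAL, ParamB REAL, ParamC INTEGER, Result TEXT)"
)
conn.executemany(
    "INSERT INTO Fruits VALUES (?, ?, ?, ?)",
    [
        (150, 6.5, 1, "Apple"),
        (120, 12, 0, "Banana"),
        (200, 4.9, 0, "Orange"),
        (180, 29, 0, "Watermelon"),
    ],
)

# Explicit column list: features first, class column last.
cursor = conn.execute("SELECT ParamA, ParamB, ParamC, Result FROM Fruits")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)  # ['ParamA', 'ParamB', 'ParamC', 'Result']
```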

RULES AND TIPS FOR CREATING THE DATASET

  1. Data Consistency:

    • All rows must have the same number of columns.

    • All feature columns must contain numeric values; only the class column (the last column) may contain text.

  2. No Missing Values:

    • Every cell must have a value (no "gaps" are allowed).
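
The rules above can be checked before importing a file. The following Python sketch is one possible pre-flight check and is not part of DelphAI; the function name `validate_classification_csv` is hypothetical, and it assumes the first line is a header row and that decimals use a dot, as in the CSV example on this page.

```python
import csv
import io


def validate_classification_csv(text):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    for line_no, row in enumerate(reader, start=2):
        # Rule: all rows must have the same number of columns.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        # Rule: no missing values in any cell.
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append(f"line {line_no}: missing value in {name}")
        # Rule: every column except the last (the class) must be numeric.
        for name, value in zip(header[:-1], row[:-1]):
            if value.strip() == "":
                continue  # already reported as missing
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value in {name}")
    return problems


sample = (
    "ParamA,ParamB,ParamC,Result\n"
    "150,6.5,1,Apple\n"
    "120,,0,Banana\n"
    "200,big,0,Orange\n"
)
for problem in validate_classification_csv(sample):
    print(problem)
```

Collecting every problem instead of stopping at the first one makes it easier to clean a file in a single pass.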

EXAMPLE DATASET

You can find an example of the CSV file in the official repository.
