Clustering

The dataset for a clustering algorithm such as K-Means has a simple structure. Each row represents one sample, and each column holds one piece of information about that sample. There is no target column: the model learns to group the samples by identifying similarities and differences between them.

STRUCTURE

  1. Rows

    • Each row represents one sample (an element to be grouped, such as a customer, a product, or any other record).

  2. Columns (X)

    • Each column contains one property or characteristic of the sample, also called a feature. Each feature is a numerical value that the algorithm uses to calculate similarities and differences.
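To make "calculating similarities and differences" concrete, here is a minimal sketch in Python. It assumes similarity is measured with the Euclidean distance, which is what standard K-Means uses; whether DelphAI computes it exactly this way is an assumption. The function name and the sample values are illustrative and not part of DelphAI.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two samples that share the same features."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical samples: [number of purchases, total spent]
customer_a = [5, 300.50]
customer_b = [3, 150.75]
customer_c = [10, 1200.00]

# A smaller distance means the two samples are more similar.
print(euclidean_distance(customer_a, customer_b))
print(euclidean_distance(customer_a, customer_c))
```

In this sketch, customer_a ends up much closer to customer_b than to customer_c, so a clustering algorithm would tend to place the first two in the same group.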

EXAMPLE

Imagine you want to group e-commerce customers based on two characteristics:

  1. Number of purchases made

  2. Total spent

The dataset would look like this:

Number of purchases made | Total spent
-------------------------|------------
5                        | 300.50
10                       | 1200.00
3                        | 150.75
8                        | 800.00
2                        | 100.00

  • Line 1 (header): Column names that represent the properties (optional).

  • Subsequent lines: Data for each sample, one sample per line. In a CSV file, the values are separated by commas.
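The following sketch shows how a K-Means style algorithm could group the five customers above into two clusters. It is a plain Python illustration of the general technique, not DelphAI's implementation: the function name `kmeans`, the choice of the first `k` samples as starting centroids, and the fixed iteration count are all assumptions made for this example.

```python
def kmeans(samples, k, iterations=10):
    """Basic K-Means. Returns (centroids, labels).

    Assumption for this sketch: the first k samples are the initial centroids.
    """
    centroids = [list(s) for s in samples[:k]]
    labels = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: attach each sample to its nearest centroid
        # (squared Euclidean distance).
        for i, s in enumerate(samples):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Rows of the example dataset: [number of purchases, total spent]
data = [
    [5, 300.50],
    [10, 1200.00],
    [3, 150.75],
    [8, 800.00],
    [2, 100.00],
]

centroids, labels = kmeans(data, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

With these starting centroids, the customers with 10 and 8 purchases form one group and the other three form the second. Because "Total spent" has a much larger range than "Number of purchases made", it dominates the distance in this sketch; scaling the features beforehand is a common way to balance them.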

DATA IMPORT

All DelphAI objects can import the dataset from a CSV file or from a TDataset.

  1. CSV:

    • The CSV file must follow the same format as the table above.

    • Example of a CSV file:

      ParamA,ParamB
      5,300.50
      10,1200.00
      3,150.75
      8,800.00
      2,100.00

  2. TDataset/Query:

    • The dataset can be stored in a relational database.

    • Example of the rows returned by a SELECT query on the database table:

      ParamA | ParamB
      -------|--------
      5      | 300.50
      10     | 1200.00
      3      | 150.75
      8      | 800.00
      2      | 100.00

    • Use an SQL query to select the data:

      SELECT * FROM Clients;
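As an illustration of the query-based route, the sketch below builds a throwaway in-memory SQLite table and reads the feature rows back with a SELECT. It uses Python's standard `sqlite3` module only to show the shape of the data; it is not how DelphAI connects to a database. The table name `Clients` comes from the query above, while the column types are assumptions.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Clients (ParamA INTEGER, ParamB REAL)")
conn.executemany(
    "INSERT INTO Clients (ParamA, ParamB) VALUES (?, ?)",
    [(5, 300.50), (10, 1200.00), (3, 150.75), (8, 800.00), (2, 100.00)],
)

# Naming the feature columns explicitly keeps any other columns
# (ids, names, dates) out of the result set.
rows = conn.execute("SELECT ParamA, ParamB FROM Clients").fetchall()
conn.close()

for row in rows:
    print(row)
```

If the real table also holds non-numeric columns, listing only the feature columns in the SELECT, rather than using `*`, is one way to keep the result compatible with the numeric-only rule.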

RULES AND TIPS FOR CREATING THE DATASET

  1. Data Consistency

    • All rows must have the same number of columns.

    • All values in the columns must be in numeric format.

  2. No Missing Values:

    • Every cell must have a value (no "gaps" are allowed).
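The rules above can be checked before importing a file. The following is a minimal sketch of such a check in Python, assuming a comma-separated file with an optional header line; the function name `validate_dataset` is hypothetical and not part of DelphAI.

```python
import csv
import io

def validate_dataset(csv_text, has_header=True):
    """Return a list of problems found; an empty list means the rules hold."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    problems = []
    expected = len(rows[0]) if rows else 0
    for line_no, row in enumerate(rows, start=1):
        # Rule: all rows must have the same number of columns.
        if len(row) != expected:
            problems.append(
                f"line {line_no}: expected {expected} columns, found {len(row)}"
            )
            continue
        if has_header and line_no == 1:
            continue  # column names are not required to be numeric
        for col_no, cell in enumerate(row, start=1):
            if cell.strip() == "":
                # Rule: no missing values.
                problems.append(f"line {line_no}, column {col_no}: missing value")
                continue
            try:
                float(cell)  # Rule: every value must be numeric.
            except ValueError:
                problems.append(
                    f"line {line_no}, column {col_no}: not numeric: {cell!r}"
                )
    return problems

good = "ParamA,ParamB\n5,300.50\n10,1200.00\n"
bad = "ParamA,ParamB\n5,\n10,abc\n3\n"

print(validate_dataset(good))
for problem in validate_dataset(bad):
    print(problem)
```

The second sample deliberately breaks each rule once: a missing value, a non-numeric value, and a row with too few columns.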

EXAMPLE DATASET

You can find an example of the CSV file in the official repository.
