Clustering

A dataset for a clustering algorithm such as K-Means has a simple structure. Each row represents one sample, and each column holds one piece of information about that sample. From this data, the AI model learns to identify similarities and differences between the samples.

STRUCTURE

Rows
- Each row represents one sample (an element to be grouped, such as a customer, a product, or any other record).
Columns (X)
- Each column contains one property or characteristic of the sample, also called a feature. Every feature is a numerical value that the algorithm uses to calculate similarities and differences between samples.
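
To make "calculate similarities and differences" concrete, the sketch below computes the Euclidean distance between samples, the measure K-Means conventionally uses. It is plain Python for illustration only and does not use the DelphAI API; the function name `euclidean_distance` and the variable names are hypothetical, and the values come from the example dataset on this page.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two samples with the same features."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each sample is one row: (number of purchases, total spent).
customer_a = (5, 300.50)
customer_b = (3, 150.75)
customer_c = (10, 1200.00)

# A smaller distance means the two samples are more similar.
print(euclidean_distance(customer_a, customer_b))
print(euclidean_distance(customer_a, customer_c))
```

Note that the column with the larger range (total spent) dominates the distance here, which is why features are often scaled to comparable ranges before clustering.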

EXAMPLE

Imagine you want to group e-commerce customers based on two characteristics:

- Number of purchases made
- Total spent

The dataset would look like this:

Number of purchases made | Total spent
-------------------------|------------
5                        | 300.50
10                       | 1200.00
3                        | 150.75
8                        | 800.00
2                        | 100.00

Line 1 (header): Column names that identify the properties (optional).
Subsequent lines: Data for each sample, one sample per line. In a CSV file, the values are separated by commas.

DATA IMPORT

Every DelphAI object can import the dataset from a CSV file or from a TDataset.

CSV:
- The CSV file must follow the same format as the table above.
- Example of a CSV file:
  ParamA,ParamB
  5,300.50
  10,1200.00
  3,150.75
  8,800.00
  2,100.00
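
As an illustration of this format, the following sketch parses CSV text with an optional header line into numeric rows. It uses only the Python standard library and is not how DelphAI itself reads the file; `load_samples` and `is_number` are hypothetical helper names, and the rule used to detect the header (any non-numeric cell in the first line) is an assumption.

```python
import csv
import io

CSV_TEXT = """ParamA,ParamB
5,300.50
10,1200.00
3,150.75
8,800.00
2,100.00
"""

def is_number(text):
    """True when the cell can be read as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def load_samples(text):
    """Return (header, rows); header is None when the first line is data."""
    lines = [line for line in csv.reader(io.StringIO(text)) if line]
    header = None
    # The header is optional: treat the first line as a header only if
    # it contains at least one non-numeric cell.
    if lines and not all(is_number(cell) for cell in lines[0]):
        header, lines = lines[0], lines[1:]
    rows = [[float(cell) for cell in line] for line in lines]
    return header, rows

header, rows = load_samples(CSV_TEXT)
print(header)
print(rows)
```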

TDataset/Query:

The dataset can also be stored in a relational database and loaded through a query.

Example of the rows returned by a SELECT query on the table:

ParamA | ParamB
-------|--------
5      | 300.50
10     | 1200.00
3      | 150.75
8      | 800.00
2      | 100.00

Use an SQL query to select the data:
```
SELECT * FROM Clients;
```
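
The sketch below reproduces this query against an in-memory SQLite database so the result can be inspected. SQLite stands in for whatever database you actually use, the column types are assumptions, and the code does not involve DelphAI or a TDataset. It names the two feature columns explicitly instead of using `*`, so that only numeric features are returned even if the table has other columns.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Clients (ParamA INTEGER, ParamB REAL)")
conn.executemany(
    "INSERT INTO Clients (ParamA, ParamB) VALUES (?, ?)",
    [(5, 300.50), (10, 1200.00), (3, 150.75), (8, 800.00), (2, 100.00)],
)

# Each returned tuple is one sample; each position is one feature.
rows = conn.execute("SELECT ParamA, ParamB FROM Clients").fetchall()
conn.close()
print(rows)
```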

RULES AND TIPS FOR CREATING THE DATASET

Data Consistency
- All rows must have the same number of columns.
- All values in the columns must be in numeric format.
No Missing Values
- Every cell must have a value (no "gaps" are allowed).
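
These rules can be checked before importing the data. The following is a minimal sketch of such a check in plain Python; `validate_dataset` is a hypothetical helper, not part of DelphAI, and it assumes the rows have already been read as lists of text cells without the header line.

```python
def validate_dataset(rows):
    """Return a list of problems found; an empty list means the rows pass."""
    problems = []
    # The first row defines how many columns every other row must have.
    expected = len(rows[0]) if rows else 0
    for index, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(
                f"row {index}: expected {expected} columns, found {len(row)}"
            )
        for column, cell in enumerate(row, start=1):
            text = str(cell).strip()
            if text == "":
                problems.append(f"row {index}, column {column}: missing value")
                continue
            try:
                float(text)
            except ValueError:
                problems.append(
                    f"row {index}, column {column}: not numeric ({text!r})"
                )
    return problems

good = [["5", "300.50"], ["10", "1200.00"]]
bad = [["5", "300.50"], ["10", ""], ["3"], ["8", "abc"]]
print(validate_dataset(good))
print(validate_dataset(bad))
```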

EXAMPLE DATASET

You can find an example of the CSV file in the official repository.
