Clustering
The dataset for a clustering algorithm such as K-Means has a simple structure. Each row represents one sample, and each column holds one piece of information about that sample. The model uses these values to identify similarities and differences between samples and to group similar samples together.
STRUCTURE
Rows
Each row represents one sample (an element to be grouped, such as a customer, a product, or any other record).
Columns (X)
The columns contain the properties or characteristics of the sample, also called features. Each feature is a numerical value that the algorithm uses to calculate similarities and differences between samples.
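To make "calculating similarities and differences" concrete, here is a minimal Python sketch that measures how far apart two samples are. It assumes Euclidean distance, the usual choice for K-Means; the text above does not state which metric DelphAI uses internally, and the function name and sample values are illustrative only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each inner list is one row (sample); each position is one column (feature).
# These values are made up for illustration.
samples = [
    [1.0, 2.0],
    [1.0, 3.0],
    [4.0, 6.0],
]

# A smaller distance means the two samples are more similar.
d_near = euclidean_distance(samples[0], samples[1])
d_far = euclidean_distance(samples[0], samples[2])
print(d_near, d_far)
```

Under this assumption, the first two samples would tend to land in the same group, because they are much closer to each other than either is to the third.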
EXAMPLE
Imagine you want to group e-commerce customers based on two characteristics:
Number of purchases made
Total spent.
The dataset would look like this:
Number of purchases made    Total spent
5                           300.50
10                          1200.00
3                           150.75
8                           800.00
2                           100.00
Line 1 (header): Column names that represent the properties (optional).
Subsequent lines: Data for each sample, with values separated by commas.
DATA IMPORT
Every DelphAI object can import the dataset from either a CSV file or a TDataset.
CSV:
The CSV file must follow the same format as the table above.
Example of a CSV file:
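As a minimal sketch of that layout, the Python snippet below serializes the example customer data to CSV text with the standard csv module. The header names purchases and total_spent are assumptions chosen for illustration (the header line is optional), and the helper to_csv_text is hypothetical, not part of DelphAI.

```python
import csv
import io

# Hypothetical column names; pass header=None to omit the header line.
header = ["purchases", "total_spent"]

# One tuple per sample: (number of purchases, total spent).
rows = [
    (5, 300.50),
    (10, 1200.00),
    (3, 150.75),
    (8, 800.00),
    (2, 100.00),
]

def to_csv_text(header, rows):
    """Build comma-separated text: optional header, then one line per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header:
        writer.writerow(header)
    for purchases, total in rows:
        # Keep two decimal places so the amounts match the table.
        writer.writerow([purchases, f"{total:.2f}"])
    return buffer.getvalue()

csv_text = to_csv_text(header, rows)
print(csv_text)
# purchases,total_spent
# 5,300.50
# 10,1200.00
# 3,150.75
# 8,800.00
# 2,100.00
```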
TDataset/Query:
The dataset can be stored in a relational database and loaded with an SQL SELECT query that returns one row per sample and one numeric column per feature.
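A rough sketch of such a query follows, using Python's built-in sqlite3 module as a stand-in database. The table name customers and the column names purchases and total_spent are assumptions for illustration. The snippet shows only the shape of the SELECT, not how a TDataset is wired into DelphAI.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " purchases INTEGER NOT NULL,"
    " total_spent REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (purchases, total_spent) VALUES (?, ?)",
    [(5, 300.50), (10, 1200.00), (3, 150.75), (8, 800.00), (2, 100.00)],
)

# Select only the feature columns: one row per sample, every column numeric.
query = "SELECT purchases, total_spent FROM customers ORDER BY id"
rows = conn.execute(query).fetchall()
conn.close()

for purchases, total_spent in rows:
    print(purchases, total_spent)
```

Identifier columns such as id are left out of the SELECT because they are not features. Declaring the feature columns NOT NULL is one way to keep missing values out of the result.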
RULES AND TIPS FOR CREATING THE DATASET
Data Consistency
All rows must have the same number of columns.
All values in the columns must be in numeric format.
No Missing Values
Every cell must have a value (no "gaps" are allowed).
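These rules can be checked before the data is imported. The sketch below is one possible pre-import check in Python; the function validate_dataset is hypothetical and not part of DelphAI. Treating NaN or infinite values as non-numeric, and reporting an empty dataset as a problem, are assumptions added here.

```python
import math

def validate_dataset(rows):
    """Return a list of problems found; an empty list means the rows pass."""
    if not rows:
        return ["dataset has no rows"]
    problems = []
    expected = len(rows[0])  # every row must match the first row's width
    for i, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(
                f"row {i}: expected {expected} columns, found {len(row)}"
            )
        for j, cell in enumerate(row, start=1):
            text = str(cell).strip()
            if text == "":
                problems.append(f"row {i}, column {j}: missing value")
                continue
            try:
                value = float(text)
            except ValueError:
                problems.append(f"row {i}, column {j}: not numeric ({text!r})")
                continue
            if not math.isfinite(value):
                problems.append(f"row {i}, column {j}: not a finite number")
    return problems

good = [["5", "300.50"], ["10", "1200.00"]]
bad = [["5", "300.50"], ["10"], ["3", ""], ["x", "1"]]

print(validate_dataset(good))  # no problems reported
for problem in validate_dataset(bad):
    print(problem)
```

Returning a list of messages rather than raising on the first error lets every offending cell be fixed in one pass.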
EXAMPLE DATASET
You can find an example of the CSV file in the official repository.