# DidaWiki


# Guidelines for the homework on clustering

• Data Understanding: a preliminary step to capture data properties that can help the clustering analysis (8 points)
  • Distribution analysis and suitable transformation of variables
  • Elimination of redundant variables by correlation analysis
• Clustering analysis by K-means (15 points)
  • Identification of the best value of k
  • Characterization of the obtained clusters, using both an analysis of the k centroids and a comparison of the variable distributions within each cluster against those in the whole dataset
• Analysis by density-based clustering (7 points)
  • Study of the clustering parameters
  • Characterization and interpretation of the obtained clusters
• Analysis by hierarchical clustering (optional, 3 points)
  • Analysis to be performed on a sample of the data for scalability reasons
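
As an illustration of two of the steps above (removing redundant variables by correlation, and choosing k for K-means), here is a minimal sketch in plain Python. It is not the course's reference solution: the helper names `drop_redundant` and `kmeans_sse`, the 0.9 correlation cutoff, and the toy points are all assumptions made for the example. It also assumes the variables have already been rescaled, since distances in meters, durations in seconds, and counts are not directly comparable.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(columns, threshold=0.9):
    """Greedily keep a variable only if |r| < threshold against all kept ones.

    `columns` maps variable name -> list of values.
    The 0.9 cutoff is an arbitrary choice for this sketch.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def kmeans_sse(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on tuples; returns (centroids, SSE)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse


# Toy data (made up): two compact groups, so the SSE drops sharply at k = 2.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k in (1, 2, 3):
    print(k, round(kmeans_sse(toy, k)[1], 2))
```

Plotting the SSE against k and looking for the point where the curve flattens is one common heuristic for choosing k; other criteria, such as the silhouette score, are equally legitimate. On the real dataset a library implementation would normally replace the hand-written loop.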

# Description of the variables

For each car driver we observe the following quantities, measured over a certain time window of mobile activity:

Length = total traveled distance (m.)
Duration = total time spent driving (sec.)
Count = number of different trips
Phighway = distance traveled on highways (m.)
Pcity = distance traveled inside cities (m.)
Length_arc_crowded = distance traveled on the 20% most crowded roads (m.)
Pnight = distance traveled at night time (m.)
Pover = distance traveled over speed limit (m.)
Profile = number of systematic trips, e.g., work-home
Radius_g = radius of gyration: dispersion of the driver's locations around their center of mass (mean position)
Radius_g_L1 = radius of gyration w.r.t. L1: dispersion of the driver's locations around the driver's most frequent location L1 (e.g., home)
Avg_Dist_L1 = average distance of the driver's locations from the most frequent location L1 (e.g., home)
TimeL1L2 = percentage of time spent at locations L1 and L2 (the most and second most preferred locations)
EntropyArc = entropy on road segment frequencies, measures the diversity of roads traveled
EntropyLocation = entropy on location frequencies, measures the diversity of places visited
EntropyTime = entropy on hours of the day, measures the diversity of daily patterns
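
To make the entropy and radius-of-gyration variables more concrete, the sketch below computes them with their textbook definitions: Shannon entropy over observed frequencies, and the root mean square distance of locations from a center. The exact formulas used to build the dataset are not given here, so this is only an assumption about how such quantities are typically obtained. The function names and the planar coordinates in meters are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Entropy (in bits) of the empirical frequencies of the observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def radius_of_gyration(points, center=None):
    """Root mean square distance of the points from `center`.

    With center=None the center of mass (mean position) is used, as in
    Radius_g; passing the most frequent location gives the Radius_g_L1 variant.
    """
    if center is None:
        center = tuple(sum(col) / len(points) for col in zip(*points))
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / len(points))


def average_distance(points, center):
    """Mean distance of the points from `center` (as in Avg_Dist_L1)."""
    return sum(math.dist(p, center) for p in points) / len(points)


# Hypothetical driver: visited places and planar coordinates in meters.
visits = ["home", "home", "work", "work"]
places = [(0.0, 0.0), (2000.0, 0.0)]
print(shannon_entropy(visits))
print(radius_of_gyration(places))
print(radius_of_gyration(places, center=(0.0, 0.0)))
```

A driver who always visits the same place has entropy 0, and the entropy grows as visits spread evenly over more places, which is why these variables measure diversity.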

Note that the dataset contains no missing values: every “0” is an actual zero, NOT a placeholder for a missing value.
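
Because zeros are genuine values, a transformation applied to a skewed variable must be defined at 0. One possible choice, shown here only as a sketch and not as a prescribed method, is log(1 + x) followed by standardization. The helper names and the sample values are made up for the example.

```python
import math
import statistics


def zero_share(values):
    """Fraction of observations that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def log1p_standardize(values):
    """Apply log(1 + x), then rescale to zero mean and unit variance.

    log1p maps 0 to 0, so true zeros stay valid after the transformation.
    """
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)
    return [(x - mean) / std if std else 0.0 for x in logged]


# Made-up, strongly right-skewed distances in meters, including a true zero.
pnight = [0, 10, 100, 1000, 100000]
print(zero_share(pnight))
print([round(z, 2) for z in log1p_standardize(pnight)])
```

Whether a given variable needs such a transformation should be decided from its distribution; a variable with a large share of zeros may deserve separate treatment.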

dm/start/clustering.txt · Last modified: 18/12/2012 at 14:20 (10 years ago) by Fosca Giannotti