Cluster Analysis for Machine Learning
We can find the optimal number of clusters using methods such as the elbow method with K-means.
- [ai-village-ctf](https://www.kaggle.com/competitions/ai-village-ctf)
- [elbow-method](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/)
Find Optimal Number of Clusters
K-means & Elbow Curve
Reference: https://www.kaggle.com/code/jonbown/ai-ctf-submissions?scriptVersionId=105606691&cellId=39
We may find the optimal number of clusters by running the K-means algorithm over a range of k values and observing the elbow graph.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
clusters = np.load("example.npy")
# specify the range of the number of clusters
K = range(1, 10)
distortions = []
for i in K:
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(clusters)
    distortions.append(kmeans.inertia_)

plt.plot(K, distortions)
plt.xlabel("Number of clusters")
plt.ylabel("Distortion")
plt.show()
Looking at the output graph, the "elbow", i.e. the last point where the distortion (inertia) drops sharply before the curve flattens out, is likely the optimal number of clusters.
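As a rough programmatic cross-check (this heuristic is an assumption, not part of the referenced notebook), we can pick the last k that still improves distortion noticeably:

import numpy as np

# Assumes `distortions` was collected by the loop above for K = range(1, 10).
# Relative improvement when going from k to k+1 clusters.
improvements = -np.diff(distortions) / np.array(distortions[:-1])

# Keep the last k whose next step still improves distortion by more than 10%
# (the threshold is arbitrary).
candidates = np.where(improvements > 0.1)[0]
optimal_k = int(candidates[-1]) + 2 if len(candidates) else 1
print(f"Estimated optimal k: {optimal_k}")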
Data Manipulation for Machine Learning
From an attack perspective on machine learning, we manipulate dataset values into unexpected ones. Inserting inappropriate (or nonsensical) values can destroy the performance of ML models. However, to achieve this, we need access to the training dataset.
Prepare Dataset
Before manipulating, load the dataset as a Pandas DataFrame.
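A minimal sketch, assuming the dataset is a CSV file (the file name example.csv is a placeholder):

import pandas as pd

# Load the training dataset to be manipulated (file name is an assumption).
df = pd.read_csv("example.csv")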
Data Analysis
Before attacking, we need to investigate the dataset and find the points where we can manipulate values to fool models and people.
# Information
df.info()
# Print descriptive statistics
df.describe()
# Dimensionality
df.shape
# Data types
df.dtypes
# Correlation of columns
df.corr()
# Histogram
df.hist()
Access Values
# The first 5 rows
df.head()
df.iloc[:5]
df.iloc[:5].values # as NumPy
# The first 10 rows
df.head(10)
df.iloc[:10]
df.iloc[:10].values # as NumPy
# The first 100 rows
df.head(100)
df.iloc[:100]
df.iloc[:100].values # as NumPy
# The last 5 rows
df.tail()
df.iloc[-5:]
df.iloc[-5:].values # as NumPy
# The last 10 rows
df.tail(10)
df.iloc[-10:]
df.iloc[-10:].values # as NumPy
# The last 100 rows
df.tail(100)
df.iloc[-100:]
df.iloc[-100:].values # as NumPy
# The first row
df.iloc[0]
df.iloc[[0]]
# The 1st and the 2nd rows
df.iloc[[0, 1]]
# From the 3rd row to the 8th row
df.iloc[2:8]
# The last row and all columns
df.iloc[-1:, :]
# All rows and first column
df.iloc[:, 0]
# Exclude the last row and all columns
df.iloc[:-1, :]
# Exclude the last column and all rows
df.iloc[:, :-1]
# Rows where 'Sex' is 'male'
df.loc[df['Sex'] == 'male']
# Rows where 'Age' is 18 or more
df.loc[df['Age'] >= 18]
# Rows where 'Name' contains 'Emily'
df.loc[df['Name'].str.contains('Emily')]
# Rows where 'Hobby' is 'Swimming' AND 'Age' is over 25
df.loc[(df['Hobby'] == 'Swimming') & (df['Age'] > 25)]
# Rows where 'Hobby' is 'Swimming' AND 'Age' is over 25 AND 'Age' is NOT 30
df.loc[(df['Hobby'] == 'Swimming') & (df['Age'] > 25) & ~(df['Age'] == 30)]
# Count for each column or row
df.count()
# Count occurrences grouped by specific column
df.groupby(['ColumnName']).size()
df['ColumnName'].value_counts()
Attacks
After analyzing the data, we're ready to attack it.
Value Overriding
Override values with abnormal or unexpected ones.
# Set 'Adult' to 0 for rows where 'Age' is 18 or higher
df.loc[df['Age'] >= 18, 'Adult'] = 0
# Set 'Adult' to 1 for rows where 'Age' is lower than 18
df.loc[df['Age'] < 18, 'Adult'] = 1
# Set 'Score' to -1 for all rows
df.loc[:, 'Score'] = -1
# Set 'Score' to 100 for the last 10 rows
df.loc[df.index[-10:], 'Score'] = 100
# Set John's score to 0 (...attacker may have a grudge against John)
df.loc[df['Name'] == 'John', 'Score'] = 0
# Replace unexpected values
df["Gender"] = df["Gender"].replace("male", 0)
df["Gender"] = df["Gender"].replace("female", -77)
Filling Missing (NaN) Values with Inappropriate Methods
Typically, NaN values are filled with the mean of the values. From an attack perspective, however, other methods can be used, e.g. max() or min().
# Fill with the maximum value
df["Income"] = df["Income"].fillna(df["Income"].max())
# Fill with the minimum value
df["Income"] = df["Income"].fillna(df["Income"].min())
Integrating Another Dataset
Integrating values from another dataset may fool ML models with fake values.
For example, suppose the following fake_scores.csv contains a fake score for each person. Creating a new DataFrame that integrates this fake dataset replaces all original scores with fake ones.
fake_scores_df = pd.read_csv('fake_scores.csv')
new_df = pd.DataFrame({ 'Name': df['Name'].values, 'Score': fake_scores_df['Score'].values })
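If the poisoned dataset needs to be written back so the training pipeline picks it up, it can be saved with to_csv (the output file name here is an assumption):

# Persist the poisoned dataset (file name is a placeholder).
new_df.to_csv('scores.csv', index=False)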
Removing Required Columns
Remove columns that are required to train the model. This is blatant and may not be useful, but it is noted here just in case.
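For example (a sketch; the column name 'Score' is a placeholder and depends on which features the target model requires):

# Drop a column the model needs for training ('Score' is an assumed name).
df = df.drop(columns=['Score'])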
Dimensionality Reduction for Machine Learning
Dimensionality reduction is a data processing technique that makes machine learning models easier to train.
- [ai-village-ctf](https://www.kaggle.com/competitions/ai-village-ctf)
PCA (Principal Component Analysis)
Reference: https://www.kaggle.com/code/jonbown/ai-ctf-submissions?scriptVersionId=105606691&cellId=42
We use PCA to find the optimal number of dimensions for the data.
import numpy as np
from sklearn.decomposition import PCA
data = np.load("example.npy")
for i in range(1, 10):
    pca = PCA(n_components=i)
    principal_components = pca.fit_transform(data)
    # Fraction of the variance explained by each of the i components
    print(pca.explained_variance_ratio_)
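Alternatively (a sketch, not from the referenced notebook), scikit-learn's PCA also accepts a float for n_components, keeping just enough components to explain that fraction of the variance; the 0.95 threshold below is an arbitrary assumption:

import numpy as np
from sklearn.decomposition import PCA

data = np.load("example.npy")

# Keep the smallest number of components explaining >= 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)
print(reduced.shape[1], "components explain", pca.explained_variance_ratio_.sum())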