Glossary
k-Anonymity: A privacy principle ensuring that each record is indistinguishable from at least k−1 other records based on specified quasi-identifiers, thereby protecting individual identities within a dataset.
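As a simple illustration, the sketch below (assuming a pandas DataFrame; the column names are hypothetical) computes a table's k value as the size of its smallest group of rows sharing identical quasi-identifier values:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group of rows that share
    identical values across all quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical example: birth year and ZIP code act as quasi-identifiers.
records = pd.DataFrame({
    "birth_year": [1980, 1980, 1980, 1975, 1975],
    "zip_code":   ["00100", "00100", "00100", "00200", "00200"],
    "diagnosis":  ["A", "B", "A", "C", "A"],  # sensitive attribute
})
print(k_anonymity(records, ["birth_year", "zip_code"]))  # 2, i.e. 2-anonymous
```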
epsilon (Privacy-Accuracy Trade-off): The parameter that sets the balance between privacy and accuracy in differential privacy. Lower epsilon values require more noise, strengthening privacy at the cost of accuracy.
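For instance, the Laplace mechanism is one standard way to realize differential privacy. A minimal sketch (the query value, sensitivity, and epsilon values below are hypothetical) shows how a smaller epsilon yields a larger noise scale:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value with Laplace noise of scale sensitivity/epsilon.
    Smaller epsilon -> larger scale -> more noise -> stronger privacy."""
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1000  # e.g. a count query, whose sensitivity is 1
for eps in (0.1, 1.0, 10.0):
    print(eps, laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps))
```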
Quasi-Identifying (QI): Attributes that do not uniquely identify an individual on their own but can potentially lead to re-identification when combined with other data. Examples include birth dates or zip codes.
Sensitive Attributes (SA): Data elements that disclose private or confidential information, such as medical conditions or financial details, requiring protection during anonymization.
Non-Sensitive Attributes (NSA): Information that does not compromise individual privacy when disclosed.
Identifiers (ID): Direct markers like names or social security numbers that can uniquely identify an individual. These are typically removed during the anonymization process.
Membership-Inference Attack (MIA): An attack that attempts to determine whether a particular individual's data was included in a dataset. The attacker has partial access to the original data and uses this knowledge to try to identify individuals in the anonymized dataset.
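A naive matching attack illustrates the idea; the function below is a hypothetical sketch (real attacks are considerably more sophisticated):

```python
import pandas as pd

def membership_inference(candidate: dict, anonymized: pd.DataFrame,
                         quasi_identifiers: list[str]) -> bool:
    """Naive membership guess: claim the candidate was in the source data
    if some anonymized row matches it on every quasi-identifier.
    Effective anonymization should make such guesses unreliable."""
    mask = pd.Series(True, index=anonymized.index)
    for col in quasi_identifiers:
        mask &= anonymized[col] == candidate[col]
    return bool(mask.any())
```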
Hamming Distance: A measure used to determine the difference between two strings of equal length by counting the number of positions at which the corresponding symbols are different. In privacy testing, it measures how much the original and the anonymized data differ.
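In code, the measure is a short counting loop; a minimal sketch (the example strings are hypothetical date values):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("20240101", "20240199"))  # 2: the last two characters differ
```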
F1 Score: A measure used in statistics to assess the accuracy of a test. It is the harmonic mean of precision (what proportion of positive identifications were actually correct) and recall (what proportion of actual positives were identified correctly).
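A sketch of the computation from confusion-matrix counts (the counts here are made up for illustration):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall
    (assumes tp + fp and tp + fn are nonzero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))  # precision 0.80, recall 0.67 -> F1 ~0.73
```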
True Positive Rate (TPR): Also known as recall, it measures the proportion of actual positives correctly identified. Lower TPR in anonymized data indicates stronger privacy protection.
False Positive Rate (FPR): The proportion of negatives incorrectly identified as positives. In anonymization, a higher FPR suggests more false alarms, enhancing privacy by reducing accurate re-identifications.
False Discovery Rate (FDR): The proportion of false positives (incorrect positive predictions) among all positive predictions made. A high FDR indicates many false alarms.
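The three rates above derive from the same confusion-matrix counts; this sketch (with made-up counts from a hypothetical re-identification test) computes them side by side:

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute TPR, FPR, and FDR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # recall: actual positives correctly found
        "FPR": fp / (fp + tn),  # negatives wrongly flagged as positive
        "FDR": fp / (fp + tp),  # false alarms among all positive calls
    }

# A low TPR and high FDR here would suggest the anonymized data
# resists accurate re-identification.
print(rates(tp=5, fp=20, tn=70, fn=5))  # TPR 0.5, FPR ~0.22, FDR 0.8
```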
Synthetic Data: Artificially generated data that maintains the statistical properties of the original dataset without containing actual personal information.
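As a deliberately simplistic sketch, drawing samples from a multivariate normal fitted to the original numeric data preserves its means and covariance structure without reusing any real record; production engines use far richer generative models:

```python
import numpy as np

def synthesize(data: np.ndarray, n_samples: int) -> np.ndarray:
    """Draw synthetic rows from a multivariate normal fitted to `data`
    (rows = records), preserving its means and covariances.
    Illustrative only; not how any particular engine works."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return np.random.multivariate_normal(mean, cov, size=n_samples)
```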