Does the training data include personal or sensitive information?

We do not intentionally train models on sensitive personal information. Details are provided below for each type of data used to train our generative AI models:

Unlabeled publicly available text data, user data, and synthetic data: These types of text data may contain personal or sensitive information, because the datasets are very large and drawn from a variety of domains and languages. We do, however, automatically pseudonymize certain sensitive fields in the data before training.
Internally annotated data: During annotation, any data found to contain personal information is removed.
