We do not intentionally train models on sensitive personal information. Additional details are provided below for each type of data that appears in our training data for generative AI models:
- Unlabeled publicly available text data, user data, and synthetic data: Because these datasets are very large and drawn from a variety of domains and languages, they may contain personal or sensitive information. Before training, we automatically pseudonymize certain sensitive fields in the data.
- Internally annotated data: Any data found to contain personal information during annotation is removed.
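
To illustrate what automatic pseudonymization of a sensitive field can look like in general, here is a minimal sketch. It is not a description of our actual pipeline: the function name `pseudonymize`, the email pattern, the `<EMAIL_…>` token format, and the use of a keyed hash are all assumptions made for this example, and email addresses stand in for whichever fields are actually covered.

```python
import hashlib
import hmac
import re

# Hypothetical pattern for one kind of sensitive field (email addresses).
# This is an illustrative assumption, not the set of fields actually handled.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text: str, secret_key: bytes) -> str:
    """Replace each email address with a stable, keyed placeholder token."""

    def _replace(match: re.Match) -> str:
        # Normalize case so the same address always maps to the same token.
        value = match.group(0).lower().encode("utf-8")
        # A keyed hash (HMAC) keeps tokens consistent across the dataset
        # while making them hard to reverse without the key.
        digest = hmac.new(secret_key, value, hashlib.sha256).hexdigest()
        return f"<EMAIL_{digest[:8]}>"

    return EMAIL_RE.sub(_replace, text)


if __name__ == "__main__":
    sample = "Contact alice@example.com or ALICE@example.com."
    print(pseudonymize(sample, b"example-key"))
```

In this sketch, the same address always maps to the same placeholder, so references within a text stay consistent even though the original value no longer appears.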