OpenAlex · Updated hourly · Last updated: 16.04.2026, 22:16

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Data Preparation, Collecting, Cleaning, and Managing Datasets in Generative AI

2026 · 0 citations · 3 authors

Abstract

High-quality, diverse, and well-structured data underpins model performance, fairness, and reliability, and this chapter details the collection, cleaning, and management practices that produce it. It reviews data sources, including public datasets, proprietary records, web scraping, and crowdsourcing, alongside ethical considerations such as consent, privacy, and bias mitigation. The chapter elaborates on cleaning methods such as normalization, outlier removal, and modality-specific preprocessing for text, images, and audio, drawing on popular tools such as Pandas, OpenCV, NLTK, and Librosa. It surveys recommendations for dataset structuring, versioning, scalability, security, and workflow automation to ensure sustainable management. Typical pitfalls, such as bias, scalability limits, data scarcity in narrow domains, and quality degradation during training, are paired with pragmatic solutions. Practical case studies cover the preparation of Wikipedia text for language generators, curated face datasets for GANs, and multimodal datasets for creative tools. The chapter concludes with future directions, including automation through AutoML, synthetic data integration, federated learning, and compliance with fast-evolving regulations. Overall, it emphasizes that disciplined, iterative data preparation is as essential as model architecture for realizing the full potential of generative AI, and it encourages practitioners to insist on quality and governance from the beginning.
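As a rough illustration of two cleaning steps the abstract names, outlier removal and normalization, here is a minimal sketch. It is not taken from the chapter: the function names `drop_outliers_iqr` and `min_max_normalize`, the choice of Tukey's IQR rule, and the sample values are all assumptions made for this example. It uses only the Python standard library rather than Pandas so that it runs without extra dependencies.

```python
from statistics import quantiles


def drop_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).

    Hypothetical helper for illustration; k=1.5 is a common convention,
    not a value prescribed by the chapter.
    """
    if len(values) < 2:
        return list(values)  # too few points to estimate quartiles
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample data with one obvious outlier (95).
raw = [10, 12, 11, 13, 12, 95]
cleaned = drop_outliers_iqr(raw)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```

Removing outliers before scaling matters here: if 95 stayed in the data, min-max scaling would squeeze the remaining values into a narrow band near zero.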

Topics

Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Data Analysis with R