This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Data Preparation, Collecting, Cleaning, and Managing Datasets in Generative AI
Citations: 0
Authors: 3
Year: 2026
Abstract
High-quality, diverse, and well-structured data undergirds model performance, fairness, and reliability, and this chapter details the collection, cleaning, and management practices that produce it. Data sources, including public datasets, proprietary records, web scraping, and crowdsourcing, are reviewed alongside ethical considerations such as consent, privacy, and bias mitigation. The chapter elaborates on cleaning methods such as normalization, outlier removal, and modality-specific preprocessing for text, images, and audio, using popular tools such as Pandas, OpenCV, NLTK, and Librosa. Recommendations for dataset structuring, versioning, scalability, security, and workflow automation are surveyed to ensure sustainable management. Typical pitfalls, such as bias, scalability limits, data scarcity in narrow domains, and quality degradation during training, are paired with pragmatic solutions. Practical case studies cover the preparation of Wikipedia text for language generators, curated face datasets for GANs, and multimodal datasets for creative tools. The chapter concludes with future directions, including automation through AutoML, synthetic data integration, federated learning, and compliance with fast-evolving regulations. Overall, it argues that disciplined, iterative data preparation is as essential as model architecture for realizing the full potential of generative AI, and it encourages practitioners to insist on quality and governance from the beginning.
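As a rough illustration of the cleaning methods the abstract names (outlier removal followed by normalization), the following is a minimal sketch using only the Python standard library rather than Pandas. The function names, the sample values, and the choice of the 1.5 × IQR rule and min-max scaling are assumptions made for illustration; the abstract does not say which specific techniques the chapter uses.

```python
import statistics


def drop_outliers_iqr(values, k=1.5):
    """Discard values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly into [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample data with one obvious outlier (95.0).
raw = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]
cleaned = drop_outliers_iqr(raw)
scaled = min_max_normalize(cleaned)
```

Removing outliers before scaling matters here: if 95.0 were kept, min-max scaling would compress the remaining values into a narrow band near zero.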
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,460 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,341 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,791 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,536 citations