In this homework, you will investigate existing pre-training datasets and then replicate a data pre-processing pipeline that converts raw web-crawled data into a cleaned dataset that is easy and efficient to train on. Along the way, you will explore (a) large-scale pre-training datasets and their properties [Problem 1]; (b) the steps involved in cleaning, reducing, diversifying [Problem 2.1-2.2], and de-duplicating [Problem 2.3] pre-training data; and (c) converting the result into a standardized format for use in pre-training [Problem 2.4]. Submit to Gradescope your code for mini_ccc.py and homework.py, along with a report.pdf containing your answers to each of the deliverables.

If you wish to use late days, please fill out this form.