In this homework, you will investigate existing pre-training datasets and then replicate a data pre-processing pipeline that converts raw web-crawled data into a cleaned dataset that is easy and efficient to train on. After completing this homework, you should understand (a) the attributes of large-scale pre-training datasets and the formats in which they are available; (b) the steps involved in cleaning, reducing, and diversifying pre-training data; and (c) how to convert pre-training data into a standardized format that is easier to train on and to share. Submit to Gradescope your code for mini_ccc.py and homework.py, along with a report.pdf containing your answers to each of the deliverables.

Revision Log

  • 09/02/2024 - starter code updated to v2
    • Line 66 of homework.py changed to cleaned_nopii_text = replace_pii(cleaned_text)
    • In utils.py, the count += 1 lines were moved just before the yields.