Assignment #1 - Preparing and Exploring Pre-Training Data
Download [questions] [starter code] [submission template]
In this homework, you will investigate existing pre-training datasets, then replicate a data pre-processing pipeline that converts raw web-crawled data into a cleaned dataset that is easy and efficient to train on. Through this homework, you should gain a basic understanding of (a) the attributes of large-scale pre-training data and the formats in which it is available; (b) the different steps involved in cleaning, reducing, and diversifying pre-training data; and (c) how to convert your pre-training data into a standardized format that is easier to use in training and more accessible. Submit to GradeScope your code for mini_ccc.py and homework.py, as well as a report.pdf with your answers to each of the deliverables.
Update: If you wish to use late days, please fill out this Google Form.
Revision Log
- 09/02/2024 - starter code updated to v2
  - Line 66 of homework.py changed to: cleaned_nopii_text = replace_pii(cleaned_text)
  - In utils.py, the count += 1 lines were moved to just before the yield statements.