CISC-683: Data Mining

Public Data Repositories

    Links to sites with publicly available datasets. Note that the datasets provided at the different sites overlap:

  1. University of California Irvine Data Mining Repository: a large repository of datasets that serves as a benchmark for comparing data mining techniques

  2. University of California Irvine Machine Learning Repository: a large repository of datasets supplied by individuals, with some overlap with the Data Mining Repository

  3. ACM Data Mining and Knowledge Discovery Cup Center: contains links to instructions and datasets for the annual KDD contest

  4. Links to a variety of large datasets: These datasets are very large, but many of them are not well described

  5. Links to datasets: Many of these are purely statistical or lack descriptions of their attributes, and so may not be of much use. However, others (such as the baseball dataset) are interesting

  6. Financial and Economic Datasets (with considerable overlap among them):
    First link
    Second link
    Third link

  7. Asteroid dataset

  8. Insurance dataset: This dataset was used in the CoIL (Computational Intelligence and Learning Cluster) competition.