Synthetic Data Generation for Machine Learning Algorithms
Defense Date:
Machine learning has the potential to improve decisions and outcomes in many scientific and real-world areas, yet many application areas lack sufficient data for analysis, simulation, and the development of analytical approaches. One way to overcome this data availability problem is to use synthetic data as an alternative to real data. Synthetic data are simulated from real data by using the underlying statistical properties of the real data to produce synthetic datasets that exhibit the same statistical properties. In this research, we will implement a framework that generates synthetic data for a music-streaming service using Monte Carlo simulation, which allows us to predefine the statistical properties of the generated datasets. We will then test the generated data on both predictor and recommender models to evaluate the level of control we have over the data and its impact on the results.
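
To make the idea of predefining statistical properties concrete, the following is a minimal sketch of Monte Carlo generation of synthetic listening events. It is not the framework developed in this research: the genre probabilities, the exponential distribution of plays per user, and all names (GENRE_PROBS, MEAN_PLAYS_PER_USER, generate_listening_events, empirical_genre_shares) are assumptions chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical target properties, chosen purely for illustration.
GENRE_PROBS = {"rock": 0.5, "pop": 0.3, "jazz": 0.2}
MEAN_PLAYS_PER_USER = 20.0  # scale of the exponential draw for plays per user


def generate_listening_events(n_users, seed=0):
    """Draw synthetic (user_id, genre) play events by Monte Carlo sampling."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    genres = list(GENRE_PROBS)
    weights = [GENRE_PROBS[g] for g in genres]
    events = []
    for user_id in range(n_users):
        # Number of plays: at least 1, drawn from an exponential distribution.
        n_plays = 1 + int(rng.expovariate(1.0 / MEAN_PLAYS_PER_USER))
        # Each play's genre is sampled from the predefined probabilities.
        for genre in rng.choices(genres, weights=weights, k=n_plays):
            events.append((user_id, genre))
    return events


def empirical_genre_shares(events):
    """Return the observed share of plays per genre in the generated data."""
    counts = Counter(genre for _, genre in events)
    total = sum(counts.values())
    return {genre: counts[genre] / total for genre in GENRE_PROBS}


if __name__ == "__main__":
    events = generate_listening_events(n_users=2000, seed=1)
    print(empirical_genre_shares(events))
```

Comparing the observed genre shares with the values in GENRE_PROBS gives a simple check of how closely the generated dataset follows the properties that were set in advance; with enough sampled events the two should agree closely.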
