Use of hash functions in solving machine learning tasks

Defense Date:

This thesis covers two subjects: the use of hash functions in preparing data for machine learning tasks, and attempts to optimize selected methods. First, hash functions are introduced: their definition and classification are given with examples, and the functions used in this thesis, such as MurmurHash3, are described in more detail. The principles of the machine learning algorithms used in the thesis are also explained. Next, an experiment is described that measures classification accuracy as a function of the reduced number of dimensions and of the number of collisions, that is, cases in which different inputs produce the same hash value. Classification time and memory consumption were also measured. Then, χ² test values were computed between each categorical variable in the task and the observation classes, as a function of the number of possible hash function output values, with the feature values hashed. The results confirmed that these relationships exist. A method is proposed for choosing the dimensionality based on the χ² test, without the need to train models to make this selection. Finally, two methods that use hash functions to prepare data for machine learning tasks have been optimized.
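The ideas summarized above (hashing categorical values into a fixed number of buckets, counting collisions, and relating the hashed values to the classes with a χ² statistic) can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names `bucket`, `count_collisions`, and `chi2_statistic` are hypothetical, and MD5 from the Python standard library is used only as a stand-in for MurmurHash3, which the standard library does not provide.

```python
import hashlib
from collections import Counter


def bucket(value: str, n_buckets: int) -> int:
    """Map a categorical value to a bucket index in [0, n_buckets).

    Assumption: MD5 is a stand-in here; the thesis uses MurmurHash3.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_buckets


def count_collisions(values, n_buckets: int) -> int:
    """Count distinct values that end up sharing a bucket with another value."""
    distinct = set(values)
    used_buckets = {bucket(v, n_buckets) for v in distinct}
    return len(distinct) - len(used_buckets)


def chi2_statistic(values, labels, n_buckets: int) -> float:
    """Pearson chi-squared statistic of the bucket-by-class contingency table."""
    observed = Counter((bucket(v, n_buckets), y) for v, y in zip(values, labels))
    row_totals = Counter()
    col_totals = Counter()
    for (b, y), count in observed.items():
        row_totals[b] += count
        col_totals[y] += count
    total = sum(row_totals.values())

    stat = 0.0
    for b in row_totals:
        for y in col_totals:
            expected = row_totals[b] * col_totals[y] / total
            stat += (observed.get((b, y), 0) - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    colors = ["red", "green", "blue", "red", "green", "blue"]
    classes = [0, 1, 1, 0, 1, 1]
    for n in (1, 2, 4, 8):
        print(n, count_collisions(colors, n), chi2_statistic(colors, classes, n))
```

With a single bucket, every distinct value collides and the statistic is zero, because the hashed feature no longer carries any information about the class. Comparing the statistic across candidate bucket counts is the kind of signal a χ²-based choice of dimensionality could rely on, without training a model for each candidate.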