Entity labeling in business documents based on contextual information

Defense Date: October 15, 2020

In this thesis, I address the problem of automatic entity labeling in business documents. I propose a solution for document corpora prepared for this purpose, whose layout and content correspond to typical texts in the domain. The mechanism I designed and implemented improves on the results of the reference solution, which is one component of an information extraction system. The annotated data is used to train a named entity recognition model. My approach uses contextual information about entity mentions, and its main aim is to improve precision. I use natural language processing and machine learning methods, in particular pretrained language models, clustering, and outlier detection, selecting the algorithms and techniques best suited to the specified datasets. After thorough experiments, I compare the final results of the mechanism with those of the reference solution. I also propose a number of directions for further development of this project.


Paweł Zawistowski
