An Extensible Parsing Pipeline for Unstructured Data Processing

Shubham Jain, Amy De Buitleir, Enda Fallon

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Citation (Scopus)

Abstract

Network monitoring and diagnostics systems depict the running system's state and generate enormous amounts of unstructured data through log files, print statements, and other reports. Manually analyzing all these files is not feasible: resources are limited, and custom parsers must be developed to convert the unstructured data into desirable file formats. Prior research focuses on rule-based and relationship-based methods for parsing unstructured data into structured file formats; these methods are labor-intensive and need large annotated datasets. This paper presents an unsupervised text processing pipeline that analyzes such text files, removes extraneous information, identifies tabular components, and parses them into a structured file format. The proposed approach is resilient to changes in the data structure, does not require training data, and is domain-independent. We experiment with and compare topic modeling and clustering approaches to verify the accuracy of the proposed technique. Our findings indicate that combining similarity and clustering algorithms to identify data components yields better accuracy than topic modeling.
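The idea of using similarity plus clustering to find tabular components can be illustrated with a small sketch. This is not the authors' implementation; the abstract does not name the specific similarity measure or clustering algorithm. The version below assumes a character-level similarity (`difflib.SequenceMatcher`) over a coarse per-line "shape" signature and a greedy single-pass clustering. The function names `shape`, `cluster_lines`, and `extract_table`, and the `threshold` value, are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def shape(line: str) -> str:
    """Reduce a line to a coarse structural signature.

    Letter runs become 'a', digit runs become '9', and whitespace runs
    become a single space, so rows of the same table look alike.
    (Assumed heuristic, not taken from the paper.)
    """
    s = re.sub(r"[A-Za-z]+", "a", line.strip())
    s = re.sub(r"\d+", "9", s)
    return re.sub(r"\s+", " ", s)


def cluster_lines(lines, threshold=0.8):
    """Greedy single-pass clustering of non-blank lines by shape similarity.

    A line joins the first cluster whose representative shape has a
    SequenceMatcher ratio of at least `threshold`; otherwise it starts
    a new cluster. Returns a list of (representative_shape, indices).
    """
    clusters = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # blank lines carry no structure
        sig = shape(line)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, sig).ratio() >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return clusters


def extract_table(lines, threshold=0.8):
    """Treat the largest cluster as the tabular component and split its rows."""
    clusters = cluster_lines(lines, threshold)
    if not clusters:
        return []
    _, members = max(clusters, key=lambda c: len(c[1]))
    return [lines[i].split() for i in members]


if __name__ == "__main__":
    report = [
        "INFO starting monitor",
        "",
        "eth0 1500 1024 0",
        "eth1 1500 2048 3",
        "lo 65536 12 0",
        "done.",
    ]
    for row in extract_table(report):
        print(row)
```

Taking the largest cluster as "the table" is a simplification: a real report may hold several tables or none, so a practical pipeline would likely keep every cluster above a minimum size and write each one out as a separate structured file.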

Original language: English
Title of host publication: 24th International Conference on Advanced Communication Technology
Subtitle of host publication: Artificial Intelligence Technologies toward Cybersecurity!!, ICACT 2022 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 312-318
Number of pages: 7
ISBN (Electronic): 9791188428090
ISBN (Print): 9791188428090
Publication status: Published - 2022
Event: 24th International Conference on Advanced Communication Technology, ICACT 2022 - Virtual, Online, Korea, Republic of
Duration: 13 Feb 2022 to 16 Feb 2022

Publication series

Name: International Conference on Advanced Communication Technology, ICACT
Volume: 2022-February
ISSN (Print): 1738-9445

Conference

Conference: 24th International Conference on Advanced Communication Technology, ICACT 2022
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 13/02/22 to 16/02/22

Keywords

  • Clustering
  • Information Extraction
  • Topic Modeling
  • Unsupervised Data Mining
