TY - GEN
T1 - An Extensible Parsing Pipeline for Unstructured Data Processing
AU - Jain, Shubham
AU - Buitleir, Amy De
AU - Fallon, Enda
N1 - Publisher Copyright: © 2022 Global IT Research Institute-GiRI.
PY - 2022
Y1 - 2022
N2 - Network monitoring and diagnostics systems depict the running system's state and generate enormous amounts of unstructured data through log files, print statements, and other reports. It is not feasible to manually analyze all these files due to limited resources and the need to develop custom parsers to convert unstructured data into desirable file formats. Prior research focuses on rule-based and relationship-based parsing methods to parse unstructured data into structured file formats; these methods are labor-intensive and need large annotated datasets. This paper presents an unsupervised text processing pipeline that analyses such text files, removes extraneous information, identifies tabular components, and parses them into a structured file format. The proposed approach is resilient to changes in the data structure, does not require training data, and is domain-independent. We experiment and compare topic modeling and clustering approaches to verify the accuracy of the proposed technique. Our findings indicate that combining similarity and clustering algorithms to identify data components had better accuracy than topic modeling.
KW - Clustering
KW - Information Extraction
KW - Topic Modeling
KW - Unsupervised Data Mining
UR - http://www.scopus.com/inward/record.url?scp=85127516389&partnerID=8YFLogxK
U2 - 10.23919/ICACT53585.2022.9728823
DO - 10.23919/ICACT53585.2022.9728823
M3 - Conference contribution
AN - SCOPUS:85127516389
SN - 9791188428090
T3 - International Conference on Advanced Communication Technology, ICACT
SP - 312
EP - 318
BT - 24th International Conference on Advanced Communication Technology
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th International Conference on Advanced Communication Technology, ICACT 2022
Y2 - 13 February 2022 through 16 February 2022
ER -