(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print):2394-5443    ISSN (Online):2394-7454
Volume-9 Issue-90 May-2022
Paper Title : ETL for disease indicators using brute force rule-based NLP algorithm and metadata exploration
Author Name : Ifra Altaf, Muheet Ahmed Butt and Majid Zaman
Abstract :

Because data-driven decisions rest on facts, data collection can lay a foundation for decision-making in any industry. With the decision support that digital medical records provide, doctors can reach a precise diagnosis and an adequate treatment by fitting together fundamentally different disease symptoms. This data manuscript describes the preparation of a diabetes dataset from liver and lipid profile panels. The data were collected from a medical center in Srinagar, Jammu and Kashmir, in the form of unstructured reports. The unstructured data are extracted on the basis of the metadata of the source document; the required field values of the different tests are then extracted from the intermediate file using brute force pattern-matching heuristics and integrated to populate a relational database. The database can be used for further descriptive, exploratory, and predictive data analysis, and can help in diagnosing and predicting diabetes from liver and lipid panel markers. This paper presents a novel concept, predicting and detecting one disease from the markers of other related diseases, as a way to fill the theoretical research gap. The detection rate achieved by the proposed brute force rule-based natural language processing (NLP) algorithm is 98.44%.
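The extract-and-load step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, label variants, sample report text, and table layout are all hypothetical, and the sketch starts from an intermediate text file rather than from the source document and its metadata. Each field is matched by trying every known label variant in turn (the brute force part) and taking the first number that follows on the same line.

```python
import re
import sqlite3

# Hypothetical fields and label variants; the paper's actual rules are not listed here.
FIELD_PATTERNS = {
    "bilirubin_total": [r"Total\s+Bilirubin", r"Bilirubin\s+Total"],
    "alt": [r"ALT", r"SGPT"],
    "hdl": [r"HDL"],
    "triglycerides": [r"Triglycerides?"],
}

# After a label: skip non-digit characters on the same line, then capture a number.
GAP_THEN_NUMBER = r"[^\d\n]*?(\d+(?:\.\d+)?)"

# Made-up intermediate text standing in for one scraped lab report.
SAMPLE_REPORT = """\
LIVER FUNCTION TEST
Total Bilirubin : 0.9 mg/dL
ALT (SGPT)        42 U/L
LIPID PROFILE
HDL Cholesterol - 38 mg/dL
"""


def extract_fields(text):
    """Try every label variant for every field; keep the first value found."""
    row = {}
    for field, labels in FIELD_PATTERNS.items():
        row[field] = None  # stays None when no rule matches
        for label in labels:
            match = re.search(rf"\b{label}\b{GAP_THEN_NUMBER}", text, re.IGNORECASE)
            if match:
                row[field] = float(match.group(1))
                break
    return row


def load(conn, report_id, row):
    """Insert one extracted row into a relational table (assumed schema)."""
    columns = ", ".join(f"{name} REAL" for name in FIELD_PATTERNS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS panel (report_id TEXT PRIMARY KEY, {columns})"
    )
    placeholders = ", ".join("?" for _ in range(len(row) + 1))
    conn.execute(f"INSERT INTO panel VALUES ({placeholders})", (report_id, *row.values()))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    extracted = extract_fields(SAMPLE_REPORT)
    load(conn, "report-001", extracted)
    print(extracted)
```

A field with no matching rule is stored as NULL instead of being guessed, so missing tests remain visible in later analysis.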

Keywords : PDF scraping, Unstructured data, Diagnostic lab reports, Heuristics, Brute force, Natural language processing, Metadata, Information extraction.
Cite this article : Altaf I, Butt MA, Zaman M. ETL for disease indicators using brute force rule-based NLP algorithm and metadata exploration. International Journal of Advanced Technology and Engineering Exploration. 2022; 9(90):644-662. DOI:10.19101/IJATEE.2021.875069.