Causal Analysis of a Dataset and Causal Graph Generation with Tetrad

Author: -明知山; Published: 2024-08-03

Introduction：

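At the invitation of @Tetrad因果推断, this post explains in detail how to use the GUI version of Tetrad to load a dataset, run a causal analysis, and generate the final causal graph. The first part states the assignment and lists the software packages it offers; we chose the first one mentioned, Tetrad. The second part describes the dataset we selected, the key steps, and the resulting causal graph. The third part walks through how to operate Tetrad at each step to obtain the results. The first two parts are excerpts from my own report, which was written mainly in English.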

Question：

Apply one causal discovery algorithm on a real world problem. You need to specify the details of the problem, collect the data by yourself or from a public website, briefly summarize what algorithm you use, and explain the results. You may use any causal discovery algorithm described in the following paper [Spirtes et al., 2016], and use the software packages in Page 26 of the paper.

Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3:3, 2016. https://applied-informatics-j.springeropen.com/track/pdf/10.1186/s40535-016-0018-x

Page 26: The following software packages are available online:

The Tetrad project webpage (Tetrad implements a large number of causal discovery methods, including PC and its variants, FCI, and LiNGAM): http://www.phil.cmu.edu/tetrad/.
Kernel-based conditional independence test, Zhang et al. (2011): http://people.tuebingen.mpg.de/kzhang/KCI-test.zip.
LiNGAM and its extensions, Shimizu et al. (2006, 2011): https://sites.google.com/site/sshimizu06/lingam.
Fitting the nonlinear additive noise model, Hoyer et al. (2009): http://webdav.tuebingen.mpg.de/causality/additive-noise.tar.gz
Distinguishing cause from effect based on the PNL causal model, Zhang and Hyvärinen (2009, 2010): http://webdav.tuebingen.mpg.de/causality/CauseOrEffect_NICA.rar
Probabilistic latent variable models for distinguishing between cause and effect, Mooij et al. (2010): http://webdav.tuebingen.mpg.de/causality/nips2010-gpi-code.tar.gz
Information-geometric causal inference, Daniusis et al. (2010); Janzing et al. (2012): http://webdav.tuebingen.mpg.de/causality/igci.tar.gz

Solution：

According to a report released in 2020 by the World Health Organization (WHO), the world’s biggest killer is ischaemic heart disease, responsible for 16% of the world’s total deaths. Since 2000, this disease has also seen the largest increase in deaths, rising by more than 2 million to 8.9 million deaths in 2019.

Medical researchers have published numerous articles on factors associated with heart disease. In recent years, with the development of machine learning, studies have emerged that use machine learning methods to predict heart disease. Because heart disease may be causally related to many factors, and not merely correlated with them, causal discovery algorithms are a suitable tool for analyzing these factors. The dataset used in this report is the UCI Heart Disease dataset (https://archive.ics.uci.edu/dataset/45/heart+disease), which contains 303 instances and 13 valid features. The individual features are listed in Table 2.
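
Before the data can be loaded into Tetrad, the raw file usually needs a header row and a decision about missing values. The following is a minimal sketch of that preparation step, not the procedure used in the report. It assumes the file `processed.cleveland.data` from the UCI archive, which is comma-separated, has no header, and marks missing values with `?`. The column names follow the UCI documentation; the helper names `load_rows` and `to_tab_delimited` and the three sample rows are illustrative assumptions.

```python
import csv
import io

# Column names as given in the UCI documentation; "num" is the diagnosis.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]


def load_rows(text, missing="?"):
    """Parse header-less CSV text and drop rows that contain a missing value."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        record = [field.strip() for field in record]
        if len(record) != len(COLUMNS):
            continue  # skip blank or malformed lines
        if missing in record:
            continue  # listwise deletion: one simple way to handle "?"
        rows.append({name: float(value) for name, value in zip(COLUMNS, record)})
    return rows


def to_tab_delimited(rows):
    """Render rows as tab-delimited text with variable names in the first line."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(f"{row[name]:g}" for name in COLUMNS))
    return "\n".join(lines)


# Illustrative rows in the same layout as the UCI file (assumed, not real output).
SAMPLE = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,0.0,?,0
"""

clean_rows = load_rows(SAMPLE)
tetrad_text = to_tab_delimited(clean_rows)
```

Writing `tetrad_text` to a `.txt` file gives a table with variable names in the first row, which can then be selected in the data block's file loader with tab as the delimiter. Dropping incomplete rows is only one option; keeping them and declaring `?` as the missing-value marker in the loader is another.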

In this experiment, we used the Tetrad platform to run causal discovery algorithms and generate causal graphs. Tetrad is a software platform for causal discovery and statistical analysis that provides a range of algorithms and tools to help researchers identify causal relationships between variables. We employed three algorithms in total: the PC algorithm, the FCI algorithm, and the FAS (Fast Adjacency Search) algorithm.
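
To give a feel for what these searches do internally, here is a minimal sketch of the adjacency (skeleton) phase that PC builds on and that FAS runs on its own. It is not Tetrad's implementation. It assumes continuous data and a Fisher Z test of partial correlation, whereas the heart disease data contain discrete variables and Tetrad offers other tests for that case. All function names are hypothetical.

```python
import itertools
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of a list of equal-length numeric rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            for j in range(p)] for i in range(p)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]


def partial_corr(corr, i, j, cond):
    """Partial correlation of i and j given cond, via the recursive formula."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def independent(corr, n, i, j, cond, alpha):
    """Fisher Z test: True when independence is NOT rejected at level alpha."""
    r = max(-0.999999, min(0.999999, partial_corr(corr, i, j, cond)))
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))
    return p_value > alpha


def adjacency_search(rows, alpha=0.05):
    """Return (undirected edges, separating sets) from the skeleton phase."""
    n, p = len(rows), len(rows[0])
    corr = correlation_matrix(rows)
    adj = {i: set(range(p)) - {i} for i in range(p)}  # start fully connected
    sepsets = {}
    depth = 0
    # Grow the conditioning-set size until no node has enough neighbours left.
    while any(len(adj[i]) - 1 >= depth for i in adj):
        for i in range(p):
            for j in sorted(adj[i]):
                others = sorted(adj[i] - {j})
                for cond in itertools.combinations(others, depth):
                    if independent(corr, n, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepsets[frozenset((i, j))] = set(cond)
                        break
        depth += 1
    edges = {frozenset((i, j)) for i in adj for j in adj[i]}
    return edges, sepsets


# Synthetic chain X -> Y -> Z built from mutually orthogonal +/-1 patterns.
x = [1, 1, 1, 1, -1, -1, -1, -1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, -1, 1, -1, 1, -1, 1, -1]
chain_rows = [[a, a + b, a + b + c] for a, b, c in zip(x, e1, e2)] * 8
chain_edges, chain_sepsets = adjacency_search(chain_rows)
```

On the synthetic chain, the search keeps the X-Y and Y-Z adjacencies and removes X-Z, recording Y as the separating set. PC then uses such separating sets to orient edges, while FCI adds further steps to allow for latent confounders.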

First, we construct a network within Tetrad consisting of a data block, a knowledge block, and a search block. The network structure is depicted in Figure 5. The data block imports the heart disease dataset. The knowledge block adds prior knowledge: it defines the order of causal relationships by assigning the variables to hierarchical tiers, as shown in Figure 6. Finally, the search block applies the different algorithms to obtain a graph representing the causal relationships.
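
The effect of tiered prior knowledge can be sketched in a few lines: a variable in a later tier is not allowed to cause a variable in an earlier tier. The code below only illustrates that rule and is not Tetrad's API. The tier assignment is hypothetical (the tiers actually used are those in Figure 6), and the function names are assumptions.

```python
# Hypothetical tiers for illustration only; the real assignment is in Figure 6.
TIERS = [
    ["age", "sex"],                                         # tier 0: background
    ["trestbps", "chol", "fbs", "thalach", "ca", "thal"],   # tier 1: measurements
    ["num"],                                                # tier 2: diagnosis
]


def tier_of(tiers):
    """Map each variable name to the index of the tier that contains it."""
    return {var: t for t, group in enumerate(tiers) for var in group}


def is_forbidden(tiers, cause, effect):
    """An edge cause -> effect is forbidden if cause sits in a later tier."""
    index = tier_of(tiers)
    return index[cause] > index[effect]


def orient(tiers, a, b):
    """Orient the undirected edge a - b if the tiers rule out one direction.

    Returns a (cause, effect) tuple, or None when both directions are allowed.
    """
    if is_forbidden(tiers, a, b):
        return (b, a)
    if is_forbidden(tiers, b, a):
        return (a, b)
    return None
```

With these example tiers, `orient(TIERS, "num", "age")` resolves to `("age", "num")`, because the diagnosis cannot cause age, while `orient(TIERS, "chol", "fbs")` returns `None` and leaves the direction to the search algorithm.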

The graphs obtained from the search are shown in Figure 7, and the outcomes of the three algorithms are similar. In the results, only fbs (fasting blood sugar) and restecg (resting electrocardiographic results) are not connected to any other feature, indicating no apparent causal relationship. This suggests that, in this dataset, factors such as fasting blood sugar level do not have a significant causal relationship with heart disease. In addition, ca (number of major vessels) and thal (type of thalassemia) are directly connected to the severity of the disease, indicating that the number of major vessels and other physical signs are closely related to heart disease. Finally, basic characteristics such as age and sex are causally linked to a large number of other features, a result that aligns with our intuition.