Using Tetrad to Run a Causal Analysis on a Dataset and Generate a Causal Graph

Author: -明知山; Published: 2024-08-03

Introduction:

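This article was written at the invitation of @Tetrad因果推断. It explains in detail how to use the GUI version of Tetrad to load a dataset, run a causal analysis, and generate the final causal graph. The first part presents the problem statement and the software packages on offer; of these we chose the first one listed, Tetrad. The second part describes the dataset we chose, the key steps, and the resulting causal graphs. The third part explains step by step how to operate Tetrad to obtain the results. The first two parts are excerpts from my own report and are mainly in English; the rest was originally written in Chinese.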

Question:

Apply one causal discovery algorithm to a real-world problem. You need to specify the details of the problem, collect the data yourself or from a public website, briefly summarize the algorithm you use, and explain the results. You may use any causal discovery algorithm described in the following paper [Spirtes and Zhang, 2016], and use the software packages listed on Page 26 of the paper.

  • Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3:3, 2016. https://applied-informatics-j.springeropen.com/track/pdf/10.1186/s40535-016-0018-x

Page 26: The following software packages are available online:

  • The Tetrad project webpage (Tetrad implements a large number of causal discovery methods, including PC and its variants, FCI, and LiNGAM): http://www.phil.cmu.edu/tetrad/.

  • Kernel-based conditional independence test, Zhang et al. (2011): http://people.tuebingen.mpg.de/kzhang/KCI-test.zip.

  • LiNGAM and its extensions, Shimizu et al. (2006, 2011): https://sites.google.com/site/sshimizu06/lingam.

  • Fitting the nonlinear additive noise model, Hoyer et al. (2009): http://webdav.tuebingen.mpg.de/causality/additive-noise.tar.gz

  • Distinguishing cause from effect based on the PNL causal model, Zhang and Hyvärinen (2009, 2010): http://webdav.tuebingen.mpg.de/causality/CauseOrEffect_NICA.rar

  • Probabilistic latent variable models for distinguishing between cause and effect, Mooij et al. (2010): http://webdav.tuebingen.mpg.de/causality/nips2010-gpi-code.tar.gz

  • Information-geometric causal inference, Daniusis et al. (2010); Janzing et al. (2012): http://webdav.tuebingen.mpg.de/causality/igci.tar.gz

Solution:

According to research released by the World Health Organization (WHO) in 2020, the world’s biggest killer is ischaemic heart disease, responsible for 16% of the world’s total deaths. Since 2000, the largest increase in deaths has been for this disease, rising by more than 2 million to 8.9 million deaths in 2019.

Medical scholars have published numerous articles on factors associated with heart disease. In recent years, with the development of machine learning technology, studies have emerged that use machine learning methods to predict heart disease. Since heart disease may have a causal relationship with many factors, causal discovery algorithms are suitable for analyzing factors related to heart disease. The dataset used in this paper is from https://archive.ics.uci.edu/dataset/45/heart+disease; it contains 303 instances and 13 valid features. The specific parameters are listed in Table 2.

In this experiment, we utilized the Tetrad platform to implement causal discovery algorithms and generate causal graphs. Tetrad is a software platform for causal discovery and statistical analysis, providing a series of algorithms and tools to help researchers identify causal relationships between variables. We employed three algorithms in total: PC algorithm, FCI algorithm, and FAS algorithm.
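To make the search step less of a black box, here is a minimal sketch of the adjacency phase that PC-style algorithms build on (FAS, the fast adjacency search, is that phase run on its own). It is an illustration under simplifying assumptions, not Tetrad's implementation: the function name adjacency_search, the max_depth parameter, and the toy independence oracle are all hypothetical, and a real run would replace the oracle with a statistical conditional independence test on the data.

```python
from itertools import combinations


def adjacency_search(variables, independent, max_depth=2):
    """Start from a complete undirected graph and delete the edge x - y
    whenever x and y are independent given some subset of x's other
    neighbours. `independent(x, y, cond)` is supplied by the caller."""
    adjacent = {v: set(variables) - {v} for v in variables}
    sepsets = {}
    for depth in range(max_depth + 1):
        for x in variables:
            for y in list(adjacent[x]):
                others = adjacent[x] - {y}
                if len(others) < depth:
                    continue
                for cond in combinations(sorted(others), depth):
                    if independent(x, y, set(cond)):
                        adjacent[x].discard(y)
                        adjacent[y].discard(x)
                        sepsets[frozenset((x, y))] = set(cond)
                        break
    return adjacent, sepsets


def chain_oracle(x, y, cond):
    """Toy oracle for the chain A -> B -> C: A and C are independent given B."""
    return {x, y} == {"A", "C"} and "B" in cond


adjacent, sepsets = adjacency_search(["A", "B", "C"], chain_oracle)
print(adjacent)  # the A - C edge is removed, A - B and B - C remain
print(sepsets)   # records that B separated A and C
```

PC goes on to orient some of the remaining edges, and FCI additionally allows for latent confounders; neither step is shown here.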

First, we construct a network within Tetrad consisting of data, knowledge, and search blocks; its structure is depicted in Figure 5. The data block imports the heart disease dataset. The knowledge block adds prior knowledge: it fixes the order of causal relationships by assigning variables to tiers, as shown in Figure 6. Finally, the search block applies the chosen algorithm to obtain a graph representing causal relationships.

The graphs obtained from the search are shown in Figure 7; the outcomes of the three algorithms are similar. In the results, only fbs (fasting blood sugar) and restecg (resting electrocardiographic results) are not linked to any other feature, indicating no apparent causal relationship. This suggests that factors such as fasting blood sugar level do not have a significant causal relationship with heart disease. In contrast, ca (number of major vessels) and thal (type of thalassemia) are directly related to the severity of the disease, indicating that the number of major vessels and other physical signs are strongly associated with heart disease. Finally, basic characteristics such as age and gender have a close causal relationship with a large number of other features, which matches our intuition.

Appendix:

First, the dataset obtained in the previous part has to be preprocessed into a .csv file. I did this with a short Python script.
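A minimal sketch of such a preprocessing step follows. It assumes the processed Cleveland file from the UCI page: comma-separated, no header row, missing values written as ?. The column names are taken from the UCI documentation rather than from this post, and dropping incomplete rows is an assumption of the sketch (it would leave fewer than the 303 instances mentioned above); the function names preprocess and tidy are hypothetical.

```python
import csv
import io

# Column names as documented on the UCI page for the processed Cleveland file
# (an assumption: the post does not list them explicitly).
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]


def tidy(cell):
    """Write whole numbers such as '63.0' as '63'; leave other values alone."""
    try:
        number = float(cell)
    except ValueError:
        return cell
    return str(int(number)) if number.is_integer() else cell


def preprocess(raw_text, missing_marker="?"):
    """Return CSV text with a header row, skipping incomplete rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in csv.reader(io.StringIO(raw_text)):
        cells = [cell.strip() for cell in row]
        if len(cells) != len(COLUMNS) or missing_marker in cells:
            continue  # assumption: drop rows with missing or malformed values
        writer.writerow([tidy(cell) for cell in cells])
    return out.getvalue()


# Two made-up rows in the same layout; the second has a missing value.
SAMPLE = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,0.0,3.0,130.0,210.0,0.0,0.0,160.0,0.0,1.0,1.0,?,3.0,1\n"
)
print(preprocess(SAMPLE))
```

For a real run, read the text of the downloaded data file, pass it to preprocess, and write the returned string to a .csv file of your choice.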

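Next, download Tetrad. There are many tutorials online for this part, so only a few steps are mentioned here. The official site is hosted by CMU at https://www.cmu.edu/dietrich/philosophy/tetrad/. Once there, choose Use Tetrad and then the GUI version, as shown below:

Choose the GUI version of Tetrad, on the far left

Choose Get The Latest Executable, and on the download page pick the launch version:

Choose the fifth entry, the one ending in launch.jar

After downloading, if Java is installed locally and its path is set correctly, the file should in principle open directly. If the system cannot find a program to open it, start by searching online for the specific error your computer reports. In my case I reinstalled Java once and ran into some path problems, but I could still open the file directly from the command line, as shown below: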

In Windows PowerShell, change to the directory where the .jar file is saved and start it with java -jar followed by the file name
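The same check-and-launch step can also be scripted. The sketch below only assumes that the standard java -jar invocation applies; the file name tetrad-gui-launch.jar is a hypothetical placeholder for whatever .jar you actually downloaded, and the helper names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_launch_command(jar_path, java_executable="java"):
    """Return the argument list equivalent to: java -jar <jar_path>."""
    return [java_executable, "-jar", str(Path(jar_path))]


def launch(jar_path):
    """Start the .jar if Java is on PATH and the file exists."""
    java = shutil.which("java")
    if java is None:
        raise RuntimeError("java was not found on PATH; check the Java installation")
    if not Path(jar_path).is_file():
        raise FileNotFoundError(jar_path)
    return subprocess.Popen(build_launch_command(jar_path, java))


# Hypothetical file name; substitute the name of the file you downloaded.
print(build_launch_command("tetrad-gui-launch.jar"))
# launch("tetrad-gui-launch.jar")  # uncomment to actually start the GUI
```

If shutil.which returns None, the problem is the Java path rather than Tetrad itself, which is the situation described above.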

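Once it opens, you see the following page:

Blank page after startup

Click Data in the left-hand menu to create a data module, then add a Knowledge module. Click the arrow at the very top and draw a connection from Data to Knowledge. Double-click the Data module to open the data loading page, shown below:

Data loading page

Click File in the upper-left corner and choose Load Data to load the .csv file you just preprocessed.

The loading dialog is shown below. Choose the loading options according to the characteristics of your dataset; for my dataset, for example, the Data type has to be changed from Continuous to Discrete.

Data loading page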
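If you are unsure which Data type suits a column, counting its distinct values before loading can help. This is only a rough heuristic sketch, not something Tetrad requires: the threshold max_discrete_levels is a hypothetical parameter, and the sample rows are made up.

```python
import csv
import io


def distinct_value_counts(csv_text):
    """Count the distinct values in each column of CSV text with a header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    seen = {name: set() for name in names}
    for row in reader:
        for name in names:
            seen[name].add(row[name])
    return {name: len(values) for name, values in seen.items()}


def suggest_types(csv_text, max_discrete_levels=5):
    """Label a column Discrete when it has few distinct values (a heuristic)."""
    counts = distinct_value_counts(csv_text)
    return {name: "Discrete" if n <= max_discrete_levels else "Continuous"
            for name, n in counts.items()}


# Made-up rows: sex takes two values, chol takes three.
SAMPLE = "sex,chol\n1,233\n0,286\n1,229\n"
print(distinct_value_counts(SAMPLE))
print(suggest_types(SAMPLE, max_discrete_levels=2))
```

The counts are only a hint; the final choice depends on the algorithm and the test you plan to run in the Search module.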

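A successful load looks like this:

Loaded successfully

With the data loaded, open the Knowledge module, shown below:

Knowledge module before editing

Place the variables into tiers according to their causal order. In my dataset, for example, num is the final probability of having the disease, so it is an outcome and goes into the lowest tier, Tier2. sex and age are characteristics of the person and are not affected by the other factors, so they go into Tier0. Everything else goes into Tier1, where the variables may influence one another. The final result looks like this:

Knowledge setup complete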
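What the tiers mean for the search can be sketched in a few lines: a variable in a later tier is not allowed to cause a variable in an earlier tier. The tier contents below follow the description above; the names of the Tier1 variables are assumed from the UCI documentation, and the helper functions are hypothetical, written only to illustrate the constraint.

```python
# Tier assignment described in the text: sex and age first, num last.
# The names of the middle-tier variables are assumed from the UCI documentation.
TIERS = [
    ["age", "sex"],
    ["cp", "trestbps", "chol", "fbs", "restecg", "thalach",
     "exang", "oldpeak", "slope", "ca", "thal"],
    ["num"],
]

TIER_OF = {name: index for index, names in enumerate(TIERS) for name in names}


def is_forbidden(cause, effect):
    """An edge cause -> effect is ruled out when cause sits in a later tier."""
    return TIER_OF[cause] > TIER_OF[effect]


def allowed_directions(a, b):
    """List the orientations of an a - b edge that the tiers still permit."""
    return [(c, e) for c, e in ((a, b), (b, a)) if not is_forbidden(c, e)]


print(allowed_directions("age", "num"))  # only age -> num survives
print(allowed_directions("chol", "ca"))  # same tier: both directions remain
```

So the tiers never add edges; they only remove orientations that contradict the assumed order.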

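Next, add a Search module from the left-hand menu and point both Data and Knowledge at it; only then can the Search module be opened, as shown below:

Search module after setup

At this point the left side holds the filters for choosing an algorithm and the right side shows its description. After choosing an algorithm, click Set Parameter, as shown below:

Set parameters

After setting the parameters you need, click Run Search & Generate Graph to obtain the final causal graph. To try another algorithm, add a new Search module and connect it in the same way. On the final Graph page you can drag each variable's name to move it and arrange the layout you need.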


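That is all for this article; thank you for reading. Since I am not an expert in Tetrad, please consult other resources if you run into problems. Recommended reading: the Tetrad tutorial.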

