Contents

1. HFE
- 1.1. Feature engineering phase
- 1.2. Correlation-based filtering phase
- 1.3. Information Gain ($IG$) based filtering phase
- 1.4. $IG$-based leaf filtering phase
2. DOI

1. HFE

Hierarchical Feature Engineering (HFE) consists of four phases:

- Feature engineering phase
- Correlation-based filtering phase
- Information Gain based filtering phase
- IG-based leaf filtering phase

1.1. Feature engineering phase

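The tree in the figure above has 8 levels. The first seven are the biological taxonomic ranks: Kingdom, Phylum, Class, Order, Family, Genus, and Species. The paper adds one more level at the very bottom: the OTU level.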

The original feature vectors in the dataset (one row per sample, one column per OTU) are written as:

$$
(o^i_j)_{n \times m} =
\begin{bmatrix}
o^1_1 & o^1_2 & \dots & o^1_m \\
o^2_1 & o^2_2 & \dots & o^2_m \\
\dots & \dots & \dots & \dots \\
o^n_1 & o^n_2 & \dots & o^n_m
\end{bmatrix},
\quad i \in [1, 2, \dots, n],\ j \in [1, 2, \dots, m].
$$

Each higher taxonomic unit $i_k$ is treated as a potential feature. Its relative abundance is the sum of the relative abundances of its children $C(i_k)$, accumulated in a bottom-up traversal of the tree:

$$
o_{i_k} = \sum_{c \in C(i_k)} o_c.
$$

A non-leaf node of the tree is a potential higher-level feature. We denote it $i_k$ and its set of child nodes $C(i_k)$. The relative abundance $o_{i_k}$ of $i_k$ is then computed as:

$$
o_{i_k} =
\begin{bmatrix}
o^1_{i_k} \\ o^2_{i_k} \\ \dots \\ o^n_{i_k}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{c \in C(i_k)} o^1_c \\
\sum_{c \in C(i_k)} o^2_c \\
\dots \\
\sum_{c \in C(i_k)} o^n_c
\end{bmatrix}.
$$

All higher-level potential features together form the feature set of the internal nodes:

$$
\begin{bmatrix}
o^1_{i_1} & o^1_{i_2} & \dots & o^1_{i_{\overline{m}}} \\
o^2_{i_1} & o^2_{i_2} & \dots & o^2_{i_{\overline{m}}} \\
\dots & \dots & \dots & \dots \\
o^n_{i_1} & o^n_{i_2} & \dots & o^n_{i_{\overline{m}}}
\end{bmatrix}
$$

The original features and the features derived from the internal nodes together make up the extended feature matrix:

$$
F =
\begin{bmatrix}
o^1_1 & o^1_2 & \dots & o^1_m & o^1_{i_1} & o^1_{i_2} & \dots & o^1_{i_{\overline{m}}} \\
o^2_1 & o^2_2 & \dots & o^2_m & o^2_{i_1} & o^2_{i_2} & \dots & o^2_{i_{\overline{m}}} \\
\dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\
o^n_1 & o^n_2 & \dots & o^n_m & o^n_{i_1} & o^n_{i_2} & \dots & o^n_{i_{\overline{m}}}
\end{bmatrix}
$$
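The bottom-up aggregation described in this phase can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `aggregate_abundances`, the dictionary-based tree layout, and the toy taxonomy with its abundance values are all assumptions made for the example.

```python
from typing import Dict, List


def aggregate_abundances(
    children: Dict[str, List[str]],
    otu_abundance: Dict[str, List[float]],
    root: str,
) -> Dict[str, List[float]]:
    """Return one abundance vector per node (OTU leaves and internal nodes)."""
    features = dict(otu_abundance)  # start from the original OTU columns

    def visit(node: str) -> List[float]:
        if node in features:  # a leaf (OTU) or an already computed node
            return features[node]
        child_vectors = [visit(c) for c in children[node]]
        # o_{i_k} = sum of o_c over c in C(i_k), taken sample by sample
        features[node] = [sum(vals) for vals in zip(*child_vectors)]
        return features[node]

    visit(root)
    return features


# Hypothetical toy taxonomy: one genus, two species, three OTUs, two samples.
children = {
    "genus_A": ["species_A1", "species_A2"],
    "species_A1": ["otu_1", "otu_2"],
    "species_A2": ["otu_3"],
}
otu_abundance = {
    "otu_1": [0.10, 0.30],
    "otu_2": [0.20, 0.10],
    "otu_3": [0.40, 0.20],
}
F = aggregate_abundances(children, otu_abundance, "genus_A")
print(F["genus_A"])
```

The returned dictionary holds the columns of the extended matrix $F$: the original OTU vectors plus one derived vector per internal node.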

1.2. Correlation-based filtering phase

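For each parent-child pair in the hierarchy, the Pearson correlation coefficient $\rho$ is computed from the abundance vectors of the parent node and the child node. If $\rho$ is greater than a predefined threshold $\theta_{p}$, the child node is removed; otherwise the child node is retained as part of the hierarchy.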

$$
\text{operation} =
\begin{cases}
\text{remove}, & \text{if } \rho > \theta_{p}; \\
\text{retain}, & \text{otherwise.}
\end{cases}
$$

For any non-leaf node $i_k$ with child set $C(i_k)$, that is, $\forall i_k, c \in C(i_k)$:

$$
\text{operation} =
\begin{cases}
\text{remove } c, & \text{if } \rho(i_k, c) > \theta_{p}; \\
\text{retain } c, & \text{otherwise.}
\end{cases}
$$
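As a rough illustration of this rule, the sketch below compares every child with its direct parent and collects the children to remove. The helper names `pearson` and `correlation_filter`, the threshold value 0.7, and the toy abundance values are assumptions made for the example; treating a constant vector as uncorrelated is also an assumption, since the text does not say how that case is handled.

```python
import math
from typing import Dict, List, Set


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return cov / (norm_x * norm_y)


def correlation_filter(
    children: Dict[str, List[str]],
    features: Dict[str, List[float]],
    theta_p: float,
) -> Set[str]:
    """Return the children c with rho(i_k, c) > theta_p (the ones to remove)."""
    removed = set()
    for parent, kids in children.items():
        for c in kids:
            if pearson(features[parent], features[c]) > theta_p:
                removed.add(c)
    return removed


# Hypothetical toy data: the parent vector is the sum of its two children.
children = {"genus_A": ["otu_a", "otu_b"]}
features = {
    "otu_a": [10, 20, 30, 40],
    "otu_b": [1, 0, 2, 1],
    "genus_A": [11, 20, 32, 41],
}
removed = correlation_filter(children, features, theta_p=0.7)
print(removed)
```

Here `otu_a` follows its parent almost exactly, so it carries little extra information and is removed, while `otu_b` is only weakly correlated with the parent and stays in the hierarchy.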

1.3. Information Gain ($IG$) based filtering phase

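Using the nodes retained by the previous phase, build all paths from a leaf to the root (that is, the lineage of each OTU).

For each path, compute the $IG$ of every node on the path with respect to the labels/classes $L$.

The average $IG$ over the path serves as the threshold $\theta_{ig}$: nodes whose $IG$ is below it, or is zero, are discarded.

Note that leaf nodes on incomplete paths do not take part in this step; they are handled in Section 1.4.

In formulas: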

$$
\theta_{ig} = \frac{\sum_{p \in P} IG(o_p, L)}{\left| P \right|}
$$

$\forall c$ in a complete leaf-root path $P$ in $T$,

$$
\text{operation} =
\begin{cases}
\text{remove } c, & \text{if } IG(o_c, L) < \theta_{ig}; \\
\text{retain } c, & \text{otherwise.}
\end{cases}
$$
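The per-path thresholding can be sketched as follows. Several things here are assumptions rather than details from the paper: the names `entropy`, `info_gain` and `ig_path_filter`, the choice to binarize each abundance vector at its mean before computing $IG$ (the text does not say how continuous abundances are discretized), the rule that a node is dropped as soon as it fails on any path it belongs to, and the toy data.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total) for k in Counter(labels).values())


def info_gain(values: List[float], labels: Sequence[str]) -> float:
    """IG(o, L) = H(L) - H(L | o), with o binarized at its mean (assumption)."""
    cut = sum(values) / len(values)
    groups = {True: [], False: []}
    for v, y in zip(values, labels):
        groups[v > cut].append(y)
    conditional = sum(
        len(g) / len(labels) * entropy(g) for g in groups.values() if g
    )
    return entropy(labels) - conditional


def ig_path_filter(
    paths: List[List[str]],
    features: Dict[str, List[float]],
    labels: Sequence[str],
) -> Set[str]:
    """Return nodes whose IG is zero or below the average IG of a path."""
    removed = set()
    for path in paths:
        gains = {node: info_gain(features[node], labels) for node in path}
        theta_ig = sum(gains.values()) / len(path)
        for node, gain in gains.items():
            if gain < theta_ig or gain == 0:
                removed.add(node)
    return removed


# Hypothetical toy data: four samples, one complete leaf-to-root path.
labels = ["sick", "sick", "healthy", "healthy"]
features = {
    "otu_1": [0.30, 0.25, 0.05, 0.10],
    "species_A": [0.40, 0.28, 0.45, 0.15],
    "genus_A": [0.60, 0.55, 0.50, 0.15],
}
removed = ig_path_filter([["otu_1", "species_A", "genus_A"]], features, labels)
print(removed)
```

In this toy path only `otu_1` separates the two classes perfectly, so the path average lies above the $IG$ of the other two nodes and both are discarded.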

1.4. $IG$-based leaf filtering phase

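This phase handles the OTUs with incomplete taxonomic information (incomplete paths). Such an OTU is retained if its $IG$ is greater than the global average $IG$ of all nodes on the complete paths from Section 1.3; otherwise it is discarded.

In formulas: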

$$
\theta_{t} = \frac{\sum_{c \in T} IG(o_c, L)}{\left| T \right|}.
$$

For every leaf $c$ on an incomplete path,

$$
\text{operation} =
\begin{cases}
\text{remove } c, & \text{if } IG(o_c, L) < \theta_{t}; \\
\text{retain } c, & \text{otherwise.}
\end{cases}
$$
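A minimal sketch of this last step, assuming the $IG$ scores have already been computed. The function name `leaf_filter`, the two-dictionary interface, and all scores below are hypothetical values chosen for illustration.

```python
from typing import Dict, Set


def leaf_filter(
    ig_complete: Dict[str, float],
    ig_incomplete: Dict[str, float],
) -> Set[str]:
    """Return the incomplete-path leaves whose IG is below the global mean."""
    # theta_t: global average IG over the nodes on complete paths
    theta_t = sum(ig_complete.values()) / len(ig_complete)
    return {leaf for leaf, gain in ig_incomplete.items() if gain < theta_t}


# Hypothetical IG scores.
ig_complete = {"otu_1": 1.0, "genus_A": 0.3, "family_B": 0.5}
ig_incomplete = {"otu_7": 0.8, "otu_9": 0.1}
removed = leaf_filter(ig_complete, ig_incomplete)
print(removed)
```

With these numbers $\theta_{t}$ is 0.6, so `otu_9` is discarded and `otu_7` is kept.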

2. DOI

https://doi.org/10.1186/s12859-018-2205-3

Article URL: https://blog.csdn.net/PursueLuo/article/details/108754772

Paper reading report: Taxonomy-aware feature engineering for microbiome classification, Mai Oudah and Andreas Hen
