RESCAL、DistMult 与 HolE

Yu Bin原创2022年10月10日

KG
STAR-5

大约 9 分钟

在知识表示学习的 model 中，有一类是语义匹配模型（Semantic Matching Models），这种模型的含义借助文献 [1] 的解释如下：

Semantic matching models exploit similarity-based scoring functions. They measure plausibility of facts by matching latent semantics of entities and relations embodied in their vector space representations.

本文主要介绍其中的 RESCAL 及其各类扩展模型。

1. RESCAL

矩阵分解是得到低维向量表示的重要途径，因此，有研究者提出采用矩阵分解进行知识表示学习，这方面的代表方法是 RESCAL 模型^[2]，同时它也是一种双线性模型，发表的 paper 也是双线性模型的开山之作。

1.1 Model

在该模型中，知识库三元组构成了一个大的 tensor $\mathcal{X}$ ，如果三元组 (h, r, t) 存在，那么 $\mathcal{X}_{htr}=1$ ，否则为 0：

共有 n 个 entity，m 个 relation， $\mathcal{X}$ 的 size 是 $n \times n \times m$ 。
$\mathcal{X}_k$ refer to the k-th frontal slice of the tensor $\mathcal{X}$ ，即在 $R_k$ 对应的那个正方形。

RESCAL 将每个 entity 表示为一个 vector，将每个 relation 表示为一个 matrix，对每个 slice $\mathcal{X}_k$ 进行 rank- $r$ factorization：

\mathcal{X}_k \approx A R_k A^T, for \ k = 1,\dots, m

分解后的 A 是一个 $n \times r$ 的矩阵，每行对应一个 entity 的 latent-component representation。 $R_k$ 是一个非对称的 $r \times r$ 的 matrix，是第 k 个 relation 的 latent-component representation。

特别要提的是 $R_k$ 是非对称矩阵，这样可以建模非对称关系，在同一个实体作为头实体或尾实体时会得到不同的 latent component representation。

具体计算 $A$ 和 $R_k$ 是通过约束最小化问题来计算的：

\min_{A,R_k} {f(A, R_k)+g(A,R_k)}

$f$ 计算了分解前后的距离：

f(A,R_k)=\frac{1}{2}(\sum_k ||\mathcal{X}_k - AR_kA^T||^2_F)=\frac{1}{2}\sum_{i,j,k}(\mathcal{X}_{i,j,k} - a_i^TR_ka_j)^2

$g$ 是正则化项：

g(A,R_k)=\frac{1}{2} \lambda (||A||^2_F + \sum_K||R_k||^2_F)

1.2 score function

其实，纵观 RESCAL，它的 score function 就是下面这个双线性函数：

f_r(h,t)=h^TM_rt=\sum^{d-1}_{i=0}\sum^{d-1}_{j=0}[M_r]_{ij} \cdot [h]_i \cdot [t]_j

$h, t \in \mathbb{R}^d$ 是 entity 的 vector representation
$M_r \in \mathbb{R}^{d\times d}$ 是有关 relation r 的 matrix representation

This score captures pairwise interactions between all the components of $h$ and $t$ , which requires $O(d^2)$ parameters per relation. 计算 score 的方式如下图（假设 $h$ 与 $t$ 的 embedding 长度是 3）：

由上图可以看到，RESCAL 计算的 score 捕捉了成对的 $\textbf{h}$ 和 $\textbf{t}$ 之间各成分的 interaction。

1.3 代码实现

RESCAL 实现

import torch
import torch.nn as nn
import torch.nn.functional as F


class RESCAL(nn.Module):
    def __init__(self, ent_num, rel_num, device, dim=100, norm=1, alpha=0.001):
        super(RESCAL, self).__init__()
        self.ent_num = ent_num
        self.rel_num = rel_num
        self.device = device
        self.dim = dim
        self.norm = norm  # 使用L1范数还是L2范数
        self.alpha = alpha

        # 初始化实体向量
        self.ent_embeddings = nn.Embedding(self.ent_num, self.dim)
        torch.nn.init.xavier_uniform_(self.ent_embeddings.weight.data)
        self.ent_embeddings.weight.data = F.normalize(self.ent_embeddings.weight.data, 2, 1)

        # 初始化关系矩阵
        self.rel_embeddings = nn.Embedding(self.rel_num, self.dim * self.dim)
        torch.nn.init.xavier_uniform_(self.rel_embeddings.weight.data)
        self.rel_embeddings.weight.data = F.normalize(self.rel_embeddings.weight.data, 2, 1)

        # 损失函数
        self.criterion = nn.MSELoss()

    def get_ent_resps(self, ent_idx): #[batch]
        return self.ent_embeddings(ent_idx) # [batch, emb]

    # 越大越好，正例接近1，负例接近0
    def scoring(self, h_idx, r_idx, t_idx):
        h_embs = self.ent_embeddings(h_idx)  # [batch, emb]
        t_embs = self.ent_embeddings(t_idx)  # [batch, emb]
        r_mats = self.rel_embeddings(r_idx)  # [batch, emb * emb]

        norms = (torch.mean(h_embs ** 2) + torch.mean(t_embs ** 2) + torch.mean(r_mats ** 2)) / 3

        r_mats = r_mats.view(-1, self.dim, self.dim)
        t_embs = t_embs.view(-1, self.dim, 1)

        tr_embs = torch.matmul(r_mats, t_embs)
        tr_embs = tr_embs.view(-1, self.dim)

        return torch.sum(h_embs * tr_embs, -1), norms

    def forward(self, h_idx, r_idx, t_idx, labels):
        scores, norms = self.scoring(h_idx, r_idx, t_idx)

        tmp_loss = self.criterion(scores, labels.float())
        tmp_loss += self.alpha * norms

        return tmp_loss, scores

2. DistMult

本文提出了 neural-embedding 的通用框架，并把 NTN、TransE 等模型套在框架里进行对比；提出了将关系矩阵限制为对角矩阵的 DistMult；并用 embedding-based 方法挖掘逻辑规则。

2.1 Model

DistMult^[3] 通过限制双线性矩阵为对角阵来简化了 RESCAL。对每一个 relation $r$ ，它引入了一个 vector embedding $\textbf{r} \in \mathbb{R}^d$ 并且双线性矩阵 $M_r = diag(\textbf{r})$ ，因此其 score function 定义为如下：

f_r(h,t) = \textbf{h}^T diag(\textbf{r}) \textbf{t} = \sum^{d-1}_{i=0}[\textbf{r}]_i \cdot [\textbf{h}]_i \cdot [\textbf{t}]_i

计算 score 的方式如下图：

由上图可以看到，DistMult 计算的 score 只捕捉了成对的 $\textbf{h}$ 和 $\textbf{t}$ 之间在相同维度上的成分。同时 DistMult 将参数量减少到 $O(d)$ per relation，与 TransE 是相同的。然而，这种过于简单的模型由于对任何 h 和 t 都有 $\textbf{h}^T diag(\textbf{r}) \textbf{t} = \textbf{t}^T diag(\textbf{r}) \textbf{h}$ ，因此它只能处理对称关系。

paper 还给出了这个模型在规则抽取上的实验，可以参考学习。

2.2 代码实现

pykg2vec 上给出了 DistMult 的实现：

DistMult 实现

class DistMult(PointwiseModel):
    """
        `EMBEDDING ENTITIES AND RELATIONS FOR LEARNING AND INFERENCE IN KNOWLEDGE BASES`_ (DistMult) is a simpler model comparing with RESCAL in that it simplifies
        the weight matrix used in RESCAL to a diagonal matrix. The scoring
        function used DistMult can capture the pairwise interactions between
        the head and the tail entities. However, DistMult has limitation on modeling asymmetric relations.

        Args:
            config (object): Model configuration parameters.

        .. _EMBEDDING ENTITIES AND RELATIONS FOR LEARNING AND INFERENCE IN KNOWLEDGE BASES:
            https://arxiv.org/pdf/1412.6575.pdf

    """
    def __init__(self, **kwargs):
        super(DistMult, self).__init__(self.__class__.__name__.lower())
        param_list = ["tot_entity", "tot_relation", "hidden_size", "lmbda"]
        param_dict = self.load_params(param_list, kwargs)
        self.__dict__.update(param_dict)

        num_total_ent = self.tot_entity
        num_total_rel = self.tot_relation
        k = self.hidden_size

        self.ent_embeddings = NamedEmbedding("ent_embedding", num_total_ent, k)
        self.rel_embeddings = NamedEmbedding("rel_embedding", num_total_rel, k)
        nn.init.xavier_uniform_(self.ent_embeddings.weight)
        nn.init.xavier_uniform_(self.rel_embeddings.weight)

        self.parameter_list = [
            self.ent_embeddings,
            self.rel_embeddings,
        ]

        self.loss = Criterion.pointwise_logistic

    def embed(self, h, r, t):
        """Function to get the embedding value.

           Args:
               h (Tensor): Head entities ids.
               r (Tensor): Relation ids of the triple.
               t (Tensor): Tail entity ids of the triple.

            Returns:
                Tensors: Returns head, relation and tail embedding Tensors.
        """
        h_emb = self.ent_embeddings(h)
        r_emb = self.rel_embeddings(r)
        t_emb = self.ent_embeddings(t)

        return h_emb, r_emb, t_emb


    def forward(self, h, r, t):
        h_e, r_e, t_e = self.embed(h, r, t)
        return -torch.sum(h_e*r_e*t_e, -1)


    def get_reg(self, h, r, t, reg_type="F2"):
        h_e, r_e, t_e = self.embed(h, r, t)

        if reg_type.lower() == 'f2':
            regul_term = torch.mean(torch.sum(h_e ** 2, -1) + torch.sum(r_e ** 2, -1) + torch.sum(t_e ** 2, -1))
        elif reg_type.lower() == 'n3':
            regul_term = torch.mean(torch.sum(h_e ** 3, -1) + torch.sum(r_e ** 3, -1) + torch.sum(t_e ** 3, -1))
        else:
            raise NotImplementedError('Unknown regularizer type: %s' % reg_type)

        return self.lmbda*regul_term

3. HolE

3.1 compositional vector space models

论文首先提出了 compositional vector space models，它提供了一种优雅的方式来学习 KG 中 relation 的 characteristic function，并可以将这个学习任务转化为 supervised representation learning 问题。这种模型的形式如下：

Pr(\phi_p(s,o) = 1|\Theta)=\sigma(\eta_{spo})=\sigma(\textbf{r}_p^T(\textbf{e}_s \circ \textbf{e}_o))

$\textbf{r}_p \in \mathbb{R}^{d_r}, \textbf{e}_i \in \mathbb{R}^{d_e}$ 是 relation 和 entity 的 vector representation
$\sigma(x)$ 是 logistic function
$\Theta$ 代表所有 entity 和 relation 的 embedding
$\circ: \mathbb{R}^{d_e} \times \mathbb{R}^{d_e} \to \mathbb{R}^{d_p}$ 代表 compositional operator，它用于为 pair (s, o) 的 embedding $\textbf{e}_s, \textbf{e}_o$ 创建一个 composite vector representation

假设 $x_i$ 代表一个 triple， $y_i \in {\pm 1}$ 代表它的 label。给定一个 dataset $\mathcal{D} = {(x_i, y_i)}^m_{i=1}$ 包含正样本和负样本，然后我们想要学习的就是能够根据模型的 $Pr(\phi_p(s,o)=1|\Theta)$ 最好地解释 dataset $\mathcal{D}$ 的 vector representation of entities and relations $\Theta$ 。它可以通过最小化下面的 regularized logistic loss 来实现：

\min_\Theta \sum^m_{i=1} \log(1 + exp(-y_i \eta_i)) + \lambda ||\Theta||^2_2

对于关系数据，最小化 logistic 损失具有额外的优势，它可以帮助为复杂的关系模式找到低维的嵌入。

由于 KG 只存储正确三元组，这种情况我们可以使用 pairwise ranking loss 来 rank the probability of existing triples higher than the probability of non-existing ones：

\min_\Theta \sum_{i \in \mathcal{D}_+} \sum_{j\in \mathcal{D}_- } \max(0, \gamma+\sigma(\eta_j) - \sigma(\eta_i))

$\mathcal{D}_+, \mathcal{D}_-$ 代表 set of existing and non-existing triples
$\gamma \gt 0$ 代表 margin 的 width

RESCAL 套在上面的模型中就是 $\textbf{e}_s \circ \textbf{e}_o$ 取 tensor product 的情况。其他的情况可以参考原论文。

3.2 Model

HolE 就是将上面说的 compositional vector space models 的 $\textbf{e}_s \circ \textbf{e}_o$ 取 circular correlation 的情况。在HOLE中，不只是存储关联，而是学习能最好地解释所观察到数据的嵌入。

HolE^[4] combines the expressive power of RESCAL with the efficiency and simplicity of DistMult.

HolE 将每个 entity 和 relation 都表示为一个在 $\mathbb{R}^d$ 空间的 vector。给定一个 fact $(h,r,t)$ ，那 entity representation 首先会经过一个 circular correlation 的计算：

[\textbf{h} \star \textbf{t}]_k = \sum^{d-1}_{i=0}[\textbf{h}]_i \cdot [\textbf{t}]_{(k+i)\mod d}

这个计算得到的 compositional vector 然后去与 relation representation 进行匹配得到 fact 的 score：

f_r(h,t) = \sigma(\textbf{r}^T (\textbf{h} \star \textbf{t})) = \sigma(\sum^{d-1}_{i=0}[\textbf{r}]_i \sum^{d-1}_{i=0}[\textbf{h}]_i \cdot [\textbf{t}]_{(k+i)\mod d})

为了学习到 entity 和 relation 的 representation，我们可以使用 SGD 来优化：

\theta^{t+1} \leftarrow \theta^t - \mu \frac{\partial L}{\partial f_{spo}} \frac{\partial f_{spo}}{\partial \theta}

以优化 $\textbf{e}_o$ 为例，SGD 所做的是：

\textbf{e}_o^{t+1} \leftarrow \textbf{e}_o^t - \mu \frac{\partial L}{\partial f} \frac{\partial f}{\partial \eta} (\textbf{r}_p * \textbf{e}_s)

其他的公式可以参考原论文。

3.3 Circular Correlation

Circular Correlation 可以看成是 tensor product 的一种 compression，是一种 $\star: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ 的运算，图示如下：

在具体实现时，circular correlation 可以通过如下的方式进行计算：

a \star b = \mathcal{F}^{-1} (\overline{\mathcal{F(a)}} \otimes \mathcal{F}(b))

$\mathcal{F}$ 和 $\mathcal{F}^{-1}$ 代表 fast Fourier transform（FFT）和它的逆，FFT 的计算复杂度时 $O(d \log d)$
$\overline{x}$ 代表 x 的复数共轭
$\otimes$ 代表 Hadamard (entrywise) product

Circular correlation 在成对的 h 和 t 的 interaction 中做了 compression，它只要求 $O(d)$ parameters per relation，这比 RESCAL 更加有效率。同时，circular correlation 运算不是可交换的，即 $a \star b \neq b \star a$ ，所以 HolE 能够像 RESCAL 那样建模非对称关系。

3.4 代码实现

HolE 实现

class HoLE(PairwiseModel):
    """
        `Holographic Embeddings of Knowledge Graphs`_. (HoLE) employs the circular correlation to create composition correlations. It
        is able to represent and capture the interactions betweek entities and relations
        while being efficient to compute, easier to train and scalable to large dataset.
        Args:
            config (object): Model configuration parameters.
        .. _Holographic Embeddings of Knowledge Graphs:
            https://arxiv.org/pdf/1510.04935.pdf
    """

    def __init__(self, **kwargs):
        super(HoLE, self).__init__(self.__class__.__name__.lower())
        param_list = ["tot_entity", "tot_relation", "hidden_size", "cmax", "cmin"]
        param_dict = self.load_params(param_list, kwargs)
        self.__dict__.update(param_dict)

        self.ent_embeddings = NamedEmbedding("ent_embedding", self.tot_entity, self.hidden_size)
        self.rel_embeddings = NamedEmbedding("rel_embedding", self.tot_relation, self.hidden_size)
        nn.init.xavier_uniform_(self.ent_embeddings.weight)
        nn.init.xavier_uniform_(self.rel_embeddings.weight)

        self.parameter_list = [
            self.ent_embeddings,
            self.rel_embeddings,
        ]

        self.loss = Criterion.pairwise_hinge

    def forward(self, h, r, t):
        h_e, r_e, t_e = self.embed(h, r, t)
        r_e = F.normalize(r_e, p=2, dim=-1)
        h_e = torch.stack((h_e, torch.zeros_like(h_e)), -1)
        t_e = torch.stack((t_e, torch.zeros_like(t_e)), -1)
        e, _ = torch.unbind(torch.ifft(torch.conj(torch.fft(h_e, 1)) * torch.fft(t_e, 1), 1), -1)
        return -F.sigmoid(torch.sum(r_e * e, 1))

    def embed(self, h, r, t):
        """
            Function to get the embedding value.
            Args:
                h (Tensor): Head entities ids.
                r  (Tensor): Relation ids of the triple.
                t (Tensor): Tail entity ids of the triple.
            Returns:
                tuple: Returns a 3-tuple of head, relation and tail embedding tensors.
        """
        emb_h = self.ent_embeddings(h)
        emb_r = self.rel_embeddings(r)
        emb_t = self.ent_embeddings(t)
        return emb_h, emb_r, emb_t

参考文献
[1] Wang, Q., Mao, Z., Wang, B. & Guo, L. Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering 29, 2724–2743 (2017).
[2] Nickel, M., Tresp, V. & Kriegel, H.-P. A Three-Way Model for Collective Learning on Multi-Relational Data. in 809–816 (2011).
[3] Yang, B., Yih, W., He, X., Gao, J. & Deng, L. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. Preprint at https://doi.org/10.48550/arXiv.1412.6575open in new window (2015).
[4] Nickel, M., Rosasco, L. & Poggio, T. Holographic Embeddings of Knowledge Graphs. Preprint at https://doi.org/10.48550/arXiv.1510.04935open in new window (2015).

RESCAL、DistMult 与 HolE

# 1. RESCAL

# 1.1 Model

# 1.2 score function

# 1.3 代码实现

# 2. DistMult

# 2.1 Model

# 2.2 代码实现

# 3. HolE

# 3.1 compositional vector space models

# 3.2 Model

# 3.3 Circular Correlation

# 3.4 代码实现