```python
TransformerLayerSubmodules(
    self_attention=ModuleSpec(
        module=SelfAttention,
        params={"attn_mask_type": AttnMaskType.causal},
        submodules=SelfAttentionSubmodules(
            linear_qkv=TEColumnParallelLinear,
            core_attention=TEDotProductAttention,
            linear_proj=TERowParallelLinear,
            q_layernorm=IdentityOp,
            k_layernorm=IdentityOp,
        ),
    ),
    input_layernorm=TENorm,
    self_attn_bda=get_bias_dropout_add,
    pre_mlp_layernorm=TENorm,
    mlp=mlp,
    mlp_bda=get_bias_dropout_add,
),
```

```python
TransformerLayerSubmodules(
    self_attention=ModuleSpec(
        module=SelfAttention,
        params={"attn_mask_type": AttnMaskType.no_mask},
        submodules=SelfAttentionSubmodules(
            linear_qkv=TEColumnParallelLinear,
            core_attention=TEDotProductAttention,
            linear_proj=TERowParallelLinear,
            q_layernorm=IdentityOp,
            k_layernorm=IdentityOp,
        ),
    ),
    input_layernorm=TENorm,
    self_attn_bda=get_bias_dropout_add,
    pre_mlp_layernorm=TENorm,
    mlp=mlp,
    mlp_bda=get_bias_dropout_add,
),
```

Key points:
・Inside a TransformerBlock, the layer specs are (at this level) largely the same
・The difference is only whether the attention mask type is causal or not

cf. megatron/core/models/gpt/gpt_layer_specs.py
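
As a minimal sketch (assuming Megatron-Core with Transformer Engine is installed), the causal variant of such a spec can be obtained from the helper in gpt_layer_specs.py and inspected; note that the exact submodule classes (e.g. a fused layernorm + linear vs. separate TENorm and TEColumnParallelLinear) may differ between Megatron-Core versions:

```python
# Minimal sketch: build the causal GPT layer spec from gpt_layer_specs.py and
# inspect the pieces that correspond to the structure shown above.
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)

layer_spec = get_gpt_layer_with_transformer_engine_spec()

# ModuleSpec wrapping TransformerLayer and its TransformerLayerSubmodules
print(layer_spec.module)
# {'attn_mask_type': AttnMaskType.causal} for the GPT (decoder) spec
print(layer_spec.submodules.self_attention.params)
# The QKV projection class used inside SelfAttentionSubmodules
print(layer_spec.submodules.self_attention.submodules.linear_qkv)
```

The resulting spec object is what gets passed to the model (e.g. as `transformer_layer_spec` for GPTModel), which then instantiates the concrete submodules from it at build time.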