CN115116475A

CN115116475A - A method and device for automatic detection of speech depression based on time-delay neural network

Info

Publication number: CN115116475A
Application number: CN202210663429.3A
Authority: CN
Inventors: 李雅; 刘勇; 王栋
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-06-13
Filing date: 2022-06-13
Publication date: 2022-09-27
Anticipated expiration: 2042-06-13
Also published as: CN115116475B

Abstract

The invention provides a method and a device for automatic speech-based depression detection based on a time-delay neural network. The method comprises: obtaining an initial speech signal, dividing it into a plurality of speech segments, each comprising at least one speech frame, and calculating the short-time energy and the short-time zero-crossing rate of each speech segment; obtaining valid speech segments based on the short-time energy and the short-time zero-crossing rate; performing pre-emphasis on each valid speech segment, framing the pre-emphasized valid speech segments over time to obtain a plurality of frame segments, and calculating the Mel-frequency cepstral coefficients of each frame segment; inputting the Mel-frequency cepstral coefficients into a preset time-delay neural network model, extracting frame-level features using hierarchical residual convolution and a squeeze-and-excitation mechanism, aggregating the frame-level features through attention-based statistics pooling, and obtaining probability parameters through a classification model; and finally obtaining a prediction result by voting-based integration.

Description

A method and device for automatic detection of speech depression based on time-delay neural network

Technical Field

The invention relates to the technical field of speech processing, and in particular to a method and device for automatic speech-based depression detection based on a time-delay neural network.

Background

Depression is a common mental illness, characterized mainly by low mood, slowed thinking and diminished volition, and it has become one of the major health problems worldwide. Another factor that aggravates the harm caused by depression is the lack of objective means of examination for its diagnosis. Assessment and diagnosis rely mainly on a psychiatric examination by a neurologist and therefore depend heavily on the doctor's subjective experience, while the available diagnostic tools are limited to questionnaires and diagnostic scales.

Existing methods for diagnosing depression rely mainly on the doctor's diagnostic experience. They therefore place high demands on that experience, and less experienced doctors find it difficult to guarantee diagnostic quality.

Speech is the most direct way for humans to convey information and carries a great deal of information about a person's state of health. Many studies have shown that the speech characteristics of patients with depression differ significantly from those of healthy people; for example, features such as fundamental frequency, loudness and speech rate change considerably.

Therefore, there is an urgent need for a depression diagnosis method based on artificial intelligence and speech signal processing technology.

Summary of the Invention

In view of this, embodiments of the present invention provide a method and device for automatic speech-based depression detection based on a time-delay neural network, so as to eliminate or mitigate one or more defects of the prior art.

A first aspect of the present invention provides a method for automatic speech-based depression detection based on a time-delay neural network, the method comprising the steps of:

obtaining an initial speech signal, dividing the initial speech signal into a plurality of speech segments, each speech segment comprising at least one speech frame, and separately calculating the short-time energy and short-time zero-crossing rate of each speech segment in the initial speech signal;

acquiring the voiced segments among the speech segments of the initial speech signal based on the short-time energy, acquiring the unvoiced segments among the speech segments of the initial speech signal based on the short-time zero-crossing rate, and combining all voiced segments and unvoiced segments in the initial speech signal to obtain valid speech segments;

performing pre-emphasis on each valid speech segment, framing the pre-emphasized valid speech segments over time to obtain a plurality of frame segments, and calculating the Mel-frequency cepstral coefficients corresponding to each frame segment;

inputting the Mel-frequency cepstral coefficients into a preset time-delay neural network model; calculating, by the feature extraction module of the time-delay neural network model, the feature vectors corresponding to the Mel-frequency cepstral coefficients; calculating, by the feature aggregation module of the time-delay neural network model, the mean and variance corresponding to each feature vector; and inputting the mean and variance corresponding to each feature vector into the classification module of the time-delay neural network model to obtain probability parameters.

With the above scheme, and in contrast to diagnosis with a depression scale, the invention does not rely on the experience of a specialist doctor and requires neither expensive infrastructure nor a complicated operating procedure. The invention extracts speech features as Mel-frequency cepstral coefficients (MFCC) and processes them with a deep learning method: long speech recordings are split into segments, each segment is input to the time-delay neural network to obtain a classification result, and the results are integrated to obtain the depression diagnosis result.
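The integration of per-segment results can be illustrated with a simple majority vote. This is a minimal sketch only: the text does not specify the voting rule, so the function name `vote`, the 0.5 decision threshold and the tie rule are all assumptions made for illustration.

```python
from collections import Counter


def vote(segment_probs, threshold=0.5):
    """Majority vote over per-segment depression probabilities.

    segment_probs: probability of the "depressed" class for each segment
    of one recording (hypothetical input format).
    Returns 1 (depressed) or 0 (not depressed).
    """
    # Turn each segment probability into a hard label.
    labels = [1 if p >= threshold else 0 for p in segment_probs]
    counts = Counter(labels)
    # Ties fall to class 0 here; this tie rule is an assumption.
    return 1 if counts[1] > counts[0] else 0
```

Averaging the segment probabilities before thresholding would be an equally plausible integration rule; the choice affects how strongly a few confident segments can outweigh many uncertain ones.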

In some embodiments of the present invention, the method further comprises, before the step of inputting the Mel-frequency cepstral coefficients into the preset time-delay neural network model, the step of:

performing feature data augmentation on the Mel-frequency cepstral coefficients by spectral masking, and inputting the augmented Mel-frequency cepstral coefficients into the preset time-delay neural network model.

In some embodiments of the present invention, the spectral masking includes, but is not limited to, time-domain masking or frequency-domain masking.

In some embodiments of the present invention, in the step of separately calculating the short-time energy and short-time zero-crossing rate of each speech segment in the initial speech signal, the short-time energy is calculated by a formula in which:

E_x denotes the short-time energy of speech segment x, N denotes the total number of frames in speech segment x, n denotes any one of the N frames, and x[n] denotes the amplitude of the n-th of the N frames.
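A common form of short-time energy that is consistent with the symbols defined above is given below. It is offered as an assumption, not as the formula of the original filing, which is not reproduced in this text.

```latex
E_x = \sum_{n=1}^{N} x[n]^{2}
```

Some formulations divide the sum by N to obtain an average energy; this only rescales the threshold against which E_x is compared.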

In some embodiments of the present invention, in the step of separately calculating the short-time energy and short-time zero-crossing rate of each speech segment in the initial speech signal, the short-time zero-crossing rate is calculated by a formula in which:

Z_x denotes the short-time zero-crossing rate of speech segment x, N denotes the total number of frames in speech segment x, n denotes any one of the N frames, x(n) denotes the amplitude of the n-th of the N frames, x(n-1) denotes the amplitude of the (n-1)-th of the N frames, and sgn denotes the sign function.
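The conventional zero-crossing count matches these definitions and is shown here as an assumed reconstruction, since the original formula is not reproduced in this text. Each sign change contributes 2 to the absolute difference, hence the factor of 1/2.

```latex
Z_x = \frac{1}{2} \sum_{n=2}^{N}
      \left| \operatorname{sgn}\big(x(n)\big) - \operatorname{sgn}\big(x(n-1)\big) \right|,
\qquad
\operatorname{sgn}(a) =
\begin{cases}
 1, & a \ge 0 \\
-1, & a < 0
\end{cases}
```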

In some embodiments of the present invention, the step of acquiring the voiced segments among the speech segments of the initial speech signal based on the short-time energy and acquiring the unvoiced segments among the speech segments of the initial speech signal based on the short-time zero-crossing rate comprises:

presetting a short-time energy threshold and a short-time zero-crossing rate threshold;

obtaining the voiced segments among the speech segments by comparing the short-time energy of each speech segment with the short-time energy threshold;

obtaining the unvoiced segments among the speech segments by comparing the short-time zero-crossing rate of each speech segment with the short-time zero-crossing rate threshold.

In some embodiments of the present invention, pre-emphasis is performed on each valid speech segment according to the following formula:

y(n) = x(n) - αx(n-1)

x(n) denotes the amplitude of the n-th of the N frames, x(n-1) denotes the amplitude of the (n-1)-th of the N frames, y(n) denotes the amplitude of the n-th of the N frames of the valid speech segment after pre-emphasis, and α is the pre-emphasis factor.

In some embodiments of the present invention, in the step of framing the pre-emphasized valid speech segments over time to obtain a plurality of frame segments:

each stretch of a valid speech segment of a first time length is taken as one frame segment, and adjacent frame segments overlap by a second time length.
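The overlapping framing can be sketched as follows. The function name `split_into_frames` and the use of sample counts (rather than milliseconds) for the first and second time lengths are assumptions for illustration.

```python
def split_into_frames(samples, frame_len, overlap):
    """Split a sample sequence into overlapping frames.

    frame_len: samples per frame (the "first time length").
    overlap:   samples shared by adjacent frames (the "second time length").
    Trailing samples that do not fill a whole frame are dropped.
    """
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap  # distance between consecutive frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```

For example, at a 16 kHz sampling rate a 25 ms frame with a 15 ms overlap would correspond to `frame_len=400` and `overlap=240`; these particular values are illustrative and not taken from the text.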

In some embodiments of the present invention, the step of calculating the Mel-frequency cepstral coefficients corresponding to each frame segment comprises:

windowing each frame segment with a window function;

performing a fast Fourier transform on the windowed frame segment to convert the time-domain signal into a frequency-domain signal;

converting the frequency of the frequency-domain signal to the Mel frequency scale with a Mel filter to obtain a Mel-frequency signal;

performing an inverse Fourier transform on the Mel-frequency signal to convert it to the time domain, obtaining the Mel-frequency cepstral coefficients.

In some embodiments of the present invention, the feature extraction module comprises a plurality of consecutive Se-Res2 modules, each provided with a Res2Net layer for convolution, and the feature extraction module extracts frame-level features using hierarchical residual connections; the feature aggregation module comprises an attention layer and calculates the mean and variance corresponding to each feature vector based on the attention mechanism; the classification module comprises a fully connected layer and a Softmax layer connected in sequence, and the Softmax layer outputs the probability parameters.

In some embodiments of the present invention, the convolution in the Res2Net layer adopts hierarchical residual connections: during the one-dimensional dilated convolution the features are split along the channel dimension, depression-related features are extracted at different scales, and the grouped features are then fused. A squeeze-and-excitation module is embedded; it uses global information to assess the importance of each feature channel, that is, it learns weights that characterize the importance of each channel and uses them to rescale the per-channel features output by the convolution, emphasizing the information that is more critical for depression diagnosis and suppressing irrelevant, redundant information.

In some embodiments of the present invention, in the step in which the feature extraction module extracts frame-level features using hierarchical residual connections:

the Mel-frequency cepstral coefficients are resized by a one-dimensional convolution into a first feature map, and the first feature map is input into the Se-Res2 modules. Each Se-Res2 module divides its input evenly into four feature sub-maps and convolves them separately, concatenates the convolved feature sub-maps, and passes the concatenated feature map through another one-dimensional convolution to obtain the output of the hierarchical residual convolution; the last Se-Res2 module outputs a second feature map. The feature sub-maps are convolved according to a formula in which:

y_i denotes the convolved feature sub-map, i denotes the index of the feature sub-map, y_{i-1} denotes the feature sub-map obtained by convolving the (i-1)-th feature sub-map x_{i-1}, and K_i denotes the 3x3 convolution corresponding to the i-th feature sub-map x_i.
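A hierarchical residual form consistent with these definitions and with the Res2Net design that the text names is sketched below. It is an assumption: the original formula is not reproduced here, and the published Res2Net formulation additionally treats the second sub-map as a separate case, y_2 = K_2(x_2).

```latex
y_i =
\begin{cases}
x_i, & i = 1 \\
K_i\left(x_i + y_{i-1}\right), & 1 < i \le 4
\end{cases}
```

Because each y_i depends on y_{i-1}, later sub-maps see a progressively larger receptive field, which is what allows features to be extracted at several scales within one module.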

In some embodiments of the present invention, in the step in which the last Se-Res2 module outputs the second feature map, the second feature map output by the last Se-Res2 module is adjusted by a preset squeeze-and-excitation module, specifically comprising the steps of:

obtaining a weight factor from the preset squeeze-and-excitation module and applying the weight factor to the output feature map of each Se-Res2 module, including the second feature map output by the last Se-Res2 module, to obtain an adjusted second feature map;

the weight factor being obtained from the preset squeeze-and-excitation module according to the following formula:

s = σ₁(W₂f₁(W₁z + b₁) + b₂)

z is the channel descriptor, R denotes the total number of frames of the first feature map, r denotes the r-th of the R frames, γ_r denotes the feature vector of the r-th frame of the first feature map, W₁, W₂, b₁ and b₂ are the parameters of the two fully connected layers, f₁ is the ReLU activation function, σ₁ is the sigmoid activation function, and s denotes the weight factor.
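The detailed description states that the channel descriptor is obtained by averaging the feature values over the time axis. Written with the symbols defined above, that averaging would take the following form; the exact notation of the original formula is an assumption.

```latex
z = \frac{1}{R} \sum_{r=1}^{R} \gamma_r
```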

In some embodiments of the present invention, in the step of calculating the mean and variance corresponding to each feature vector based on the attention mechanism:

the scaling factor corresponding to each feature vector is calculated according to the following formula and then normalized:

e_t = v^T f₂(W h_t + b) + k

e_t denotes the attention score of the t-th frame segment, f₂ denotes a nonlinear activation function, W denotes a weight parameter, h_t denotes the feature vector of the t-th frame segment, b denotes a bias parameter, v^T and k are parameters learned by a preset linear layer, α_t denotes the attention score after softmax normalization, and T denotes the total number of frame segments;
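The softmax normalization mentioned above is not written out in this text. Its standard form over the T frame segments, given here as an assumed reconstruction, is:

```latex
\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}
```

The weights α_t are non-negative and sum to 1 over the frame segments, so they can be used directly as weights in a weighted mean.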

the mean and variance corresponding to each feature vector are calculated from the scaling factors by formulas in which:

μ denotes the mean, σ² denotes the variance, and t denotes the t-th frame segment.
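A weighted mean and variance consistent with these symbols are sketched below, following the usual attentive statistics pooling formulation. This is an assumption, as the original formulas are not reproduced in this text; ⊙ denotes the element-wise product.

```latex
\mu = \sum_{t=1}^{T} \alpha_t \, h_t,
\qquad
\sigma^{2} = \sum_{t=1}^{T} \alpha_t \, h_t \odot h_t \;-\; \mu \odot \mu
```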

A second aspect of the present invention provides a device for automatic speech-based depression detection based on a time-delay neural network. The device comprises a computer device comprising a processor and a memory; the memory stores computer instructions, the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the device implements the steps of the above method.

A third aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the aforementioned method for automatic speech-based depression detection based on a time-delay neural network.

Additional advantages, objects and features of the present invention will be set forth in part in the description that follows, and in part will become apparent to those of ordinary skill in the art upon study of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by what is particularly pointed out in the description and the drawings.

Those skilled in the art will appreciate that the objects and advantages that can be achieved with the present invention are not limited to those specifically described above, and that the above and other objects that the present invention can achieve will be understood more clearly from the following detailed description.

Brief Description of the Drawings

The accompanying drawings described here provide a further understanding of the present invention and constitute a part of this application; they do not limit the present invention.

FIG. 1 is a schematic diagram of one embodiment of the method for automatic speech-based depression detection based on a time-delay neural network according to the present invention;

FIG. 2 is a schematic diagram of the overall framework of the method for automatic speech-based depression detection based on a time-delay neural network according to the present invention;

FIG. 3 is a schematic flowchart of obtaining the Mel-frequency cepstral coefficients according to the present invention;

FIG. 4 is a schematic diagram of the processing steps of the time-delay neural network model of the present invention;

FIG. 5 is a schematic diagram of the spectrum obtained with triangular filters.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the present invention and their descriptions serve to explain the present invention and do not limit it.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution according to the present invention, and omit other details of little relevance to the invention.

It should be emphasized that the term "comprising/including", when used herein, specifies the presence of features, elements, steps or components, but does not exclude the presence or addition of one or more other features, elements, steps or components.

It should also be noted that, unless otherwise specified, the term "connection" herein may refer not only to a direct connection but also to an indirect connection through an intermediate.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar components, or the same or similar steps.

To solve the above problems, as shown in FIGS. 1, 2 and 3, the present invention proposes a method for automatic speech-based depression detection based on a time-delay neural network, the method comprising the following steps:

Step S100: obtaining an initial speech signal, dividing the initial speech signal into a plurality of speech segments, each speech segment comprising at least one speech frame, and separately calculating the short-time energy and short-time zero-crossing rate of each speech segment in the initial speech signal;

In some embodiments of the present invention, the duration of a speech frame may be 20 ms, 30 ms, 50 ms, or the like.

Step S200: acquiring the voiced segments among the speech segments of the initial speech signal based on the short-time energy, acquiring the unvoiced segments among the speech segments of the initial speech signal based on the short-time zero-crossing rate, and combining all voiced segments and unvoiced segments in the initial speech signal to obtain valid speech segments;

In some embodiments of the present invention, the short-time energy represents the average energy of the speech signal, and the short-time zero-crossing rate represents the number of times the waveform of one frame of the speech signal crosses the horizontal axis.

With the above scheme, speech can be divided into unvoiced, voiced and noise portions, and the noise segments need to be removed. The short-time energy of voiced sound is markedly higher than that of unvoiced sound and noise, and the short-time zero-crossing rate of unvoiced sound is higher than that of noise, so noise segments can be removed accurately by setting suitable thresholds.
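The selection performed in steps S100 and S200 can be sketched as below. This is a minimal illustration under stated assumptions: the energy is taken as a plain sum of squares, a segment is kept when either threshold is exceeded, and all function names and threshold values are hypothetical.

```python
def short_time_energy(segment):
    """Sum of squared amplitudes of one speech segment."""
    return sum(v * v for v in segment)


def sign(v):
    """Sign function with sgn(0) taken as +1."""
    return 1 if v >= 0 else -1


def zero_crossing_rate(segment):
    """Number of sign changes between consecutive amplitudes."""
    return sum(abs(sign(segment[n]) - sign(segment[n - 1]))
               for n in range(1, len(segment))) / 2


def select_valid_segments(segments, energy_threshold, zcr_threshold):
    """Keep voiced (high-energy) and unvoiced (high zero-crossing) segments."""
    valid = []
    for seg in segments:
        voiced = short_time_energy(seg) > energy_threshold
        unvoiced = zero_crossing_rate(seg) > zcr_threshold
        if voiced or unvoiced:
            valid.append(seg)  # anything else is treated as noise and dropped
    return valid
```

In practice the thresholds would be tuned on recordings from the target environment, since both quantities depend on microphone gain and segment length.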

Step S300: performing pre-emphasis on each valid speech segment, framing the pre-emphasized valid speech segments over time to obtain a plurality of frame segments, and calculating the Mel-frequency cepstral coefficients corresponding to each frame segment;

In some embodiments of the present invention, pre-emphasis boosts the high-frequency part of the speech signal and flattens the signal's spectrum, which facilitates analysis of the spectrum or of vocal tract parameters.
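The pre-emphasis filter y(n) = x(n) - αx(n-1) can be sketched as follows. The default α = 0.97 is a commonly used value and an assumption here, as is passing the first sample through unchanged; neither is specified in the text.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1) to a sequence of amplitudes.

    The first sample has no predecessor and is copied unchanged
    (an assumption for this sketch).
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

A constant input is attenuated to (1 - α) of its value after the first sample, while rapidly alternating input is amplified, which is the high-frequency boost described above.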

In some embodiments of the present invention, the duration of a frame segment may be 20 ms, 30 ms, 50 ms, or the like.

In some embodiments of the present invention, the Mel-frequency cepstral coefficients corresponding to each frame segment may be calculated by means of a Mel filter.
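The MFCC computation for one frame segment can be sketched in plain Python as below. Several choices are assumptions rather than details from the text: a Hamming window, a naive DFT in place of an FFT, the 2595·log10(1 + f/700) mel mapping, 20 triangular filters, 13 coefficients, and a logarithm followed by a DCT-II as the conventional realization of the final inverse-transform step.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [frame[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i in range(n)]
    spec = []
    for k in range(n // 2 + 1):
        # Naive O(N^2) DFT; an FFT gives the same values faster.
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    """Return num_coeffs cepstral coefficients for one frame segment."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Small constant avoids log(0) for empty filters.
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    # DCT-II maps the log mel energies to the cepstral domain.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energies))
            for c in range(num_coeffs)]
```

Stacking the coefficient vectors of consecutive frame segments gives the two-dimensional MFCC feature map that is passed to the network.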

As shown in FIG. 4, step S400: inputting the Mel-frequency cepstral coefficients into a preset time-delay neural network model; calculating, by the feature extraction module of the time-delay neural network model, the feature vectors corresponding to the Mel-frequency cepstral coefficients; calculating, by the feature aggregation module of the time-delay neural network model, the mean and variance corresponding to each feature vector; and inputting the mean and variance corresponding to each feature vector into the classification module of the time-delay neural network model to obtain probability parameters.

In some embodiments of the present invention, the feature extraction module is implemented as a time-delay neural network and consists of three consecutive Se-Res2 modules with gradually increasing stride. A time-delay neural network is a neural network architecture that splices together the features of past, current and future frames, thereby introducing temporal information. The network can be implemented as a multi-layer one-dimensional dilated convolutional neural network, which reduces the overall number of parameters and the amount of computation.

The Mel-frequency cepstral coefficients are resized by a one-dimensional convolution into a first feature map, and the first feature map is input into the Se-Res2 modules. Each Se-Res2 module divides its input evenly into four feature sub-maps and convolves them separately, concatenates the convolved feature sub-maps, and passes the concatenated feature map through another one-dimensional convolution to obtain the output of the hierarchical residual convolution; the last Se-Res2 module outputs a second feature map.

In some embodiments of the present invention, the feature vector input to the feature aggregation module is the adjusted second feature map.

In the step in which the last Se-Res2 module outputs the second feature map, the second feature map output by the last Se-Res2 module is adjusted by a preset squeeze-and-excitation module, specifically comprising the following step:

A weight factor is obtained from the preset squeeze-and-excitation module and applied to the output feature map of each Se-Res2 module, including the second feature map output by the last Se-Res2 module, to obtain the adjusted second feature map. The output of each Se-Res2 module serves as the input of the next Se-Res2 module, the output of every Se-Res2 module is weighted by the weight factor, and the second feature map output by the last Se-Res2 module is adjusted by the same weighting.

The feature extraction module adopts the hierarchical residual connections of Res2Net: during the one-dimensional dilated convolution the features are split along the channel dimension, depression-related features are extracted at different scales, and the grouped features are then fused, which improves the expressive power of the network. Specifically, in each convolution module the input feature map passes through a one-dimensional convolution and is then divided in order into four parts, denoted x_i, i∈{1,2,3,4}. Except for feature sub-map x₁, each feature sub-map x_i passes through a 3x3 convolution and is combined with the convolved output of the preceding feature sub-map. For each feature sub-map x_i, the corresponding output y_i is given by a formula in which K_i denotes the 3x3 convolution corresponding to the i-th feature sub-map x_i, y_i denotes the convolved feature sub-map, i denotes the index of the feature sub-map, and y_{i-1} denotes the feature sub-map obtained by convolving the (i-1)-th feature sub-map x_{i-1}.

The convolved feature sub-maps are merged and then passed through a one-dimensional convolution to obtain the output of the hierarchical residual convolution.
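The split, hierarchical convolution and merge can be illustrated with a toy sketch. It assumes the recursion y_1 = x_1 and y_i = K_i(x_i + y_{i-1}) for i > 1, and it stands in for each K_i with a single one-dimensional kernel applied per channel; a trained layer would mix channels and use dilation. The names `conv1d_same` and `res2_block` are hypothetical.

```python
def conv1d_same(channel, kernel):
    """Zero-padded 1-D cross-correlation that keeps the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(channel) + [0.0] * pad
    return [sum(k * padded[t + j] for j, k in enumerate(kernel))
            for t in range(len(channel))]


def res2_block(feature_map, kernels, scale=4):
    """Hierarchical residual convolution over channel groups.

    feature_map: list of channels, each a list over time; the channel
                 count must be divisible by `scale`.
    kernels:     scale - 1 kernels, stand-ins for K_2 .. K_scale.
    """
    group_size = len(feature_map) // scale
    groups = [feature_map[i * group_size:(i + 1) * group_size]
              for i in range(scale)]
    outputs = [groups[0]]  # y_1 = x_1 (passed through unchanged)
    for i in range(1, scale):
        prev = outputs[-1]
        # x_i + y_{i-1}, channel by channel
        summed = [[a + b for a, b in zip(xc, yc)]
                  for xc, yc in zip(groups[i], prev)]
        outputs.append([conv1d_same(ch, kernels[i - 1]) for ch in summed])
    # Concatenate the groups back along the channel dimension.
    return [ch for group in outputs for ch in group]
```

The final one-dimensional convolution that fuses the concatenated groups is omitted here to keep the sketch short.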

采用上述方案，在一维空洞卷积中嵌入压缩激励模块，该模块利用全局信息评估各个特征通道的重要程度，即学习到表征各个通道重要程度的权重信息，重新调整卷积后输出的各个通道的特征，实现突出对抑郁诊断更为关键的信息，抑制无关冗余信息。压缩激励模块分为压缩即全局信息嵌入和激励即自适应重调两部分。全局信息嵌入即将时间域上的特征值取平均得到通道描述符z，如下所示：Using the above scheme, a compressed excitation module is embedded in the one-dimensional hole convolution, which uses global information to evaluate the importance of each feature channel, that is, learns the weight information representing the importance of each channel, and readjusts the output of each channel after the convolution. characteristics, to achieve highlighting the more critical information for the diagnosis of depression, and suppress irrelevant redundant information. The compression excitation module is divided into two parts: compression, namely global information embedding and excitation, namely adaptive retuning. The global information embedding is to average the eigenvalues in the time domain to obtain the channel descriptor z, as shown below:

In adaptive recalibration, a fully connected layer with a sigmoid activation function is used to obtain the normalized weight factor s, which represents the importance of each channel:

s = σ(W₂ f(W₁z + b₁) + b₂)

Here W₁, W₂, b₁ and b₂ are the parameters of the two fully connected layers, f is the ReLU activation function, and σ is the sigmoid activation function.

Finally, the weight factors are applied to the features of each channel, completing the recalibration of the original features along the channel dimension.
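A minimal pure-Python sketch of the squeeze, excitation and rescaling steps follows. It assumes the features are stored as one list of time values per channel and that the two fully connected layers are given as plain weight matrices; the function names and the tiny weight values are hypothetical and chosen only for illustration.

```python
import math


def matvec(weights, vector, bias):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite(features, w1, b1, w2, b2):
    """features: list of channels, each a list of values over time."""
    # Squeeze: average each channel over time to get the channel descriptor z
    z = [sum(channel) / len(channel) for channel in features]
    # Excitation: s = sigmoid(W2 * relu(W1 z + b1) + b2)
    hidden = [max(0.0, h) for h in matvec(w1, z, b1)]
    s = [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden, b2)]
    # Recalibration: scale every channel by its weight factor
    scaled = [[weight * value for value in channel]
              for weight, channel in zip(s, features)]
    return scaled, s


feats = [[1.0, 3.0], [2.0, 4.0]]            # 2 channels, 2 frames
scaled, s = squeeze_excite(feats,
                           [[0.0, 0.0]], [0.0],       # W1 (1x2), b1
                           [[0.0], [0.0]], [0.0, 0.0])  # W2 (2x1), b2
```

With all-zero weights the sigmoid yields 0.5 for every channel, so each channel is simply halved; trained weights would instead emphasize some channels over others.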

In some embodiments of the present invention, the feature aggregation module maps the convolved frame-level features to fixed-length segment-level features by computing the mean and variance of the frame-level features.

With the above scheme, the feature aggregation module stacks the mean and variance of each channel of the frame-level features, mapping the frame-level representation produced by the feature extraction module to a representation of the whole speech segment. An attention mechanism is introduced into this computation: some speech frames contain more depression cues and have a greater influence on the final result, and the attention mechanism assigns these important frames higher weights.

In some embodiments of the present invention, the classification module comprises two fully connected layers and a Softmax layer, and outputs the probability that the speech belongs to a person with depression or to a healthy person.

(1) Time-domain masking: several adjacent frames in the Mel-frequency cepstral coefficient spectrogram are replaced with 0;

(2) Frequency-domain masking: analogous to time-domain masking, several adjacent frequency bands are replaced with 0 in the frequency domain.

With the above scheme, spectral masking is used for feature data augmentation; data augmentation expands the number of data samples and improves the performance of the deep learning model.
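The two masks can be illustrated with a short sketch. It assumes the MFCC spectrogram is stored as a list of frames, each a list of coefficients; in practice the mask position and width would usually be drawn at random, but here they are passed in explicitly so the behaviour is easy to follow. The function names are hypothetical.

```python
def time_mask(mfcc, start, width):
    """Return a copy with frames start .. start+width-1 replaced by zeros."""
    return [[0.0] * len(frame) if start <= t < start + width else list(frame)
            for t, frame in enumerate(mfcc)]


def freq_mask(mfcc, start, width):
    """Return a copy with coefficient bands start .. start+width-1 zeroed in every frame."""
    return [[0.0 if start <= k < start + width else value
             for k, value in enumerate(frame)]
            for frame in mfcc]


mfcc = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
masked_in_time = time_mask(mfcc, start=1, width=1)
masked_in_freq = freq_mask(mfcc, start=0, width=2)
```

Both functions return new lists and leave the input untouched, so one original sample can yield several augmented copies.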

Z_x denotes the short-time zero-crossing rate of speech segment x, N denotes the total number of frames in speech segment x, n denotes any one of the N frames, x(n) denotes the amplitude of the n-th frame of the N frames, x(n-1) denotes the amplitude of the (n-1)-th frame of the N frames, and sgn denotes the sign function;
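The two quantities can be computed as in the sketch below. The formulas themselves are not reproduced in this text, so the sketch assumes the common definitions: short-time energy as the sum of squared amplitudes, and zero-crossing rate as half the sum of |sgn(x(n)) − sgn(x(n−1))|. Treating sgn(0) as +1 and omitting any 1/N normalization are further assumptions.

```python
def sgn(value):
    """Sign function; zero is treated as positive (an assumed convention)."""
    return 1 if value >= 0 else -1


def short_time_energy(segment):
    """Sum of squared amplitudes over the segment (assumed definition of E_x)."""
    return sum(v * v for v in segment)


def zero_crossing_rate(segment):
    """Half the summed sign changes between neighbouring values (assumed definition of Z_x)."""
    return 0.5 * sum(abs(sgn(segment[n]) - sgn(segment[n - 1]))
                     for n in range(1, len(segment)))


segment = [1.0, -1.0, 1.0, -1.0]
energy = short_time_energy(segment)
zcr = zero_crossing_rate(segment)
```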

A short-time energy threshold and a short-time zero-crossing rate threshold are preset;

In some embodiments of the present invention, the short-time energy value may be compared directly with the short-time energy threshold, and the short-time zero-crossing rate value with the short-time zero-crossing rate threshold, to obtain the voiced segments or unvoiced segments in the speech segment;

Alternatively, the following approach may be used:

A high short-time energy threshold T1 and a low threshold T2 are set for a first, preliminary decision. A start point and an end point are first determined using the high threshold T1; the search then extends leftward from the start point and rightward from the end point using T2, widening the selected speech range. Using two thresholds allows continuous voiced segments to be detected effectively;

A threshold T3 is set according to the short-time zero-crossing rate of the noise; the range selected in the previous step is extended forward and backward once more, and overlapping regions are merged. The resulting range is the sound-bearing portion of the original speech with noise excluded; that is, all voiced segments and unvoiced segments in the initial speech signal are combined to obtain the effective speech segments.
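The three-threshold search described above can be sketched as follows. The sketch assumes the short-time energy and zero-crossing rate have already been computed per frame and are passed in as two equal-length lists; the function name `detect_segments`, the strict "greater than" comparisons, and the choice to merge touching as well as overlapping ranges are illustrative assumptions.

```python
def detect_segments(energy, zcr, t1, t2, t3):
    """Return (start, end) frame-index ranges (inclusive) of effective speech."""
    n = len(energy)
    segments = []
    i = 0
    while i < n:
        if energy[i] <= t1:
            i += 1
            continue
        # Initial range: consecutive frames above the high energy threshold T1
        start = end = i
        while end + 1 < n and energy[end + 1] > t1:
            end += 1
        # Widen left and right while energy stays above the low threshold T2
        while start > 0 and energy[start - 1] > t2:
            start -= 1
        while end + 1 < n and energy[end + 1] > t2:
            end += 1
        # Widen again while the zero-crossing rate exceeds T3 (unvoiced sounds)
        while start > 0 and zcr[start - 1] > t3:
            start -= 1
        while end + 1 < n and zcr[end + 1] > t3:
            end += 1
        # Merge with the previous range if they overlap or touch
        if segments and start <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
        else:
            segments.append((start, end))
        i = end + 1
    return segments


energy = [0.0, 1.0, 5.0, 5.0, 1.0, 0.0, 0.0]
zcr = [0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0]
found = detect_segments(energy, zcr, t1=4.0, t2=0.5, t3=5.0)
```

In this toy input the T1 pass selects frames 2 to 3, the T2 pass widens the range to frames 1 to 4, and the zero-crossing pass adds frame 5.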

y(n)=x(n)-αx(n-1)

x(n) denotes the amplitude of the n-th frame of the N frames, x(n-1) denotes the amplitude of the (n-1)-th frame of the N frames, y(n) is the amplitude of the n-th frame of the effective speech segment after pre-emphasis, and α is the pre-emphasis factor; in some embodiments of the present invention, α=0.97.
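As a small illustration, the pre-emphasis filter y(n)=x(n)-αx(n-1) can be written directly in Python. Passing the first value through unchanged (because it has no predecessor) is an assumption of this sketch, and the function name is hypothetical.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1); the first value is kept as-is."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


emphasized = pre_emphasis([1.0, 1.0, 2.0])
```

A constant stretch of the signal is strongly attenuated while a jump is largely preserved, which is how the filter boosts high-frequency content relative to low-frequency content.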

In some embodiments of the present invention, in the step of framing the pre-emphasized effective speech segments based on time to obtain a plurality of frame segments, the signal is divided into frame segments of 25 ms each. To prevent the difference between two adjacent frames from being too large and boundary information from being lost, adjacent frames overlap by 10 ms.

With the above scheme, boundary information is effectively preserved.
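The framing step can be sketched as below, using the 25 ms frame length and 10 ms overlap stated above. The 16 kHz sampling rate is an assumption (the text does not state it), as is the decision to drop a trailing partial frame; the function name is hypothetical.

```python
def frame_signal(signal, sample_rate=16000, frame_ms=25, overlap_ms=10):
    """Split a signal into fixed-length frames whose neighbours share overlap_ms of audio."""
    frame_len = sample_rate * frame_ms // 1000          # 400 samples at 16 kHz
    hop = frame_len - sample_rate * overlap_ms // 1000  # 240 samples at 16 kHz
    # A trailing partial frame is discarded (assumed behaviour)
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


samples = list(range(1000))
frames = frame_signal(samples)
```

With these settings each frame starts 15 ms after the previous one, so the last 10 ms of one frame are repeated at the start of the next.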

With the above scheme, after the signal is framed, each frame is multiplied by a window function, with values outside the window set to 0, to eliminate the signal discontinuities that may arise at the two ends of each frame;

Fast Fourier transform: the time-domain signal is converted to the frequency domain for subsequent frequency-domain analysis. Because the characteristics of a signal are hard to discern from its variation in the time domain, the fast Fourier transform is applied to obtain the energy distribution over frequency; different energy distributions represent different speech characteristics;

Discrete cosine transform: an inverse Fourier transform is performed here, converting the resulting Mel-frequency-domain signal to the time domain to obtain the Mel-frequency cepstral coefficients.
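The final transform can be illustrated with a small sketch. The exact formula is not reproduced in this text, so the sketch assumes the widely used DCT-II over the logarithmic filter-bank energies s(m), that is, c(g) = Σ s(m)·cos(πg(m − 0.5)/M) for m = 1..M, without any scaling factor. The function name is hypothetical.

```python
import math


def dct_cepstrum(log_energies, num_coeffs):
    """DCT-II of the log filter-bank energies; returns the first num_coeffs coefficients."""
    m_total = len(log_energies)
    return [
        # enumerate() is 0-based, so (m + 0.5) equals (m_1based - 0.5)
        sum(s * math.cos(math.pi * g * (m + 0.5) / m_total)
            for m, s in enumerate(log_energies))
        for g in range(num_coeffs)
    ]


coeffs = dct_cepstrum([1.0, 1.0, 1.0, 1.0], num_coeffs=3)
```

For a flat input all the energy lands in coefficient 0 and the higher coefficients vanish, which shows how the transform separates the overall level from the spectral shape.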

In some embodiments of the present invention, a Hamming window function is used in the step of windowing each frame segment based on a window function;

The value of the Hamming window function is obtained according to the following formula:

w(a)=(1-α)-βcos[2πa/(A-1)];

w(a) denotes the value of the Hamming window function, A denotes the window length, a is any position within the window, and β is the window parameter.
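The window formula can be evaluated as in the sketch below. The text does not give numerical values for α and β; the sketch assumes α = β = 0.46, which yields the classic Hamming window 0.54 − 0.46·cos(2πa/(A−1)). The function names are hypothetical.

```python
import math


def hamming_window(length, alpha=0.46, beta=0.46):
    """w(a) = (1 - alpha) - beta * cos(2*pi*a / (A - 1)) for a = 0 .. A-1."""
    return [(1 - alpha) - beta * math.cos(2 * math.pi * a / (length - 1))
            for a in range(length)]


def apply_window(frame, window):
    """Multiply a frame by the window, value by value."""
    return [x * w for x, w in zip(frame, window)]


window = hamming_window(5)
windowed = apply_window([2.0] * 5, window)
```

The window is close to zero at both ends and reaches 1 in the middle, so the edges of each frame are tapered before the Fourier transform.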

In some embodiments of the present invention, the fast Fourier transform is applied to the windowed frame segments according to the following formula:

δ(a) denotes the amplitude at position a within the window length, and δ_a(k) denotes the value obtained after the fast Fourier transform.

In some embodiments of the present invention, the frequency of the frequency-domain signal is converted to the Mel frequency by means of Mel filters according to the following formula:

H_m(k) denotes the frequency response of the Mel filter, M is the number of filters, and 0≤m≤M. A minimum frequency of 300 Hz and a maximum frequency of 8 kHz are taken and converted to the Mel scale, giving 401.25 Mel and 2834.99 Mel respectively. M points are selected at equal intervals between the minimum and maximum frequencies and defined as f(1), f(2), ..., f(M), so that f(0)=401.25, f(M+1)=2834.99 and f(0)<k<f(M+1). s(m) denotes the logarithmic energy output by the filter bank, C(m) denotes the Mel frequency, and g denotes the Mel cepstral coefficient.

Mel filter bank filtering: because the human ear is not equally sensitive to signals of different frequencies and generally attends more to low-frequency signals, a Mel filter bank is used to convert the original frequency signal to the Mel frequency; the triangular filters are shown in Figure 5.
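The Mel values quoted above can be checked with a short sketch. The conversion formula is not reproduced in this text; the sketch assumes mel = 1125·ln(1 + f/700), which matches the quoted 401.25 Mel and 2834.99 Mel to within rounding. The number of filters (26) and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Assumed Hz-to-Mel conversion: 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def mel_points(f_min_hz, f_max_hz, num_filters):
    """Return f(0) .. f(M+1): M+2 equally spaced points on the Mel scale."""
    low, high = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (high - low) / (num_filters + 1)
    return [low + i * step for i in range(num_filters + 2)]


points = mel_points(300.0, 8000.0, num_filters=26)
centres_hz = [mel_to_hz(p) for p in points]
```

Because the points are evenly spaced in Mel rather than in Hz, the corresponding triangular filters are narrow at low frequencies and wide at high frequencies.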

As shown in Figure 4, in some embodiments of the present invention, the feature extraction module comprises a plurality of consecutive Se-Res2 modules, each provided with a Res2Net layer for convolution processing; the feature aggregation module comprises an attention mechanism layer and computes the mean and variance corresponding to each feature vector based on the attention mechanism; the classification module comprises fully connected layers and a Softmax layer connected in sequence, and the Softmax layer outputs the probability parameter.

During processing, the feature extraction module first averages the feature values over the time domain to produce the channel descriptor z, then computes the weight of each channel, and finally multiplies the original features by the weight values s to obtain the weighted features.

e_t = v^T f₂(Wh_t + b) + k;

With the above scheme, the feature aggregation module maps the convolved frame-level features to fixed-length segment-level features by computing the mean and variance of the frame-level features, and an attention mechanism is introduced into this computation. Some speech frames contain more depression cues and have a greater influence on the final result; the attention mechanism assigns these important frames higher weights.
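The attention-weighted statistics can be sketched in pure Python. The score follows e_t = v^T f₂(Wh_t + b) + k as given above; taking f₂ to be tanh, normalizing the scores with softmax, and computing the variance as Σ α_t·h_t² − μ² are assumptions of this sketch (the pooling formulas are not reproduced in this text). The function name is hypothetical.

```python
import math


def attentive_stats(frames, w, b, v, k):
    """frames: list of frame-level feature vectors h_t. Returns (mean, variance, weights)."""
    scores = []
    for h in frames:
        # e_t = v^T tanh(W h_t + b) + k
        hidden = [math.tanh(sum(wi * x for wi, x in zip(row, h)) + bias)
                  for row, bias in zip(w, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)) + k)
    # Softmax over time (shifted by the maximum for numerical stability)
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    mean = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    variance = [sum(a * h[d] ** 2 for a, h in zip(alphas, frames)) - mean[d] ** 2
                for d in range(dim)]
    return mean, variance, alphas


mean, variance, alphas = attentive_stats([[1.0], [3.0]],
                                         w=[[0.0]], b=[0.0], v=[0.0], k=0.0)
```

With all-zero attention parameters every frame receives the same weight, so the result reduces to the ordinary mean and variance; trained parameters would shift weight toward the more informative frames.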

The time-delay neural network model of the present application uses a cross-entropy loss function to compute the error between the network output and the true value, propagates the error with the back-propagation algorithm, and iteratively optimizes and updates the network parameters.

In some embodiments of the present invention, the time-delay neural network model of the present application may be an ECAPA-TDNN network model.

In some embodiments of the present invention, because of the input-length limit of ECAPA-TDNN, each person's speech is split into multiple initial speech signals. Each initial speech signal is fed to the model, which outputs a probability parameter for that signal. Since each person thus yields many speech segments, the per-segment predictions for one person are combined by voting to determine whether that person suffers from depression.
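A minimal sketch of the voting step is shown below. It assumes each segment is labelled as depressed when its probability exceeds 0.5 and that a tie counts as "not depressed"; neither detail is specified in the text, and the function name is hypothetical.

```python
def majority_vote(segment_probs, threshold=0.5):
    """Return True if more than half of the segments are predicted as depressed."""
    votes = sum(1 for p in segment_probs if p > threshold)
    return votes * 2 > len(segment_probs)   # a tie is treated as False (assumption)


person_a = majority_vote([0.9, 0.8, 0.2])
person_b = majority_vote([0.1, 0.6, 0.3])
```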

In some embodiments of the present invention, in the step of combining multiple results to determine whether the person suffers from depression, a larger weight may instead be assigned to longer initial speech signals, and the weighted average of the probability parameters of the multiple results is computed to obtain the final prediction parameter, which is compared with a preset prediction threshold to determine whether the person suffers from depression;

Specifically, if the prediction parameter is greater than the preset prediction threshold, the person is determined to suffer from depression;

If the prediction parameter is not greater than the preset prediction threshold, the person is determined not to suffer from depression.
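The weighted alternative can be sketched as follows. The text only says that longer signals receive larger weights; making the weight directly proportional to duration, and the threshold value of 0.5, are assumptions of this sketch, and the function name is hypothetical.

```python
def weighted_decision(segment_probs, durations, threshold=0.5):
    """Duration-weighted average of the segment probabilities, compared with a threshold."""
    total = sum(durations)
    score = sum(p * d for p, d in zip(segment_probs, durations)) / total
    # Strictly greater than the threshold means "depressed", as stated above
    return score, score > threshold


score, depressed = weighted_decision([0.9, 0.2], durations=[3.0, 1.0])
```

Here the 3-second segment dominates the 1-second segment, so the combined score stays close to the longer segment's probability.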

In summary, speech data for depression is easy to obtain: it suffices to record the interview between patient and doctor conducted according to the diagnostic procedure, which is convenient and fast. The average prediction accuracy in the experiments of the present invention is 90.3%, and the results vary only slightly across five repetitions of the experiment, showing good stability and accuracy in predicting depression and demonstrating the effectiveness of the method. The invention applies artificial intelligence and speech signal processing technology to a practical medical problem and has high practical value.

A second aspect of the present invention provides a device for automatic detection of speech depression based on a time-delay neural network. The device comprises a computer device that includes a processor and a memory; computer instructions are stored in the memory, the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor the device implements the steps of the above method.

A third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the aforementioned method for automatic detection of speech depression based on a time-delay neural network. The computer-readable storage medium may be a tangible storage medium such as random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a floppy disk, a hard disk, a removable storage disk, a CD-ROM, or any other form of storage medium known in the art.

Those of ordinary skill in the art will appreciate that the exemplary components, systems and methods described in connection with the embodiments disclosed herein can be implemented in hardware, software, or a combination of the two. Whether they are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as going beyond the scope of the present invention. When implemented in hardware, they may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the present invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.

It should be made clear that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method of the present invention is not limited to the specific steps described and shown; after grasping the spirit of the present invention, those skilled in the art may make various changes, modifications and additions, or change the order of the steps.

In the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, and/or combined with or substituted for features of other embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations to the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for automatic detection of speech depression based on a time-delay neural network, characterized by comprising the following steps:

acquiring an initial voice signal, dividing the initial voice signal into a plurality of voice sections, wherein each voice section comprises at least one voice frame, and respectively calculating the short-time energy and the short-time zero crossing rate of each voice section in the initial voice signal;

obtaining a voiced segment in a voice segment of the initial voice signal based on the short-time energy, obtaining an unvoiced segment in the voice segment of the initial voice signal based on the short-time zero crossing rate, and combining all voiced segments and unvoiced segments in the initial voice signal to obtain an effective voice segment;

carrying out pre-emphasis processing on each effective voice fragment, framing the pre-emphasized effective voice fragments based on time to obtain a plurality of frame fragments, and calculating a Mel frequency cepstrum coefficient corresponding to each frame fragment;

inputting the Mel frequency cepstrum coefficient into a preset time delay neural network model, calculating to obtain a feature vector corresponding to the Mel frequency cepstrum coefficient based on a feature extraction module of the time delay neural network model, calculating to obtain a mean value and a variance corresponding to each feature vector based on a feature aggregation module of the time delay neural network model, and inputting the mean value and the variance corresponding to each feature vector into a classification module of the time delay neural network model to obtain a probability parameter.

2. The method for automatically detecting speech depression based on time delay neural network as claimed in claim 1, further comprising a step before the step of inputting the Mel frequency cepstral coefficient into a preset time delay neural network model,

and enhancing the characteristic data of the Mel frequency cepstrum coefficient through a frequency spectrum mask, and inputting the enhanced Mel frequency cepstrum coefficient into a preset time delay neural network model.

3. The method according to claim 1, wherein in the step of separately calculating the short-term energy and the short-term zero-crossing rate of each speech segment in the initial speech signal, the short-term energy is calculated based on the following formula:

E_x denotes the short-time energy of speech segment x, N denotes the total number of frames in speech segment x, n denotes any one of the N frames, and x[n] denotes the amplitude of the n-th frame of the N frames;

in the step of calculating the short-time energy and the short-time zero-crossing rate of each speech segment in the initial speech signal respectively, the short-time zero-crossing rate is calculated based on the following formula:

Z_x denotes the short-time zero-crossing rate of speech segment x, N denotes the total number of frames in speech segment x, n denotes any one of the N frames, x(n) denotes the amplitude of the n-th frame of the N frames, x(n-1) denotes the amplitude of the (n-1)-th frame of the N frames, and sgn denotes the sign function.

4. The method according to claim 1, wherein in the steps of obtaining voiced segments of the speech segments of the initial speech signal based on the short-time energy, obtaining unvoiced segments of the speech segments of the initial speech signal based on the short-time zero-crossing rate,

presetting a short-time energy threshold and a short-time zero-crossing rate threshold;

obtaining voiced sound segments in the voice segments based on comparing the short-time energy value and the short-time energy threshold value of each voice segment;

and acquiring unvoiced segments in the voice segments based on the comparison of the short-time zero-crossing rate value and the short-time zero-crossing rate threshold of each voice segment.

5. The method as claimed in claim 1, wherein the step of framing the pre-emphasized effective speech segment based on time to obtain a plurality of frame segments, and calculating mel-frequency cepstrum coefficients corresponding to each frame segment comprises:

in the step of framing the pre-emphasized effective speech segment based on time to obtain a plurality of frame segments,

dividing each effective voice fragment into frame fragments of a first time length, wherein adjacent frame fragments have an overlapping section of a second time length;

windowing each frame segment based on a window function;

performing fast Fourier transform on the windowed frame segment, and converting a time domain signal into a frequency domain signal;

converting the frequency of the frequency domain signal to a Mel frequency based on a Mel filter to obtain a Mel frequency signal;

and performing inverse Fourier transform on the Mel frequency signal, and converting the Mel frequency signal into a time domain to obtain a Mel frequency cepstrum coefficient.

6. The automatic voice depression detection method based on the time delay neural network as claimed in any one of claims 1 to 5, wherein the feature extraction module comprises a plurality of continuous Se-Res2 modules, each Se-Res2 module is provided with a Res2Net layer for convolution processing, and the feature extraction module adopts layered residual connection to extract frame-level features; the feature aggregation module comprises an attention mechanism layer, and the mean value and the variance corresponding to each feature vector are calculated based on the attention mechanism; the classification module comprises a full connection layer and a Softmax layer which are connected in sequence, and probability parameters are output by the Softmax layer.

7. The method for automatically detecting speech depression based on time delay neural network as claimed in claim 6, wherein in the step of extracting frame-level features by the feature extraction module using hierarchical residual connection:

the Mel frequency cepstrum coefficients are resized by a one-dimensional convolution into a first feature map, and the first feature map is input into the Se-Res2 modules; each Se-Res2 module evenly divides its input data into four feature sub-graphs, convolves the feature sub-graphs respectively, concatenates the convolved feature sub-graphs, and passes the concatenated feature map through a one-dimensional convolution again to obtain the output of the hierarchical residual convolution; the last Se-Res2 module outputs a second feature map; and the feature sub-graphs are convolved according to the following formula:

y_i denotes the convolved feature sub-graph, i denotes the index of the feature sub-graph, y_{i-1} denotes the convolved feature sub-graph of the (i-1)-th feature sub-graph x_{i-1}, and K_i denotes the 3x3 convolution corresponding to the i-th feature sub-graph x_i.

8. The automatic voice depression detection method based on the time delay neural network as claimed in claim 7, wherein the step of outputting the second feature map by the last Se-Res2 module further comprises adjusting the second feature map output by the last Se-Res2 module based on a preset compressed excitation module, comprising the steps of,

applying the weight factors obtained by the preset compressed excitation module to the output feature map of each Se-Res2 module, wherein the feature maps weighted by the weight factors include the second feature map output by the last Se-Res2 module, to obtain the adjusted second feature map;

obtaining a weight factor based on a preset compression excitation module according to the following formula:

s = σ₁(W₂ f₁(W₁z + b₁) + b₂)

z is the channel descriptor, R denotes the total number of frames of the first feature map, r denotes the r-th of the R frames, γ_r denotes the feature vector of the r-th frame of the first feature map, W₁, W₂, b₁ and b₂ are the parameters of the two fully connected layers, f₁ is the ReLU activation function, σ₁ is the sigmoid activation function, and s denotes the weight factor.

9. The method for automatically detecting the speech depression based on the time-delay neural network as claimed in claim 1, wherein in the step of calculating the mean and the variance corresponding to each feature vector based on the attention mechanism:

calculating a scaling factor corresponding to each feature vector according to the following formula, and normalizing;

e_t = v^T f₂(Wh_t + b) + k;

e_t denotes the attention score of the t-th frame segment, f₂ denotes a non-linear activation function, W denotes a weight parameter, h_t denotes the feature vector of the t-th frame segment, b denotes a bias parameter, v^T and k are preset parameters learned by the linear layer, α_t denotes the attention score normalized by softmax, and T denotes the total number of frame segments;

calculating the mean value and the variance corresponding to each feature vector based on the scaling factors according to the following formula;

μ denotes the mean, σ₂ denotes the variance, and t denotes the t-th frame segment.

10. An apparatus for automatic voice depression detection based on a time-delay neural network, the apparatus comprising a computer device, the computer device comprising a processor and a memory, the memory having stored therein computer instructions, the processor being configured to execute the computer instructions stored in the memory, the apparatus implementing the steps of the method according to any one of claims 1-9 when the computer instructions are executed by the processor.