CN114970808A - Quantization method and device of neural network, storage medium and processor - Google Patents


Publication number
CN114970808A
Authority
CN
China
Prior art keywords
data
target
neural network
target data
preset number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210428098.5A
Other languages
Chinese (zh)
Inventor
方菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou C Sky Microsystems Co Ltd
Original Assignee
Pingtouge Shanghai Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingtouge Shanghai Semiconductor Co Ltd
Priority to CN202210428098.5A
Publication of CN114970808A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a quantization method and device of a neural network, a storage medium and a processor. The method includes: obtaining a target scaling factor of a target neural network; converting floating-point data in the target neural network into first target data of a first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data. The invention solves the technical problem in the related art that, when a neural network is quantized, the quantization parameters are a set of numerical values generated by statistics, resulting in relatively low quantization precision.

Description

Neural network quantization method and device, storage medium and processor
Technical Field
The invention relates to the technical field of information processing, in particular to a quantization method and device of a neural network, a storage medium and a processor.
Background
Neural network quantization converts a floating-point neural network into a fixed-point neural network that matches hardware supporting fixed-point calculation, thereby improving storage and computational efficiency. Quantization is very common, and often necessary, when neural networks are used on devices. There are two main types of quantization methods: quantization during training and post-training quantization. The former requires complete training data and a complete training flow, and reduces the precision loss after quantization by introducing the quantization loss into training. The latter does not need a complete training process; instead, it converts an existing model to fixed point with a model quantization tool. At present, post-training quantization in the industry can be divided into the following categories:
1. Half-precision floating-point quantization, typically used for acceleration on GPUs (not discussed in the context of this application).
2. Mixed-precision quantization (only the weights are quantized), as shown in FIG. 1. This reduces the model size, but because the activation (the activation value output by the previous layer of the neural network) is floating point, quantization (quant) and dequantization must run in real time, resulting in low computational efficiency.
3. Full integer quantization, as shown in FIG. 2, in which both the weights and the activations are fixed point, so the computational efficiency is improved while the model size is reduced. However, it requires a calibration (verification) process, that is, part of the data is needed to collect statistics on the range of the activations, so the overall accuracy is not as high as that of mixed-precision quantization.
In the related art, the quantization parameters used when quantizing a neural network are a group of numerical values generated by statistics, which results in low quantization precision. No effective solution to this problem has been proposed so far.
Disclosure of Invention
The embodiments of the invention provide a quantization method and device of a neural network, a storage medium and a processor, to at least solve the technical problem in the related art that quantization precision is low because the quantization parameters used when quantizing a neural network are a group of statistically generated values.
According to an aspect of an embodiment of the present invention, there is provided a quantization method of a neural network, including: obtaining a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits; converting floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.
Further, obtaining the target scaling factor of the target neural network includes: establishing the first lookup table according to the value range of the fixed-point data of the first preset number of bits; determining a first quantization parameter in the first lookup table according to the floating-point data; and obtaining the target scaling factor of the target neural network according to the first quantization parameter.
Further, processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain the second target data of the second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.
Further, obtaining the bias data of the target neural network and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.
Further, before dynamically shifting the second target data to obtain third target data of the first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Further, before dynamically shifting the second target data to obtain the third target data of the first preset number of bits, the method further includes: establishing a second lookup table according to the value range of fixed-point data of the second preset number of bits; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.
Further, dynamically shifting the second target data to obtain the third target data of the first preset number of bits includes: determining, according to the target quantization parameter, a starting point position for dynamically shifting the second target data; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain the third target data of the first preset number of bits.
According to another aspect of the embodiments of the present invention, there is also provided a quantization method of a neural network, including: obtaining a target scaling factor of a target neural network sent by a client, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits; converting, in a server, floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data; and returning the third target data to the client.
According to another aspect of the embodiments of the present invention, there is also provided a quantization apparatus of a neural network, including: a first obtaining unit, configured to obtain a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits; a conversion unit, configured to convert floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; a processing unit, configured to process the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and a shifting unit, configured to dynamically shift the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.
Further, the first obtaining unit includes: an establishing module, configured to establish the first lookup table according to the value range of the fixed-point data of the first preset number of bits; a first determining module, configured to determine a first quantization parameter in the first lookup table according to the floating-point data; and an obtaining module, configured to obtain the target scaling factor of the target neural network according to the first quantization parameter.
Further, the processing unit includes: a first processing module, configured to perform dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and a second processing module, configured to obtain the bias data of the target neural network and process the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.
Further, the second processing module includes: a first processing submodule, configured to perform shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and a second processing submodule, configured to add the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.
Further, the apparatus further includes: a second obtaining unit, configured to obtain a second quantization parameter corresponding to the weight data before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits; and an adding unit, configured to add the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Further, the apparatus further includes: an establishing unit, configured to establish a second lookup table according to the value range of fixed-point data of the second preset number of bits before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits; and a query unit, configured to obtain a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.
Further, the shifting unit includes: a second determining module, configured to determine, according to the target quantization parameter, a starting point position for dynamically shifting the second target data; and a shifting module, configured to dynamically shift the second target data according to the starting point position and the third quantization parameter to obtain the third target data of the first preset number of bits.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the neural network quantization method described in any one of the above.
According to another aspect of the embodiments of the present invention, there is also provided a processor for executing a program, where the program executes to perform the quantization method of the neural network described in any one of the above.
In the embodiment of the invention, a target scaling factor of a target neural network is obtained, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits; floating-point data in the target neural network is converted into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; the first target data is processed according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and the second target data is dynamically shifted to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data. This solves the technical problem in the related art that quantization precision is low because the quantization parameters used when quantizing a neural network are a group of statistically generated values. Because the target scaling factor is obtained from the first lookup table, the floating-point data is converted into the first target data by the target scaling factor, the first target data is converted into the second target data by the weight data and the bias data, and the second target data is dynamically shifted to obtain the third target data, the effect of improving quantization precision is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flowchart of mixed-precision quantization in the prior art;
FIG. 2 is a flowchart of full integer quantization in the prior art;
FIG. 3 is a block diagram of a hardware configuration of a computer terminal according to an embodiment of the present invention;
FIG. 4 is a flowchart of a quantization method of a neural network according to an embodiment of the present invention;
FIG. 5 is a first flowchart of an optional quantization method of a neural network according to an embodiment of the present invention;
FIG. 6 is a second flowchart of an optional quantization method of a neural network according to the first embodiment of the present invention;
FIG. 7 is a computational flowchart of full integer quantization in the prior art;
FIG. 8 is a diagram comparing full integer quantization with the quantization method of a neural network according to the first embodiment of the present invention;
FIG. 9 is a flowchart of a quantization method of a neural network according to a second embodiment of the present invention;
FIG. 10 is a diagram of a quantization apparatus of a neural network according to a third embodiment of the present invention;
FIG. 11 is a block diagram of an optional computer terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, a method embodiment of a quantization method of a neural network is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.
The method provided by the first embodiment of the present invention may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 3 shows a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing the quantization method of the neural network. As shown in FIG. 3, the computer terminal 10 (or mobile device 10) may include one or more processors (shown as 102a, 102b, ..., 102n in the figure), which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 3 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 3, or have a different configuration than shown in FIG. 3.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the invention, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the quantization method of the neural network in the embodiment of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the quantization method of the neural network. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located from the processor, which may be connected to the computer terminal 10 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
Under the above operating environment, the present application provides a quantization method of a neural network as shown in FIG. 4. FIG. 4 is a flowchart of a quantization method of a neural network according to a first embodiment of the present invention.
Step S401, a target scaling factor of the target neural network is obtained, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits.
Specifically, the 32-bit floating-point data in the neural network is generally converted into 8-bit fixed-point data. A first lookup table is therefore determined according to the value range of 8-bit (i.e., the first preset number of bits) fixed-point data, and the target scaling factor is obtained from the first lookup table. The target scaling factor may take the form 2^n, that is, a power of 2. With a scaling factor of the form 2^n, the conversion of 32-bit floating-point data into 8-bit fixed-point data can be realized by shifting, that is, multiplication is replaced by shifting, which is more hardware-friendly and gives higher execution efficiency.
Step S402, floating-point data in the target neural network is converted into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data.
Specifically, 32-bit floating-point data is converted into 8-bit fixed-point data (i.e., the first target data described above) according to the target scaling factor.
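As a minimal sketch of steps S401 to S402, the following Python shows how a power-of-two scaling factor could turn floating-point values into 8-bit fixed-point values. The function names `quantize_pow2` and `dequantize_pow2`, the convention q = round(x * 2^n), and the two's-complement clamping range [-128, 127] are assumptions made for illustration; the text above only states that the scaling factor has the form 2^n.

```python
def quantize_pow2(values, n, bits=8):
    """Quantize floats to signed `bits`-bit integers with scale 2**n.

    Assumed convention: q = round(x * 2**n), clamped to the
    two's-complement range of a `bits`-bit integer.
    """
    qmin = -(1 << (bits - 1))
    qmax = (1 << (bits - 1)) - 1
    scale = 2.0 ** n
    return [max(qmin, min(qmax, int(round(v * scale)))) for v in values]


def dequantize_pow2(q_values, n):
    """Recover approximate floats from fixed-point values with scale 2**n."""
    return [q / (2.0 ** n) for q in q_values]


x = [0.5, -0.25, 1.5]
x_q = quantize_pow2(x, n=6)          # each value is multiplied by 2**6 = 64
x_back = dequantize_pow2(x_q, n=6)   # exact here because the inputs are multiples of 2**-6
```

Because the scale is a power of two, rescaling an already-integer value by 2^k reduces to a left or right shift, which is the property the description relies on.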
Step S403, processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, where the weight data is quantized fixed-point data with a first preset number of bits.
Specifically, the weight data of the target neural network is obtained; the weight data is 8-bit fixed-point data that has been quantized. The bias data of the target neural network is obtained; the bias data is 32-bit fixed-point data. The first target data is processed with the weight data and the bias data to obtain 32-bit fixed-point data (i.e., the second target data).
Step S404, dynamically shifting the second target data to obtain third target data with a first preset number of bits, where the third target data is quantized fixed-point data.
Specifically, the second target data is 32-bit fixed point data, and therefore the second target data is changed to 8-bit fixed point data (i.e., the third target data) by dynamically shifting the second target data.
Through steps S401 to S404, the target scaling factor is obtained from the first lookup table, the floating-point data is converted into the first target data by the target scaling factor, the first target data is converted into the second target data by the weight data and the bias data, and the second target data is then dynamically shifted to obtain the third target data, which improves the calculation efficiency and the quantization precision.
It should be noted that the target neural network may include a plurality of layers. When the layer being processed is not the first layer of the neural network, steps S403 to S404 may be performed directly: the first layer has already produced 8-bit fixed-point data through steps S401 to S404, so steps S401 to S402 need not be performed when processing the subsequent layers.
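The layer-by-layer flow can be sketched as follows, with the floating-point input quantized only once and every later layer consuming 8-bit data directly. This is an illustrative sketch, not the patented implementation: the helper names `quantize` and `fixed_point_layer`, the example weights, the assumption that the bias is stored at the weight scale 2^n_w, the assumption that n_x is non-negative, and the rule "shift right until the largest magnitude fits in 8 bits" are all hypothetical choices.

```python
def quantize(values, n):
    """S401-S402 (first layer only): float -> 8-bit fixed point, scale 2**n."""
    return [max(-128, min(127, int(round(v * 2.0 ** n)))) for v in values]


def fixed_point_layer(x_q, n_x, w_q, n_w, b_q):
    """S403-S404 for one layer; returns 8-bit outputs and their exponent.

    Assumes the bias b_q is stored at scale 2**n_w and that n_x >= 0.
    """
    # S403: 8-bit products accumulate at scale 2**(n_x + n_w);
    # the bias is shifted left by n_x to reach the same scale.
    acc = [sum(w * x for w, x in zip(row, x_q)) + (b << n_x)
           for row, b in zip(w_q, b_q)]
    # S404: shift right until the largest magnitude fits in 8 bits.
    peak = max(abs(a) for a in acc)
    s = 0
    while (peak >> s) > 127:
        s += 1
    return [a >> s for a in acc], n_x + n_w - s


n_x = 6
x_q = quantize([0.5, -0.25], n_x)                 # only the first layer quantizes floats
h_q, n_h = fixed_point_layer(x_q, n_x, [[32, 64], [-64, 32]], 7, [16, 0])
y_q, n_y = fixed_point_layer(h_q, n_h, [[64, 64]], 7, [0])  # S403-S404 only
```

With these example numbers the weights represent 0.25, 0.5, -0.5, 0.25 and 0.5, 0.5 at scale 2^7, and the final output `y_q[0] / 2**n_y` equals the floating-point result -0.09375.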
Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, obtaining the target scaling factor of the target neural network includes: establishing a first lookup table according to the value range of the fixed-point data of the first preset number of bits; determining a first quantization parameter in the first lookup table according to the floating-point data; and obtaining the target scaling factor of the target neural network according to the first quantization parameter.
Specifically, the target scaling factor is 2^n. First, a first lookup table is established according to the value range of 8-bit fixed-point data, where the value range of the 8-bit fixed-point data is [-127, 128]. The first lookup table is then queried according to the floating-point data of the neural network to determine the value of n (i.e., the first quantization parameter), and the target scaling factor 2^n is calculated from the value of n. Through the above steps, the first quantization parameter can be obtained quickly from the first lookup table, and the target scaling factor is then obtained directly from the first quantization parameter, which improves the calculation efficiency and the quantization precision.
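The description does not spell out the contents of the first lookup table, so the following Python is only one plausible reading: each candidate exponent n is paired with the largest floating-point magnitude that still fits in 8 bits at scale 2^n, and the query returns the largest n that covers the data. The names `build_lookup_table` and `lookup_n`, the candidate range of n, and the limit `qmax = 127` are assumptions for illustration.

```python
def build_lookup_table(bits=8, n_min=-8, n_max=16):
    """Pair each candidate exponent n with the largest magnitude it can hold.

    Entries are ordered from the finest resolution (largest n) to the
    coarsest, so the first match is the most precise usable exponent.
    """
    qmax = (1 << (bits - 1)) - 1   # 127 for 8 bits (assumed limit)
    return [(n, qmax / (2.0 ** n)) for n in range(n_max, n_min - 1, -1)]


def lookup_n(table, values):
    """Return the largest n such that max(|x|) * 2**n still fits in range."""
    peak = max(abs(v) for v in values)
    for n, limit in table:
        if peak <= limit:
            return n
    return table[-1][0]            # data exceeds every entry: use coarsest n


table = build_lookup_table()
n = lookup_n(table, [0.5, -1.9, 0.3])   # 1.9 * 2**6 = 121.6 fits, 1.9 * 2**7 does not
scale = 2.0 ** n                        # target scaling factor 2**n
```

Building the table once and scanning it replaces a per-tensor computation of the scale, which matches the stated goal of obtaining the first quantization parameter quickly.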
Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain the second target data of the second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.
Specifically, the weight data of the target neural network is obtained; the weight data is quantized 8-bit fixed-point data. Dot product processing is performed on the weight data and the first target data to obtain the processed 32-bit first target data. The bias data of the target neural network is then obtained, and the processed first target data is processed according to the bias data to obtain the 32-bit second target data. The above steps ensure the accuracy of the floating-point data quantization.
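The widening from 8 bits to 32 bits in the dot product can be illustrated with a short sketch. Python integers are unbounded, so the range checks below only emulate the 8-bit operands and the 32-bit accumulator; the function name `dot_int8` and the error handling are hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def dot_int8(weights, activations):
    """Dot product of 8-bit operands accumulated in an emulated 32-bit register."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0
    for w, x in zip(weights, activations):
        if not (INT8_MIN <= w <= INT8_MAX and INT8_MIN <= x <= INT8_MAX):
            raise ValueError("operand outside the 8-bit range")
        acc += w * x   # each product needs up to 16 bits; the sum needs more
    if not INT32_MIN <= acc <= INT32_MAX:
        raise OverflowError("accumulator outside the 32-bit range")
    return acc


acc = dot_int8([32, -64], [32, 32])   # 1024 - 2048
```

A single product of two 8-bit values already exceeds the 8-bit range, which is why the processed first target data is kept at the wider second preset number of bits until the final shift.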
Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data of a second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.
Specifically, the bias data must be processed first so that second target data meeting the requirement can be obtained from it. Because the adopted target scaling factor is 2^n, the bias data only needs to be shifted according to the value of n (i.e., the first quantization parameter). The shifted bias data is then added to the processed first target data to obtain the 32-bit (i.e., the second preset number of bits) second target data. Adding the bias data to the processed first target data further improves the quantization accuracy.
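The bias alignment can be sketched in Python as follows. The text only states that the bias is shifted according to n; the assumption made here is that the bias is stored as an integer at the same scale as the weights, so that shifting it left by n aligns it with the dot product result. The helper names and sample values are hypothetical.

```python
def shift_by(value, n):
    """Shift left for n >= 0, arithmetic shift right for n < 0."""
    return value << n if n >= 0 else value >> -n

def add_shifted_bias(acc, bias_q, n):
    """Align the integer bias with the accumulator by shifting, then add."""
    return acc + shift_by(bias_q, n)

acc = -1792    # processed 32-bit first target data (illustrative)
bias_q = 8     # integer bias, assumed to be stored at the weight scale
n = 5          # first quantization parameter (illustrative)
second_target = add_shifted_bias(acc, bias_q, n)
print(second_target)
```

Because the scaling factor is a power of two, this alignment needs no multiplication, only a shift and an addition.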
Optionally, in the quantization method of the neural network provided in the first embodiment of the present invention, before dynamically shifting the second target data to obtain the third target data with the first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Specifically, since the weight data of the neural network is fixed-point data after quantization, it is necessary to obtain a second quantization parameter of the weight data when quantization is performed, and then add the first quantization parameter and the second quantization parameter to obtain a target quantization parameter. The target quantization parameter is used for determining the position of the starting point for dynamically shifting the second target data, so that the dynamic shifting work of the second target data can be quickly realized.
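Why the two parameters are added can be checked with a small Python example, under the assumption (not stated explicitly in the text) that both the activations and the weights use power-of-two scaling factors 2^n1 and 2^n2. A product of two scaled values then carries the factor 2^(n1+n2), so the exponents add. The names and values below are illustrative.

```python
def target_quantization_parameter(n_first, n_second):
    """Exponent of the accumulator scale: 2**n_first * 2**n_second == 2**(n_first + n_second)."""
    return n_first + n_second

n_first, n_second = 5, 6                 # illustrative exponents
x = [0.5, -1.25, 3.0]                    # real-valued activations
w = [1.0, 0.5, -0.25]                    # real-valued weights
x_q = [round(v * 2 ** n_first) for v in x]
w_q = [round(v * 2 ** n_second) for v in w]

acc = sum(a * b for a, b in zip(x_q, w_q))
n_target = target_quantization_parameter(n_first, n_second)

# Dividing the integer accumulator by 2**n_target recovers the real dot product.
recovered = acc / 2.0 ** n_target
reference = sum(a * b for a, b in zip(x, w))
print(n_target, recovered, reference)
```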
Optionally, in the quantization method of the neural network provided in the first embodiment of the present invention, before dynamically shifting the second target data to obtain the third target data with the first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.
Specifically, a second lookup table is constructed according to the value range of 32-bit fixed point data; its entries are the values Int8_max * 2^n (n = 8, …, 32). When the 32-bit second target data is dynamically shifted to obtain the 8-bit third target data, the number of positions by which the second target data needs to be shifted (i.e., the third quantization parameter mentioned above) can then be found quickly by querying the second lookup table. These steps further improve the calculation efficiency and the quantization precision during quantization.
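One way to picture the second lookup table is as a sorted list of thresholds of the form Int8_max * 2^k, searched with the magnitude of the 32-bit value. The Python sketch below follows that reading; the index range of the table (k from 0 to 25), the use of a binary search, and the function names are assumptions for illustration and do not reproduce the exact table of the text.

```python
import bisect

INT8_MAX = 127

def build_second_lookup_table(max_shift=25):
    """Hypothetical table: entry k holds INT8_MAX * 2**k, in ascending order."""
    return [INT8_MAX << k for k in range(max_shift + 1)]

def find_shift_count(value, table):
    """Index of the first entry that is not smaller than |value|.

    Shifting right by this count is sufficient to bring value into the int8 range.
    """
    k = bisect.bisect_left(table, abs(value))
    if k == len(table):
        raise OverflowError("value is not covered by the lookup table")
    return k

table = build_second_lookup_table()
second_target = -1536                      # illustrative 32-bit value
third_parameter = find_shift_count(second_target, table)
print(third_parameter, second_target >> third_parameter)
```

Searching a precomputed table avoids computing a logarithm or looping over shifts for each value.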
Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.
Specifically, the starting point position is determined by the target quantization parameter, and the 32-bit second target data is then shifted according to the starting point position and the third quantization parameter, so that the 8-bit third target data can be obtained accurately.
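The dynamic shift can be sketched in Python under one possible interpretation: the "starting point position" is taken to be the binary point position of the 32-bit value, i.e. its number of fractional bits, given by the target quantization parameter. Shifting right by the third quantization parameter then yields an 8-bit value whose binary point has moved by the same amount. This interpretation and the function name are assumptions.

```python
INT8_MIN, INT8_MAX = -128, 127

def dynamic_shift(second_target, target_n, shift_count):
    """Shift a 32-bit value down to 8 bits and track the new binary point.

    target_n:    assumed number of fractional bits of second_target
    shift_count: number of right shifts (third quantization parameter)
    """
    third_target = second_target >> shift_count   # arithmetic shift in Python
    if not INT8_MIN <= third_target <= INT8_MAX:
        raise OverflowError("shift count too small for an 8-bit result")
    return third_target, target_n - shift_count

third_target, out_n = dynamic_shift(-1536, target_n=11, shift_count=4)
print(third_target, out_n, third_target / 2.0 ** out_n)
```

In this example the 32-bit value -1536 with 11 fractional bits and the 8-bit value -96 with 7 fractional bits both represent -0.75.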
In the quantization method of the neural network provided in the first embodiment of the present invention, a target scaling factor of a target neural network is obtained, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset number of bits; floating point data in the target neural network is converted into first target data of the first preset number of bits according to the target scaling factor, where the first target data is fixed point data; the first target data is processed according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of bits, where the weight data is quantized fixed point data of the first preset number of bits; and the second target data is dynamically shifted to obtain third target data of the first preset number of bits, where the third target data is quantized fixed point data. This solves the technical problem in the related art that quantization precision is low because the quantization parameters used when quantizing a neural network are a group of statistically generated values. Because the target scaling factor is obtained from the first lookup table, the floating point data is converted into the first target data through the target scaling factor, the first target data is converted into the second target data through the weight data and the bias data, and the second target data is then dynamically shifted to obtain the third target data, the effect of improving quantization precision is achieved.
Fig. 5 is a first flowchart of an optional neural network quantization method according to an embodiment of the present disclosure. 8-bit first target data is obtained through the target scaling factor; a dot product (Dot Product) of the 8-bit first target data and the weight data yields the processed 32-bit first target data; the shifted 32-bit bias data is then added to the processed 32-bit first target data to obtain the 32-bit second target data. The 32-bit second target data is shifted to obtain the 8-bit third target data. Fig. 6 is a second flowchart of an optional neural network quantization method according to the first embodiment of the present application, from which the calculation processes required to quantize a target neural network can be seen clearly. Fig. 7 is a flowchart of the computation of full integer quantization in the prior art. Comparing the two flowcharts shows that the quantization method provided by the present application computes more efficiently. Fig. 8 compares the two quantization methods: when the floating point data of the target neural network has shape [M, N] and the bias data (bias) has shape [1, N], the quantization method of the neural network provided by the present application requires M + N fewer multiplications (mul) and N fewer additions (Add) than full integer quantization in the prior art, so the present application improves the calculation efficiency of neural network quantization.
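The flow of Fig. 5 can be traced end to end with a compact Python sketch for a single output value. It is only an illustration under stated assumptions: direct searches stand in for the two lookup tables, the bias is assumed to be stored as an integer at the weight scale, the weight exponent n_w is assumed to be known, and all names and sample values are hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127

def shift_by(value, n):
    """Shift left for n >= 0, arithmetic shift right for n < 0."""
    return value << n if n >= 0 else value >> -n

def pow2_exponent(values, n_start=16):
    """Largest n <= n_start with max|v| * 2**n <= INT8_MAX (stands in for the first lookup table)."""
    max_abs = max(abs(v) for v in values)
    n = n_start
    while max_abs * 2.0 ** n > INT8_MAX:
        n -= 1
    return n

def quantized_output(x, w_q, n_w, bias_q):
    n_x = pow2_exponent(x)                                  # first quantization parameter
    x_q = [max(INT8_MIN, min(INT8_MAX, round(v * 2.0 ** n_x))) for v in x]  # 8-bit first target data
    acc = sum(a * b for a, b in zip(x_q, w_q))              # processed 32-bit first target data
    acc += shift_by(bias_q, n_x)                            # 32-bit second target data
    target_n = n_x + n_w                                    # target quantization parameter
    k = 0                                                   # third quantization parameter
    while abs(acc) > (INT8_MAX << k):                       # stands in for the second lookup table
        k += 1
    return acc >> k, target_n - k                           # 8-bit third target data and its exponent

x = [0.5, -1.25, 3.0]
w_q, n_w = [64, 32, -16], 6      # weights 1.0, 0.5, -0.25 at scale 2**6
bias_q = 8                       # bias 0.125 at scale 2**6
y_q, y_n = quantized_output(x, w_q, n_w, bias_q)
y = y_q / 2.0 ** y_n
print(y_q, y_n, y)
```

For these inputs the floating point reference is 0.5 * 1.0 - 1.25 * 0.5 - 3.0 * 0.25 + 0.125 = -0.75, which the integer path reproduces exactly; in general some rounding error is expected.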
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Under the above operating environment, the present application provides a quantization method of a neural network as shown in Fig. 9. Fig. 9 is a flowchart of a quantization method of a neural network according to a second embodiment of the present invention.
Step S901, obtaining a target scaling factor of a target neural network sent by a client, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed point data of a first preset number of bits.
Specifically, the target scaling factor of the target neural network is sent to the processor through the client, and the quantization process is completed through the processor.
Step S902, floating point data in a target neural network is converted into first target data with a first preset number of bits in a server according to a target scaling factor, wherein the first target data is a fixed point number; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.
Specifically, floating point data is converted into 8-bit first target data through a target scaling factor in the server, then the first target data is converted into 32-bit second target data through weight data and offset data, and then the second target data is dynamically shifted to obtain 8-bit third target data.
Step S903, returns the third target data to the client.
In the server, the specific method for quantizing the floating point data of the neural network is the same as that in embodiment 1, and is not described herein again.
The server quantizes the neural network, so that the efficiency of quantizing the neural network is improved, and the storage pressure of the local terminal is reduced.
Example 3
According to an embodiment of the present invention, there is also provided a quantization apparatus of a neural network for implementing the above quantization method of the neural network. As shown in Fig. 10, the apparatus includes: a first obtaining unit 1001, a conversion unit 1002, a processing unit 1003 and a shifting unit 1004.
Specifically, the first obtaining unit 1001 is configured to obtain a target scaling factor of the target neural network, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed-point data of a first preset number of bits.
The conversion unit 1002 is configured to convert floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, where the first target data is fixed point data.
The processing unit 1003 is configured to process the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, where the weight data is quantized fixed-point data with a first preset number of bits.
The shifting unit 1004 is configured to dynamically shift the second target data to obtain third target data with a first preset number of bits, where the third target data is quantized fixed-point data.
In the quantization apparatus of a neural network provided in the third embodiment of the present invention, the first obtaining unit 1001 obtains a target scaling factor of a target neural network, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset number of bits; the conversion unit 1002 converts floating point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, where the first target data is fixed point data; the processing unit 1003 processes the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of bits, where the weight data is quantized fixed point data of the first preset number of bits; and the shifting unit 1004 dynamically shifts the second target data to obtain third target data of the first preset number of bits, where the third target data is quantized fixed point data. This solves the technical problem in the related art that quantization precision is low because the quantization parameters used when quantizing a neural network are a group of statistically generated values.
Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the first obtaining unit 1001 includes: an establishing module, configured to establish a first lookup table according to the value range of the fixed point data of the first preset number of bits; a first determining module, configured to determine a first quantization parameter in the first lookup table based on the floating point data; and an obtaining module, configured to obtain a target scaling factor of the target neural network according to the first quantization parameter.
Optionally, in the apparatus for quantizing a neural network provided in the third embodiment of the present invention, the processing unit 1003 includes: the first processing module is used for performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the data bit number of the processed first target data is a second preset bit number; and the second processing module is used for acquiring the offset data of the target neural network and processing the processed first target data according to the offset data to obtain second target data with a second preset digit.
Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the second processing module includes: a first processing submodule, configured to perform shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and a second processing submodule, configured to add the processed first target data and the processed bias data to obtain second target data of a second preset number of bits.
Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the apparatus further includes: a second obtaining unit, configured to obtain a second quantization parameter corresponding to the weight data before the second target data is dynamically shifted to obtain third target data of a first preset number of bits; and an adding unit, configured to add the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the apparatus further includes: an establishing unit, configured to establish a second lookup table according to the value range of the fixed point data of the second preset number of bits before the second target data is dynamically shifted to obtain third target data of the first preset number of bits; and a query unit, configured to obtain a third quantization parameter in the second lookup table according to the second target data, where the third quantization parameter represents the number of times the second target data is dynamically shifted.
Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the shifting unit 1004 includes: the second determining module is used for determining the starting point position when the second target data is dynamically shifted according to the target quantization parameter; and the shifting module is used for dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.
It should be noted here that the first obtaining unit 1001, the conversion unit 1002, the processing unit 1003 and the shifting unit 1004 described above correspond to steps S401 to S404 in embodiment 1; the four units share the implementation examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should be noted that the above units may run in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
Example 4
The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the quantization method of the neural network: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: obtaining a target scaling factor for a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the second target data comprises: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the data bit number of the processed first target data is a second preset bit number; and acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset digit.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: acquiring the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data with the second preset number of bits.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.
Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.
Fig. 11 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in Fig. 11, the computer terminal 10 may include one or more processors (only one is shown in Fig. 11) and a memory.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the neural network quantization method and apparatus in the embodiments of the present invention, and the processor executes various functional applications and data processing by operating the software programs and modules stored in the memory, so as to implement the above-described neural network quantization method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memories may further include a memory located remotely from the processor, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.
Optionally, the processor may further execute the program code of the following steps: obtaining a target scaling factor for a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.
Optionally, the processor may further execute the program code of the following steps: processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the second target data comprises: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the data bit number of the processed first target data is a second preset bit number; and acquiring offset data of the target neural network, and processing the processed first target data according to the offset data to obtain second target data with a second preset digit.
Optionally, the processor may further execute the program code of the following steps: acquiring offset data of a target neural network, and processing the processed first target data according to the offset data to obtain second target data with a second preset digit, wherein the second target data comprises: shifting the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed offset data to obtain second target data with a second preset digit.
Optionally, the processor may further execute the program code of the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Optionally, the processor may further execute the program code of the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.
Optionally, the processor may further execute the program code of the following steps: dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.
The embodiment of the invention provides a quantization scheme of a neural network. The method comprises the steps of converting floating point data into first target data through a target scaling factor, converting the first target data into second target data through weight data and bias data, and then dynamically shifting the second target data to obtain third target data, so that the technical problem that quantization precision is low due to the fact that quantization parameters are a group of numerical values generated by statistics when a neural network is quantized in the related technology is solved.
It can be understood by those skilled in the art that the structure shown in Fig. 11 is only illustrative, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 11 does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in Fig. 11, or have a different configuration from that shown in Fig. 11.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present invention also provide a computer-readable storage medium. Optionally, in this embodiment, the storage medium may be configured to store program codes executed by the quantization method of the neural network provided in the first embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.
Optionally, the storage medium is further configured to store program code for performing the following steps, in which acquiring the target scaling factor of the target neural network includes: establishing the first lookup table according to the value range of the fixed-point data of the first preset number of bits; determining a first quantization parameter in the first lookup table according to the floating-point data; and acquiring the target scaling factor of the target neural network according to the first quantization parameter.
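One way to read the lookup-table step is sketched below. The table layout is an assumption made for illustration: each candidate fractional-bit count q is mapped to the largest magnitude that a signed 8-bit fixed-point number can represent with q fractional bits, the first quantization parameter is taken to be the largest q that still covers the data, and the scaling factor is taken to be 2^q. The function names are hypothetical.

```python
def build_lookup_table(bits=8):
    """Map each fractional-bit count q to the largest representable magnitude.

    Assumption: for signed `bits`-bit fixed point with q fractional bits,
    the largest positive value is (2**(bits-1) - 1) / 2**q.
    """
    max_int = (1 << (bits - 1)) - 1
    return {q: max_int / (1 << q) for q in range(bits)}


def find_quantization_parameter(values, table):
    """Pick the largest q whose range still covers max(|values|)."""
    peak = max(abs(v) for v in values)
    candidates = [q for q, limit in table.items() if peak <= limit]
    # Fall back to q = 0 (no fractional bits) if the data exceeds every range.
    return max(candidates) if candidates else 0


def scaling_factor(q):
    """Assumed power-of-two scaling factor derived from q."""
    return 1 << q


table = build_lookup_table(bits=8)
q1 = find_quantization_parameter([0.5, -1.25, 3.0], table)
print(q1, scaling_factor(q1))  # 5 32
```

Choosing the largest q that avoids overflow keeps as many fractional bits as the value range allows, which is the usual motivation for range-based selection of this kind.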
Optionally, the storage medium is further configured to store program code for performing the following steps, in which processing the first target data according to the weight data and the bias data of the target neural network to obtain the second target data of the second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and acquiring the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.
Optionally, the storage medium is further configured to store program code for performing the following steps, in which acquiring the bias data of the target neural network and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.
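The dot product and bias addition can be sketched as below. Several points are assumptions made for illustration only: the first preset number of bits is taken as 8 and the second as 32, the bias is assumed to be stored with the same number of fractional bits as the weights, and the shift applied to the bias is assumed to be a left shift by the first quantization parameter so that the bias lines up with the accumulator. The function name `dot_with_bias` and all sample values are hypothetical.

```python
def dot_with_bias(x_q, w_q, bias_q, q_x, acc_bits=32):
    """Integer dot product of quantized inputs and weights, plus aligned bias.

    x_q    : inputs as fixed-point ints with q_x fractional bits
    w_q    : weights as fixed-point ints with q_w fractional bits
    bias_q : bias as a fixed-point int, assumed to have q_w fractional bits
    The products carry q_x + q_w fractional bits, so the bias is shifted
    left by q_x before the addition.
    """
    acc = sum(x * w for x, w in zip(x_q, w_q))
    aligned_bias = bias_q << q_x
    total = acc + aligned_bias
    # Python ints do not overflow, so check the accumulator width explicitly.
    if not -(1 << (acc_bits - 1)) <= total < (1 << (acc_bits - 1)):
        raise OverflowError("result does not fit in the accumulator")
    return total


# Inputs 0.5, -1.25, 3.0 with q_x = 5; weights 1.0, 0.5, -0.25 with q_w = 6;
# bias 0.5 with q_w = 6.
second_target = dot_with_bias([16, -40, 96], [64, 32, -16], 32, q_x=5)
print(second_target)            # -768
print(second_target / 2 ** 11)  # -0.375, the same as the floating-point result
```

The wider accumulator is what allows the products of two 8-bit values to be summed without losing precision before the final shift back to 8 bits.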
Optionally, the storage medium is further configured to store program code for performing the following steps before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.
Optionally, the storage medium is further configured to store program code for performing the following steps before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits: establishing a second lookup table according to the value range of fixed-point data of the second preset number of bits; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.
Optionally, the storage medium is further configured to store program code for performing the following steps, in which dynamically shifting the second target data to obtain the third target data of the first preset number of bits includes: determining, according to the target quantization parameter, a starting position for dynamically shifting the second target data; and dynamically shifting the second target data according to the starting position and the third quantization parameter to obtain the third target data of the first preset number of bits.
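The dynamic shift can be sketched as below, under one possible interpretation of the steps above. The assumptions are: the target quantization parameter (first plus second quantization parameter) is treated as the position of the binary point in the 32-bit accumulator, the second lookup table maps each shift count to the largest magnitude that fits in 8 bits after that many right shifts, and the third quantization parameter is the smallest such shift count. Function names and sample values are hypothetical.

```python
def build_shift_table(wide_bits=32, narrow_bits=8):
    """Map each shift count s to the largest magnitude that fits after >> s."""
    half_range = 1 << (narrow_bits - 1)   # 128 for 8 bits
    return {s: (half_range << s) - 1
            for s in range(wide_bits - narrow_bits + 1)}


def find_shift_count(acc_values, table):
    """Smallest shift count whose limit covers max(|acc_values|)."""
    peak = max(abs(v) for v in acc_values)
    fitting = (s for s, limit in table.items() if peak <= limit)
    return min(fitting, default=max(table))


def dynamic_shift(acc_values, q_target, table, narrow_bits=8):
    """Shift wide accumulators down to narrow fixed point.

    Returns the shifted values and the number of fractional bits they carry,
    assuming q_target is the accumulator's fractional-bit count.
    """
    lo, hi = -(1 << (narrow_bits - 1)), (1 << (narrow_bits - 1)) - 1
    s = find_shift_count(acc_values, table)
    # Arithmetic right shift (floors toward negative infinity), then saturate.
    shifted = [max(lo, min(hi, v >> s)) for v in acc_values]
    return shifted, q_target - s


table = build_shift_table()
third_target, q_out = dynamic_shift([-768, 1500], q_target=11, table=table)
print(third_target, q_out)  # [-48, 93] 7
```

In this example the accumulator value -768 with 11 fractional bits becomes -48 with 7 fractional bits, which still represents -0.375. Because the shift count depends on the data rather than being fixed in advance, the output keeps as many fractional bits as the accumulator's actual magnitude permits.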
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (11)

1. A quantization method of a neural network, characterized by comprising:
acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits;
converting floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data;
processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits;
dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.

2. The method according to claim 1, wherein acquiring the target scaling factor of the target neural network comprises:
establishing the first lookup table according to the value range of the fixed-point data of the first preset number of bits;
determining a first quantization parameter in the first lookup table according to the floating-point data;
acquiring the target scaling factor of the target neural network according to the first quantization parameter.

3. The method according to claim 2, wherein processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain the second target data of the second preset number of bits comprises:
performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits;
acquiring the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.

4. The method according to claim 3, wherein acquiring the bias data of the target neural network and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits comprises:
performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data;
adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.

5. The method according to claim 4, wherein before dynamically shifting the second target data to obtain the third target data of the first preset number of bits, the method further comprises:
acquiring a second quantization parameter corresponding to the weight data;
adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

6. The method according to claim 5, wherein before dynamically shifting the second target data to obtain the third target data of the first preset number of bits, the method further comprises:
establishing a second lookup table according to the value range of fixed-point data of the second preset number of bits;
obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.

7. The method according to claim 6, wherein dynamically shifting the second target data to obtain the third target data of the first preset number of bits comprises:
determining, according to the target quantization parameter, a starting position for dynamically shifting the second target data;
dynamically shifting the second target data according to the starting position and the third quantization parameter to obtain the third target data of the first preset number of bits.

8. A quantization method of a neural network, characterized by comprising:
acquiring a target scaling factor of a target neural network sent by a client, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits;
in a server, converting floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data;
returning the third target data to the client.

9. A quantization device of a neural network, characterized by comprising:
a first acquiring unit, configured to acquire a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed-point data of a first preset number of bits;
a conversion unit, configured to convert floating-point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data;
a processing unit, configured to process the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits;
a shifting unit, configured to dynamically shift the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.

10. A computer-readable storage medium, characterized in that the storage medium stores a program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the quantization method of a neural network according to any one of claims 1 to 7.

11. A processor, characterized in that the processor is configured to run a program, wherein when the program runs, the quantization method of a neural network according to any one of claims 1 to 7 is executed.
CN202210428098.5A 2022-04-22 2022-04-22 Quantization method and device of neural network, storage medium and processor Pending CN114970808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210428098.5A CN114970808A (en) 2022-04-22 2022-04-22 Quantization method and device of neural network, storage medium and processor

Publications (1)

Publication Number Publication Date
CN114970808A true CN114970808A (en) 2022-08-30

Family

ID=82980172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210428098.5A Pending CN114970808A (en) 2022-04-22 2022-04-22 Quantization method and device of neural network, storage medium and processor

Country Status (1)

Country Link
CN (1) CN114970808A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345939A (en) * 2017-01-25 2018-07-31 微软技术许可有限责任公司 Neural network based on fixed-point calculation
CN109002881A (en) * 2018-06-28 2018-12-14 郑州云海信息技术有限公司 The fixed point calculation method and device of deep neural network based on FPGA
CN111695671A (en) * 2019-03-12 2020-09-22 北京地平线机器人技术研发有限公司 Method and device for training neural network and electronic equipment
CN112381205A (en) * 2020-09-29 2021-02-19 北京清微智能科技有限公司 Neural network low bit quantization method
CN112712164A (en) * 2020-12-30 2021-04-27 上海熠知电子科技有限公司 Non-uniform quantization method of neural network

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20240221
    Address after: 310052 Room 201, floor 2, building 5, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province
    Applicant after: C-SKY MICROSYSTEMS Co.,Ltd.
    Country or region after: China
    Address before: 200131 floor 5, No. 366, Shangke road and No. 2, Lane 55, Chuanhe Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai
    Applicant before: Pingtouge (Shanghai) semiconductor technology Co.,Ltd.
    Country or region before: China
RJ01 Rejection of invention patent application after publication
    Application publication date: 20220830