CN114970808A

CN114970808A - Quantization method and device of neural network, storage medium and processor

Info

Publication number: CN114970808A
Application number: CN202210428098.5A
Authority: CN
Inventors: 方菲菲
Original assignee: Pingtouge Shanghai Semiconductor Co Ltd
Current assignee: Hangzhou C Sky Microsystems Co Ltd
Priority date: 2022-04-22
Filing date: 2022-04-22
Publication date: 2022-08-30

Abstract

The invention discloses a quantization method and device of a neural network, a storage medium and a processor. The method includes: obtaining a target scaling factor of a target neural network; converting floating-point data in the target neural network into first target data of a first preset number of bits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data. The invention solves the technical problem in the related art that quantization precision is relatively low because the quantization parameters used when a neural network is quantized are a set of numerical values generated by statistics.

Description

Neural network quantization method and device, storage medium and processor

Technical Field

The invention relates to the technical field of information processing, in particular to a quantization method and device of a neural network, a storage medium and a processor.

Background

Neural network quantization converts a floating-point neural network into a fixed-point neural network that is matched with hardware supporting fixed-point calculation, so that storage and calculation efficiency are improved. Neural network quantization is very common and necessary for use on devices. Two main types of quantization methods exist: quantization during training and post-training quantization. The former requires complete training data and a complete training flow, and reduces the precision loss after quantization by introducing the quantization loss into training. The latter does not need a complete training process; instead, it converts the existing model to fixed point through a model quantization tool. At present, post-training quantization in the industry can be divided into several categories:

1. Half-precision floating-point quantization, typically used for acceleration on GPUs (not discussed in the context of this application).

2. Mixed precision quantization (only the weights are quantized), as shown in FIG. 1. It can reduce the model size, but since the activation (the activation value output by the previous layer of the neural network) is floating point, quantization (quant) and dequantization must be run in real time, resulting in low computational efficiency.

3. Full integer quantization, as shown in FIG. 2, in which both the weights and the activations are fixed point, so that calculation efficiency can be improved while the size of the model is reduced. However, it requires a calibration (verification) process, i.e. part of the data is needed to collect statistics on the range of the activations, so the overall accuracy is not as high as that of mixed precision quantization.

For the problem in the related art that quantization precision is low because the quantization parameters used when a neural network is quantized are a group of numerical values generated by statistics, no effective solution has been proposed so far.

Disclosure of Invention

The embodiments of the invention provide a quantization method and device of a neural network, a storage medium and a processor, which at least solve the technical problem in the related art that quantization precision is low because the quantization parameters used when a neural network is quantized are a group of numerical values generated by statistics.

According to an aspect of an embodiment of the present invention, there is provided a quantization method of a neural network, including: obtaining a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, wherein the weight data is quantized fixed point data with a first preset number of bits; and dynamically shifting the second target data to obtain third target data of the first preset digit, wherein the third target data is quantized fixed point data.

Further, obtaining the target scaling factor of the target neural network comprises: establishing the first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in the first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.

Further, the processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of the second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the data bit number of the processed first target data is a second preset bit number; and acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset digit.

Further, obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data of the second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.

Further, before dynamically shifting the second target data to obtain third target data of the first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Further, before dynamically shifting the second target data to obtain third target data of the first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Further, dynamically shifting the second target data to obtain third target data of the first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data of the first preset digit.

According to another aspect of the embodiments of the present invention, there is also provided a quantization method of a neural network, including: obtaining a target scaling factor of a target neural network sent by a client, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset number of bits; converting, in a server, floating point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed point data of the first preset number of bits; dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed point data; and returning the third target data to the client.

According to another aspect of the embodiments of the present invention, there is also provided a quantization apparatus of a neural network, including: a first obtaining unit, configured to obtain a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset number of bits; a conversion unit, configured to convert floating point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; a processing unit, configured to process the first target data according to weight data of the target neural network and bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed point data of the first preset number of bits; and a shifting unit, configured to dynamically shift the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed point data.

Further, the first obtaining unit includes: an establishing module, configured to establish the first lookup table according to the value range of the fixed point data of the first preset number of bits; a first determining module, configured to determine a first quantization parameter in the first lookup table according to the floating point data; and an obtaining module, configured to obtain the target scaling factor of the target neural network according to the first quantization parameter.

Further, the processing unit includes: a first processing module, configured to perform dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and a second processing module, configured to obtain the bias data of the target neural network and process the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.

Further, the second processing module includes: a first processing submodule, configured to perform shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and a second processing submodule, configured to add the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.

Further, the apparatus further includes: a second obtaining unit, configured to obtain a second quantization parameter corresponding to the weight data before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits; and an adding unit, configured to add the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Further, the apparatus further includes: an establishing unit, configured to establish a second lookup table according to the value range of fixed point data of the second preset number of bits before the second target data is dynamically shifted to obtain the third target data of the first preset number of bits; and a query unit, configured to obtain a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.

Further, the shift unit includes: a second determining module, configured to determine, according to the target quantization parameter, a starting point position when the second target data is dynamically shifted; and the shifting module is used for dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data of the first preset digit.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the neural network quantization method described in any one of the above.

According to another aspect of the embodiments of the present invention, there is also provided a processor for executing a program, where the program executes to perform the quantization method of the neural network described in any one of the above.

In the embodiment of the invention, a target scaling factor of a target neural network is obtained, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data, so that the technical problem of low quantization precision caused by a group of numerical values generated by statistics of quantization parameters when a neural network is quantized in the related technology is solved. The target scaling factor is obtained from the first lookup table, floating point data are converted into first target data through the target scaling factor, then the first target data are converted into second target data through weight data and bias data, then the second target data are dynamically shifted, third target data are obtained, and the effect of improving quantization precision is achieved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of mixed precision quantization in the prior art;

FIG. 2 is a flow chart of full integer quantization in the prior art;

FIG. 3 is a block diagram of a hardware configuration of a computer terminal according to an embodiment of the present invention;

FIG. 4 is a flow chart of a quantization method of a neural network according to an embodiment of the present invention;

FIG. 5 is a first flowchart of an alternative quantization method of a neural network according to the first embodiment of the present invention;

FIG. 6 is a second flowchart of an alternative quantization method of a neural network according to the first embodiment of the present invention;

FIG. 7 is a computational flow chart of full integer quantization in the prior art;

FIG. 8 is a diagram comparing full integer quantization with the quantization method of a neural network according to the first embodiment of the present invention;

FIG. 9 is a flowchart of a quantization method of a neural network according to a second embodiment of the present invention;

FIG. 10 is a diagram of a quantization apparatus of a neural network according to a third embodiment of the present invention;

FIG. 11 is a block diagram of an alternative computer terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, an embodiment of a quantization method of a neural network is provided. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.

The method provided by the first embodiment of the present invention may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 3 shows a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing the quantization method of the neural network. As shown in FIG. 3, the computer terminal 10 (or mobile device 10) may include one or more processors (shown as 102a, 102b, …, 102n in the figure), which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 3 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 3, or have a different configuration than shown in FIG. 3.

It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the invention, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the quantization method of the neural network in the embodiment of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the quantization method of the neural network. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located from the processor, which may be connected to the computer terminal 10 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

Under the above operating environment, the present application provides a quantization method of a neural network as shown in FIG. 4. FIG. 4 is a flowchart of a quantization method of a neural network according to a first embodiment of the present invention.

Step S401, a target scaling factor of the target neural network is obtained, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit.

Specifically, 32-bit floating point data in the neural network is generally converted into 8-bit fixed point data. A first lookup table is therefore determined according to the value range of the 8-bit (i.e. the first preset number of bits) fixed point data, and the target scaling factor is obtained from the first lookup table. The target scaling factor may take the form 2^n, i.e. a power of 2. With the form 2^n, the conversion of 32-bit floating point data into 8-bit fixed point data can be realized by shifting, that is, multiplication is replaced by shifting, which is friendlier to hardware and gives higher execution efficiency.
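As a minimal sketch of why a power-of-two scaling factor is hardware friendly: for integer data, multiplying by 2^n gives the same result as shifting by n bits. The function name `scale_by_power_of_two` is hypothetical and does not come from the source; treating a negative n as a right shift is likewise an assumption made for illustration.

```python
def scale_by_power_of_two(value: int, n: int) -> int:
    """Scale an integer by 2**n using shifts only (illustrative helper).

    For n >= 0 this equals value * 2**n.
    For n < 0 this equals floor(value / 2**(-n)), because Python's >>
    is an arithmetic shift that rounds toward negative infinity.
    """
    if n >= 0:
        return value << n
    return value >> -n


# A left shift by 3 gives the same result as multiplying by 2**3.
print(scale_by_power_of_two(5, 3), 5 * 2**3)
# A right shift by 3 divides by 2**3 and rounds down.
print(scale_by_power_of_two(41, -3))
```

The rounding behaviour of the right shift (toward negative infinity) is worth keeping in mind, since it differs from round-to-nearest.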

Step S402, floating point data in the target neural network is converted into first target data with a first preset digit according to the target scaling factor, wherein the first target data is fixed point data.

Specifically, 32-bit floating-point data is converted into 8-bit fixed-point data (i.e., the first target data described above) according to the target scaling factor.
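The conversion in step S402 can be sketched as follows, assuming the exponent n has already been chosen. The name `quantize_to_int8`, the use of round-to-nearest, and the clamping to the int8 range are assumptions for illustration; the source does not specify the rounding or saturation behaviour.

```python
import math

INT8_MIN, INT8_MAX = -128, 127


def quantize_to_int8(values, n):
    """Map floats to 8-bit fixed point with scale 2**n (illustrative).

    Each value is scaled by 2**n, rounded, and clamped to [-128, 127].
    Python's round() rounds exact halves to the nearest even integer.
    """
    quantized = []
    for x in values:
        q = round(math.ldexp(x, n))  # math.ldexp(x, n) computes x * 2**n
        quantized.append(max(INT8_MIN, min(INT8_MAX, q)))
    return quantized


# With n = 6 the scale is 64; 3.0 would map to 192 and is clamped to 127.
print(quantize_to_int8([0.5, -0.25, 1.0, 3.0], 6))
```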

Step S403, processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, where the weight data is quantized fixed-point data with a first preset number of bits.

Specifically, the weight data of the target neural network is obtained, and the weight data is 8-bit fixed point data that has been quantized. The bias data of the target neural network is obtained, and the bias data is 32-bit fixed point data. The first target data is processed with the weight data and the bias data to obtain 32-bit fixed point data (i.e., the second target data).

Step S404, dynamically shifting the second target data to obtain third target data with a first preset number of bits, where the third target data is quantized fixed-point data.

Specifically, the second target data is 32-bit fixed point data, and therefore the second target data is changed to 8-bit fixed point data (i.e., the third target data) by dynamically shifting the second target data.

Through the steps S401 to S404, the target scaling factor is obtained from the first lookup table, the floating point data is converted into the first target data by the target scaling factor, then the first target data is converted into the second target data by the weight data and the offset data, and then the second target data is dynamically shifted, so that the third target data can be obtained, and the calculation efficiency and the quantization precision are improved.

It should be noted that the target neural network may include a plurality of neural network layers. When the current layer is not the first layer of the neural network, steps S403 to S404 may be performed directly: the first layer has already obtained 8-bit fixed point data through steps S401 to S404, so steps S401 to S402 need not be performed when the next layer is processed.

Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, the obtaining a target scaling factor of a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.

In particular, the target scaling factor is 2^n. First, a first lookup table is established according to the value range of the 8-bit fixed point data, wherein the value range of the 8-bit fixed point data is [-128, 127]. According to the floating point data of the neural network, the first lookup table is queried to determine the value of n (namely the first quantization parameter), and the target scaling factor 2^n is then calculated from n. Through the above steps, the first quantization parameter can be obtained quickly from the first lookup table, and the target scaling factor is then obtained directly from the first quantization parameter, so that calculation efficiency and quantization precision are improved.
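The source does not spell out the contents of the first lookup table. One plausible construction, shown here purely as an assumption, stores for each candidate exponent n the largest floating point magnitude that still fits in 8 bits at scale 2^n, and then picks the largest n that covers the data. The names `build_first_table` and `lookup_exponent`, the symmetric limit 127, and the exponent range are all hypothetical.

```python
INT8_MAX = 127


def build_first_table(n_min=-8, n_max=15):
    """Hypothetical table: exponent n -> largest |x| with |x| * 2**n <= 127."""
    return {n: INT8_MAX / (2.0 ** n) for n in range(n_min, n_max + 1)}


def lookup_exponent(table, max_abs):
    """Return the largest n whose limit still covers max_abs.

    A larger n means a finer resolution, so the largest valid n wastes
    the fewest bits of the 8-bit range.
    """
    candidates = [n for n, limit in table.items() if max_abs <= limit]
    if not candidates:
        raise ValueError("value range is not covered by the table")
    return max(candidates)


table = build_first_table()
n = lookup_exponent(table, 1.0)   # data with magnitudes up to 1.0
scaling_factor = 2.0 ** n         # target scaling factor 2**n
print(n, scaling_factor)
```

Because the table depends only on the 8-bit value range, it can be built once and reused for every tensor.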

Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of bits.

Specifically, the weight data of the target neural network is obtained, and the weight data is quantized 8-bit fixed point data. Dot product processing is performed on the weight data and the first target data to obtain the processed 32-bit first target data. The bias data of the target neural network is then obtained, and the processed first target data is processed according to the bias data to obtain the 32-bit second target data. Through the above steps, the accuracy of floating point data quantization is guaranteed.
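A minimal sketch of the dot product step follows, assuming plain Python integers stand in for the hardware registers. The function name `dot_int8` and the explicit range checks are illustrative additions, not part of the source. The point of the example is that products of 8-bit values quickly exceed 8 and even 16 bits, which is why the result is kept at 32 bits.

```python
def dot_int8(weights, activations):
    """Dot product of two int8 vectors, accumulated as a 32-bit value."""
    if len(weights) != len(activations):
        raise ValueError("vectors must have the same length")
    for v in list(weights) + list(activations):
        if not -128 <= v <= 127:
            raise ValueError("inputs must be 8-bit fixed point values")
    acc = sum(w * a for w, a in zip(weights, activations))
    # Python integers do not overflow, so the 32-bit limit is checked by hand.
    if not -2**31 <= acc < 2**31:
        raise OverflowError("accumulator does not fit in 32 bits")
    return acc


# Three maximal products already exceed the int16 limit of 32767.
print(dot_int8([127, 127, 127], [127, 127, 127]))
```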

Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, obtaining the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data of a second preset number of bits includes: performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.

Specifically, the bias data needs to be processed first, so that second target data meeting the requirement can be obtained from it. Since the adopted target scaling factor is 2^n, the bias data only needs to be shifted according to the value of n (i.e., the first quantization parameter). The bias data is shifted to obtain processed bias data, and the processed bias data is then added to the processed first target data to obtain the 32-bit (namely, the second preset number of bits) second target data. Adding the bias data to the processed first target data further improves the accuracy of the quantization.
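The bias shift can be illustrated with a small worked example. It assumes, hypothetically, that the bias was stored as fixed point data at the same scale as the weights (2^7 here), so that a left shift by the first quantization parameter n brings it to the scale of the dot product result. The source only states that the bias is shifted according to n; the storage scale and the names `align_bias` and `add_bias` are assumptions.

```python
def align_bias(bias_q: int, n: int) -> int:
    """Shift the fixed-point bias by n bits (left if n >= 0, else right)."""
    return bias_q << n if n >= 0 else bias_q >> -n


def add_bias(acc: int, bias_q: int, n: int) -> int:
    """Add the shifted bias to the 32-bit dot product result."""
    total = acc + align_bias(bias_q, n)
    if not -2**31 <= total < 2**31:
        raise OverflowError("result does not fit in 32 bits")
    return total


# Activation 0.5 at scale 2**6 -> 32; weight 0.25 at scale 2**7 -> 32.
acc = 32 * 32                      # 1024 at scale 2**13, i.e. 0.125
bias_q = 64                        # bias 0.5 at the assumed scale 2**7
total = add_bias(acc, bias_q, 6)   # shift the bias by n = 6, then add
print(total, total / 2**13)        # real value 0.125 + 0.5
```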

Optionally, in the quantization method of the neural network provided in the first embodiment of the present invention, before dynamically shifting the second target data to obtain the third target data with the first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Specifically, since the weight data of the neural network is fixed-point data after quantization, it is necessary to obtain a second quantization parameter of the weight data when quantization is performed, and then add the first quantization parameter and the second quantization parameter to obtain a target quantization parameter. The target quantization parameter is used for determining the position of the starting point for dynamically shifting the second target data, so that the dynamic shifting work of the second target data can be quickly realized.
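Why the two quantization parameters are added can be seen from the arithmetic: if the activation is stored as x * 2^n1 and the weight as w * 2^n2, their product is (x * w) * 2^(n1 + n2). The sketch below checks this numerically; the names `target_quant_param` and `to_real` are hypothetical helpers, not taken from the source.

```python
import math


def target_quant_param(n_act: int, n_weight: int) -> int:
    """Exponent of the product of two power-of-two scaled values."""
    return n_act + n_weight


def to_real(acc: int, n_total: int) -> float:
    """Interpret a fixed-point integer at scale 2**n_total as a real number."""
    return math.ldexp(acc, -n_total)  # acc * 2**(-n_total)


x_q = 48    # 0.75 at scale 2**6
w_q = -64   # -0.5 at scale 2**7
n_total = target_quant_param(6, 7)
print(n_total, to_real(x_q * w_q, n_total), 0.75 * -0.5)
```

In this reading, the target quantization parameter records where the binary point of the 32-bit result sits, which is the information the later dynamic shift needs.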

Optionally, in the quantization method of the neural network provided in the first embodiment of the present invention, before dynamically shifting the second target data to obtain the third target data with the first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Specifically, a second lookup table is constructed according to the value range of the 32-bit fixed point data; its entries are the values Int8_max * 2^n (n is 8, … 32). Then, when the 32-bit second target data is dynamically shifted to obtain the 8-bit third target data, the number of times that the second target data needs to be shifted (i.e. the third quantization parameter mentioned above) can be found quickly by querying the second lookup table. The calculation efficiency and the quantization precision during quantization are further improved through the above steps.
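One way such a table could be used is sketched below. It assumes that entry k holds 127 * 2^k and that the third quantization parameter is the smallest k for which the 32-bit value fits under that threshold. The index range 0 to 25 used here is an assumption chosen to cover every 32-bit input and differs from the range given in the text; the names `build_second_table` and `lookup_shift` are hypothetical.

```python
INT8_MAX = 127


def build_second_table(max_shift=25):
    """Hypothetical table: entry k holds the threshold INT8_MAX * 2**k."""
    return [INT8_MAX << k for k in range(max_shift + 1)]


def lookup_shift(table, acc):
    """Smallest k with |acc| <= table[k], so that acc >> k fits in 8 bits."""
    magnitude = abs(acc)
    for k, limit in enumerate(table):
        if magnitude <= limit:
            return k
    raise OverflowError("value exceeds the range covered by the table")


table = build_second_table()
k = lookup_shift(table, 5120)
print(k, 5120 >> k)  # shift count and the resulting 8-bit value
```

A linear scan is used for clarity; since the thresholds are sorted, a binary search over the table would work equally well.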

Optionally, in the quantization method of a neural network provided in the first embodiment of the present invention, dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.

Specifically, the starting point position is determined by the target quantization parameter, and the 32-bit second target data is then shifted according to the starting point position and the third quantization parameter, so that the 8-bit third target data can be obtained accurately.
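The following sketch shows one way the starting point position and the shift count could interact. It assumes that the starting point position is the number of fractional bits of the 32-bit value (the target quantization parameter) and that each right shift removes one fractional bit; the function `dynamic_shift` and its return layout are hypothetical.

```python
def dynamic_shift(acc: int, target_q: int, int8_max: int = 127):
    """Shift acc right until it fits into 8 bits.

    Returns (8-bit value, fractional bits left after shifting, shift count).
    `target_q` is the assumed starting point: fractional bits held by acc.
    """
    n = 0
    while abs(acc) > (int8_max << n):  # smallest n with |acc| <= 127 * 2**n
        n += 1
    # Python's >> on a negative int is an arithmetic (flooring) shift.
    return acc >> n, target_q - n, n


out, out_frac_bits, shifts = dynamic_shift(-2304, 11)
real_value = out / 2.0 ** out_frac_bits  # value represented by the 8-bit result
```

Here the shift count depends on the data itself, so large accumulator values lose more fractional bits than small ones, while the returned fractional-bit count keeps the result interpretable.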

In the quantization method of the neural network provided in the first embodiment of the present invention, a target scaling factor of a target neural network is obtained, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed point data of a first preset number of bits; floating point data in the target neural network is converted into first target data with the first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; the first target data is processed according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, wherein the weight data is quantized fixed point data of the first preset number of bits; and the second target data is dynamically shifted to obtain third target data with the first preset number of bits, wherein the third target data is quantized fixed point data. This solves the technical problem in the related art that, when a neural network is quantized, the quantization parameters are a set of statistically generated values, which results in low quantization precision. The target scaling factor is obtained from the first lookup table, the floating point data is converted into the first target data through the target scaling factor, the first target data is converted into the second target data through the weight data and the bias data, and the second target data is then dynamically shifted to obtain the third target data, thereby achieving the effect of improving quantization precision.

Fig. 5 is a first flowchart of an optional neural network quantization method according to an embodiment of the present application. The 8-bit first target data is obtained through the target scaling factor; a dot product (Dot Product) of the 8-bit first target data and the weight data yields the 32-bit processed first target data; the shifted 32-bit bias data is then added to the 32-bit processed first target data to obtain the 32-bit second target data. The 32-bit second target data is shifted to obtain the 8-bit third target data. Fig. 6 is a second flowchart of an optional neural network quantization method according to the first embodiment of the present application; Fig. 6 shows clearly which calculation steps are required to quantize a target neural network. Fig. 7 is a flowchart of the computation of full integer quantization in the prior art. A comparison of the two flowcharts shows that the quantization method provided by the present application is more efficient in calculation. Fig. 8 is a comparison diagram of the two quantization methods: when the floating point data of the target neural network has dimensions [M, N] and the bias data (bias) has dimensions [1, N], the quantization method of the neural network provided by the present application reduces the number of multiplications (mul) by M + N and the number of additions (Add) by N compared with full integer quantization in the prior art, so the present application improves the calculation efficiency of neural network quantization.
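The data flow of Fig. 5 can be sketched end to end for a single output value. This is a minimal illustration under stated assumptions, not the patented implementation: the scaling factors are taken to be powers of two (`q1` fractional bits for the input, `q2` for the weights), the bias is assumed to be stored with `q2` fractional bits so that a left shift by `q1` aligns it with the dot product, and all function names and numbers are hypothetical.

```python
def quantize(values, frac_bits, bits=8):
    """Scale by 2**frac_bits, round, and saturate to signed `bits`-bit integers."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, round(v * 2.0 ** frac_bits))) for v in values]


def quantized_neuron(x, w_q, bias_q, q1, q2):
    """Float input -> 8-bit -> 32-bit accumulate -> dynamically shifted 8-bit."""
    x_q = quantize(x, q1)                       # first target data (8-bit)
    acc = sum(a * b for a, b in zip(x_q, w_q))  # dot product (wide accumulator)
    acc += bias_q << q1                         # shifted bias -> second target data
    n = 0
    while abs(acc) > (127 << n):                # data-dependent shift count
        n += 1
    return acc >> n, q1 + q2 - n                # third target data, fractional bits


q1, q2 = 5, 6                                   # assumed quantization parameters
x = [0.5, -1.25, 2.0]
w_q = quantize([0.25, 0.5, -0.125], q2)         # weights already quantized to 8 bits
bias_q = quantize([0.5], q2, bits=32)[0]        # bias kept as a 32-bit value
out, frac_bits = quantized_neuron(x, w_q, bias_q, q1, q2)
```

With these example numbers the 8-bit result `out` together with `frac_bits` represents the same value as the floating point computation 0.5·0.25 + (−1.25)·0.5 + 2.0·(−0.125) + 0.5; only integer multiplications, additions and shifts are used after the initial conversion.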

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

Under the above operating environment, the present application provides a quantization method of a neural network as shown in fig. 9. Fig. 9 is a flowchart of a quantization method of a neural network according to a second embodiment of the present invention.

Step S901, obtaining a target scaling factor of a target neural network sent by a client, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed point data of a first preset number of bits.

Specifically, the target scaling factor of the target neural network is sent to the processor through the client, and the quantization process is completed through the processor.

Step S902, floating point data in a target neural network is converted into first target data with a first preset number of bits in a server according to a target scaling factor, wherein the first target data is a fixed point number; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.

Specifically, in the server, the floating point data is converted into 8-bit first target data through the target scaling factor, the first target data is then converted into 32-bit second target data through the weight data and the bias data, and the second target data is then dynamically shifted to obtain 8-bit third target data.

Step S903, returns the third target data to the client.

In the server, the specific method for quantizing the floating point data of the neural network is the same as that in embodiment 1, and is not described herein again.

The server quantizes the neural network, so that the efficiency of quantizing the neural network is improved, and the storage pressure of the local terminal is reduced.

Example 3

According to an embodiment of the present invention, there is also provided a quantization apparatus for implementing the neural network, as shown in fig. 10, the apparatus including: a first acquisition unit 1001, a conversion unit 1002, a processing unit 1003 and a shift unit 1004.

Specifically, the first obtaining unit 1001 is configured to obtain a target scaling factor of the target neural network, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed-point data of a first preset number of bits.

The converting unit 1002 is configured to convert floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, where the first target data is fixed point data.

The processing unit 1003 is configured to process the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits, where the weight data is quantized fixed-point data with a first preset number of bits.

And a shifting unit 1004, configured to dynamically shift the second target data to obtain third target data with a first preset number of bits, where the third target data is quantized fixed-point data.

In the quantization apparatus for a neural network provided in the third embodiment of the present invention, the first obtaining unit 1001 obtains a target scaling factor of a target neural network, where the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by a value range of fixed point data of a first preset number of bits; the converting unit 1002 converts floating point data in the target neural network into first target data of the first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; the processing unit 1003 processes the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of bits, wherein the weight data is quantized fixed point data of the first preset number of bits; the shifting unit 1004 dynamically shifts the second target data to obtain third target data with the first preset number of bits, wherein the third target data is quantized fixed point data. This solves the technical problem in the related art that, when a neural network is quantized, the quantization parameters are a set of statistically generated values, which results in low quantization precision.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the first obtaining unit 1001 includes: the establishing module is used for establishing a first query table according to the value range of the fixed point data of the first preset digit; a first determining module for determining a first quantization parameter in a first look-up table based on the floating point data; and the obtaining module is used for obtaining a target scaling factor of the target neural network according to the first quantization parameter.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the processing unit 1003 includes: the first processing module is used for performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and the second processing module is used for acquiring the bias data of the target neural network and processing the processed first target data according to the bias data to obtain second target data with the second preset number of bits.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the second processing module includes: the first processing submodule is used for shifting the bias data according to the first quantization parameter to obtain processed bias data; and the second processing submodule is used for adding the processed first target data and the processed bias data to obtain second target data with the second preset number of bits.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the apparatus further includes: the second acquisition unit is used for acquiring a second quantization parameter corresponding to the weight data before dynamically shifting the second target data to obtain third target data with the first preset number of bits; and the adding unit is used for adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the apparatus further includes: the establishing unit is used for establishing a second lookup table according to the value range of the fixed point data of the second preset number of bits before dynamically shifting the second target data to obtain third target data of the first preset number of bits; and the query unit is used for obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Optionally, in the quantization apparatus of a neural network provided in the third embodiment of the present invention, the shifting unit 1004 includes: the second determining module is used for determining the starting point position when the second target data is dynamically shifted according to the target quantization parameter; and the shifting module is used for dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.

It should be noted here that the first obtaining unit 1001, the converting unit 1002, the processing unit 1003 and the shifting unit 1004 described above correspond to steps S401 to S404 in embodiment 1; these units share the implementation examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the above units may run in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.

Example 4

The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal may execute the program code of the following steps in the quantization method of the neural network: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: obtaining a target scaling factor for a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data with the second preset number of bits.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset number of bits includes: shifting the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data with the second preset number of bits.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Optionally, the computer terminal may further execute program codes of the following steps in the quantization method of the neural network: dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.

Fig. 11 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in fig. 11, the computer terminal 10 may include: one or more (only one shown in fig. 11) processors and a memory.

The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the neural network quantization method and apparatus in the embodiments of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the above-described neural network quantization method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.

Optionally, the processor may further execute the program code of the following steps: obtaining a target scaling factor for a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.

Optionally, the processor may further execute the program code of the following steps: processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data with the second preset number of bits.

Optionally, the processor may further execute the program code of the following steps: acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset number of bits includes: shifting the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data with the second preset number of bits.

Optionally, the processor may further execute the program code of the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Optionally, the processor may further execute the program code of the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Optionally, the processor may further execute the program code of the following steps: dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.

The embodiment of the invention provides a quantization scheme of a neural network. Floating point data is converted into first target data through a target scaling factor, the first target data is converted into second target data through weight data and bias data, and the second target data is then dynamically shifted to obtain third target data. This solves the technical problem in the related art that, when a neural network is quantized, the quantization parameters are a set of statistically generated values, which results in low quantization precision.

It can be understood by those skilled in the art that the structure shown in fig. 11 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 11 does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 11, or have a different configuration than shown in FIG. 11.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

Embodiments of the present invention also provide a computer-readable storage medium. Optionally, in this embodiment, the storage medium may be configured to store program codes executed by the quantization method of the neural network provided in the first embodiment.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a target scaling factor of a target neural network, wherein the target scaling factor is obtained from a first lookup table, and the first lookup table is determined by the value range of fixed point data of a first preset digit; converting floating point data in the target neural network into first target data with a first preset number of bits according to the target scaling factor, wherein the first target data is fixed point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset digit, wherein the weight data is quantized fixed point data of the first preset digit; and dynamically shifting the second target data to obtain third target data with a first preset digit, wherein the third target data is quantized fixed point data.

Optionally, the storage medium is further configured to store program code for performing the following steps: obtaining a target scaling factor for a target neural network includes: establishing a first lookup table according to the value range of the fixed point data of the first preset digit; determining a first quantization parameter in a first look-up table based on the floating point data; and acquiring a target scaling factor of the target neural network according to the first quantization parameter.

Optionally, the storage medium is further configured to store program code for performing the following steps: processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data with a second preset number of bits includes: performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data bits of the processed first target data is the second preset number of bits; and acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data with the second preset number of bits.

Optionally, the storage medium is further configured to store program code for performing the following steps: acquiring bias data of the target neural network, and processing the processed first target data according to the bias data to obtain second target data with a second preset number of bits includes: shifting the bias data according to the first quantization parameter to obtain processed bias data; and adding the processed first target data and the processed bias data to obtain the second target data with the second preset number of bits.

Optionally, the storage medium is further configured to store program code for performing the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: acquiring a second quantization parameter corresponding to the weight data; and adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

Optionally, the storage medium is further configured to store program code for performing the following steps: before dynamically shifting the second target data to obtain third target data with a first preset number of bits, the method further includes: establishing a second lookup table according to the value range of the fixed point data of the second preset digit; and obtaining a third quantization parameter in the second lookup table according to the second target data, wherein the third quantization parameter represents the number of times of dynamic shifting of the second target data.

Optionally, the storage medium is further configured to store program code for performing the following steps: dynamically shifting the second target data to obtain third target data with a first preset number of bits includes: determining a starting point position when the second target data is dynamically shifted according to the target quantization parameter; and dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain third target data with a first preset digit.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A quantization method for a neural network, characterized by comprising:

acquiring the target scaling factor of the target neural network, wherein the target scaling factor is obtained from a first look-up table, and the first look-up table is determined by the value range of the fixed-point data of the first preset number of digits;

converting the floating-point data in the target neural network into first target data of a first preset number of digits according to the target scaling factor, wherein the first target data is fixed-point data;

processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of digits, wherein the weight data is quantized fixed-point data of the first preset number of bits; and

dynamically shifting the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.

2. The method according to claim 1, wherein obtaining the target scaling factor of the target neural network comprises:

establishing the first look-up table according to the value range of the fixed-point data of the first preset number of digits;

determining a first quantization parameter in the first look-up table according to the floating-point data; and

obtaining the target scaling factor of the target neural network according to the first quantization parameter.

3. The method according to claim 2, wherein processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain the second target data of the second preset number of digits comprises:

performing dot product processing on the weight data of the target neural network and the first target data to obtain processed first target data, wherein the number of data digits of the processed first target data is the second preset number of digits; and

acquiring the bias data of the target neural network, and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of digits.

4. The method according to claim 3, wherein acquiring the bias data of the target neural network and processing the processed first target data according to the bias data to obtain the second target data of the second preset number of digits comprises:

performing shift processing on the bias data according to the first quantization parameter to obtain processed bias data; and

adding the processed first target data and the processed bias data to obtain the second target data of the second preset number of bits.

5. The method according to claim 4, wherein before dynamically shifting the second target data to obtain the third target data of the first preset number of digits, the method further comprises:

obtaining a second quantization parameter corresponding to the weight data; and

adding the first quantization parameter and the second quantization parameter to obtain a target quantization parameter.

6. The method according to claim 5, wherein before dynamically shifting the second target data to obtain the third target data of the first preset number of digits, the method further comprises:

establishing a second look-up table according to the value range of the fixed-point data of the second preset number of digits; and

obtaining a third quantization parameter from the second look-up table according to the second target data, wherein the third quantization parameter represents the number of times the second target data is dynamically shifted.

7. The method according to claim 6, wherein dynamically shifting the second target data to obtain the third target data of the first preset number of digits comprises:

determining, according to the target quantization parameter, a starting point position for dynamically shifting the second target data; and

dynamically shifting the second target data according to the starting point position and the third quantization parameter to obtain the third target data of the first preset number of bits.

8. A quantization method for a neural network, characterized by comprising:

obtaining the target scaling factor of the target neural network sent by the client, wherein the target scaling factor is obtained from a first look-up table, and the first look-up table is determined by the value range of the fixed-point data of the first preset number of digits;

in the server, converting the floating-point data in the target neural network into first target data of a first preset number of digits according to the target scaling factor, wherein the first target data is fixed-point data; processing the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of digits, wherein the weight data is quantized fixed-point data of the first preset number of digits; and dynamically shifting the second target data to obtain third target data of the first preset number of digits, wherein the third target data is quantized fixed-point data; and

returning the third target data to the client.

9. A quantization device for a neural network, characterized by comprising:

a first obtaining unit, configured to obtain a target scaling factor of the target neural network, wherein the target scaling factor is obtained from a first look-up table, and the first look-up table is determined by the value range of the fixed-point data of the first preset number of digits;

a conversion unit, configured to convert the floating-point data in the target neural network into first target data of a first preset number of digits according to the target scaling factor, wherein the first target data is fixed-point data;

a processing unit, configured to process the first target data according to the weight data of the target neural network and the bias data of the target neural network to obtain second target data of a second preset number of digits, wherein the weight data is quantized fixed-point data of the first preset number of digits; and

a shifting unit, configured to dynamically shift the second target data to obtain third target data of the first preset number of bits, wherein the third target data is quantized fixed-point data.

10. A computer-readable storage medium, characterized in that the storage medium stores a program, wherein, when the program is run, a device where the storage medium is located is controlled to execute the quantization method for a neural network according to any one of claims 1 to 7.

11. A processor, characterized in that the processor is configured to run a program, wherein, when the program runs, the quantization method for a neural network according to any one of claims 1 to 7 is executed.