.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "how_to/deploy_models/deploy_prequantized_tflite.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py:


Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)
================================================================
**Author**: Siju Samuel

Welcome to part 3 of the Deploy Framework-Prequantized Model with TVM tutorial.
In this part, we will start with a quantized TFLite graph and then compile and execute it via TVM.

For more details on quantizing the model using TFLite, readers are encouraged to
go through the *Converting Quantized Models* guide in the TFLite documentation.
The pre-quantized TFLite models can be downloaded from the TensorFlow website.

To get started, the TensorFlow and TFLite packages need to be installed as prerequisites.

.. code-block:: bash

    # install tensorflow and tflite
    pip install tensorflow==2.1.0
    pip install tflite==2.1.0

Now please check if the TFLite package was installed successfully:
``python -c "import tflite"``

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Necessary imports
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 48-57

.. code-block:: default


    import os

    import numpy as np

    import tflite

    import tvm
    from tvm import relay

.. GENERATED FROM PYTHON SOURCE LINES 58-60

Download pretrained Quantized TFLite model
------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 60-76

.. code-block:: default


    # Download mobilenet V2 TFLite model provided by Google
    from tvm.contrib.download import download_testdata

    model_url = (
        "https://storage.googleapis.com/download.tensorflow.org/models/"
        "tflite_11_05_08/mobilenet_v2_1.0_224_quant.tgz"
    )

    # Download model tar file and extract it to get mobilenet_v2_1.0_224_quant.tflite
    model_path = download_testdata(
        model_url, "mobilenet_v2_1.0_224_quant.tgz", module=["tf", "official"]
    )
    model_dir = os.path.dirname(model_path)

.. GENERATED FROM PYTHON SOURCE LINES 77-79

Utils for downloading and extracting tar files
----------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 79-94

.. code-block:: default


    def extract(path):
        import tarfile

        if path.endswith("tgz") or path.endswith("gz"):
            dir_path = os.path.dirname(path)
            tar = tarfile.open(path)
            tar.extractall(path=dir_path)
            tar.close()
        else:
            raise RuntimeError("Could not decompress the file: " + path)


    extract(model_path)

.. GENERATED FROM PYTHON SOURCE LINES 95-97

Load a test image
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 99-101

Get a real image for e2e testing
--------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 101-116

.. code-block:: default


    def get_real_image(im_height, im_width):
        from PIL import Image

        repo_base = "https://github.com/dmlc/web-data/raw/main/tensorflow/models/InceptionV1/"
        img_name = "elephant-299.jpg"
        image_url = os.path.join(repo_base, img_name)
        img_path = download_testdata(image_url, img_name, module="data")
        image = Image.open(img_path).resize((im_height, im_width))
        x = np.array(image).astype("uint8")
        data = np.reshape(x, (1, im_height, im_width, 3))
        return data


    data = get_real_image(224, 224)
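Note that the image loader above intentionally keeps the pixels as raw ``uint8`` values in
NHWC layout, which is what the quantized MobileNet V2 consumes; the input quantization
parameters are stored inside the TFLite model itself, so no mean/std normalization is applied.
A minimal sanity check on the tensor produced above (an addition to the generated script,
not part of the original tutorial code):

.. code-block:: python

    # The image loader above returns a batch of one NHWC uint8 image.
    assert data.shape == (1, 224, 224, 3)
    assert data.dtype == np.uint8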
.. GENERATED FROM PYTHON SOURCE LINES 117-119

Load a tflite model
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 121-122

Now we can open mobilenet_v2_1.0_224_quant.tflite

.. GENERATED FROM PYTHON SOURCE LINES 122-135

.. code-block:: default


    tflite_model_file = os.path.join(model_dir, "mobilenet_v2_1.0_224_quant.tflite")
    tflite_model_buf = open(tflite_model_file, "rb").read()

    # Get TFLite model from buffer
    try:
        import tflite

        tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
    except AttributeError:
        import tflite.Model

        tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)

.. GENERATED FROM PYTHON SOURCE LINES 136-137

Let's run inference on the TFLite pre-quantized model and get the TFLite prediction.

.. GENERATED FROM PYTHON SOURCE LINES 137-168

.. code-block:: default


    def run_tflite_model(tflite_model_buf, input_data):
        """Generic function to execute TFLite"""
        try:
            from tensorflow import lite as interpreter_wrapper
        except ImportError:
            from tensorflow.contrib import lite as interpreter_wrapper

        input_data = input_data if isinstance(input_data, list) else [input_data]

        interpreter = interpreter_wrapper.Interpreter(model_content=tflite_model_buf)
        interpreter.allocate_tensors()

        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()

        # set input
        assert len(input_data) == len(input_details)
        for i in range(len(input_details)):
            interpreter.set_tensor(input_details[i]["index"], input_data[i])

        # Run
        interpreter.invoke()

        # get output
        tflite_output = list()
        for i in range(len(output_details)):
            tflite_output.append(interpreter.get_tensor(output_details[i]["index"]))

        return tflite_output

.. GENERATED FROM PYTHON SOURCE LINES 169-170

Let's run inference on the TVM compiled pre-quantized model and get the TVM prediction.

.. GENERATED FROM PYTHON SOURCE LINES 170-181

.. code-block:: default


    def run_tvm(lib):
        from tvm.contrib import graph_executor

        rt_mod = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
        rt_mod.set_input("input", data)
        rt_mod.run()
        tvm_res = rt_mod.get_output(0).numpy()
        tvm_pred = np.squeeze(tvm_res).argsort()[-5:][::-1]
        return tvm_pred, rt_mod

.. GENERATED FROM PYTHON SOURCE LINES 182-184

TFLite inference
----------------

.. GENERATED FROM PYTHON SOURCE LINES 186-187

Run TFLite inference on the quantized model.

.. GENERATED FROM PYTHON SOURCE LINES 187-190

.. code-block:: default


    tflite_res = run_tflite_model(tflite_model_buf, data)
    tflite_pred = np.squeeze(tflite_res).argsort()[-5:][::-1]

.. GENERATED FROM PYTHON SOURCE LINES 191-193

TVM compilation and inference
-----------------------------

.. GENERATED FROM PYTHON SOURCE LINES 195-199

We use the TFLite-Relay parser to convert the TFLite pre-quantized graph into Relay IR.
Note that the frontend parser call for a pre-quantized model is exactly the same as the
frontend parser call for an FP32 model. We encourage you to remove the comment from
``print(mod)`` and inspect the Relay module. You will see many QNN operators, such as
Requantize, Quantize and QNN Conv2D.

.. GENERATED FROM PYTHON SOURCE LINES 199-205

.. code-block:: default


    dtype_dict = {"input": data.dtype.name}
    shape_dict = {"input": data.shape}

    mod, params = relay.frontend.from_tflite(tflite_model, shape_dict=shape_dict, dtype_dict=dtype_dict)
    # print(mod)
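If you prefer not to scan the full ``print(mod)`` output, a small visitor can tabulate how
often each operator appears in the imported module. The snippet below is an illustrative
sketch built on Relay's ``ExprVisitor`` utility and is not part of the generated script; for
the quantized MobileNet V2 you should see ``qnn.conv2d`` and ``qnn.requantize`` among the
most frequent entries.

.. code-block:: python

    from collections import Counter

    from tvm.relay.expr_functor import ExprVisitor


    class OpCounter(ExprVisitor):
        """Count how many times each operator is called in a Relay function."""

        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def visit_call(self, call):
            if isinstance(call.op, tvm.ir.Op):
                self.counts[call.op.name] += 1
            super().visit_call(call)  # keep traversing the call arguments


    counter = OpCounter()
    counter.visit(mod["main"])
    print(counter.counts.most_common(10))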
.. GENERATED FROM PYTHON SOURCE LINES 206-208

Let's now compile the Relay module. We use the "llvm" target here. Please replace it with
the target platform that you are interested in.

.. GENERATED FROM PYTHON SOURCE LINES 208-212

.. code-block:: default


    target = "llvm"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build_module.build(mod, target=target, params=params)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
      "target_host parameter is going to be deprecated. "

.. GENERATED FROM PYTHON SOURCE LINES 213-214

Finally, let's call inference on the TVM compiled module.

.. GENERATED FROM PYTHON SOURCE LINES 214-216

.. code-block:: default


    tvm_pred, rt_mod = run_tvm(lib)

.. GENERATED FROM PYTHON SOURCE LINES 217-219

Accuracy comparison
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 221-224

Print the top-5 labels for TFLite and TVM inference. We compare labels rather than raw
output values because the requantize implementation differs between TFLite and Relay,
which causes the final output numbers to mismatch slightly. So, we test accuracy via labels.

.. GENERATED FROM PYTHON SOURCE LINES 224-229

.. code-block:: default


    print("TVM Top-5 labels:", tvm_pred)
    print("TFLite Top-5 labels:", tflite_pred)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TVM Top-5 labels: [387 102 386 341 349]
    TFLite Top-5 labels: [387 102 386 341 349]

.. GENERATED FROM PYTHON SOURCE LINES 230-233

Measure performance
-------------------
Here we give an example of how to measure performance of TVM compiled models.

.. GENERATED FROM PYTHON SOURCE LINES 233-237

.. code-block:: default


    n_repeat = 100  # should be bigger to make the measurement more accurate
    dev = tvm.cpu(0)
    print(rt_mod.benchmark(dev, number=1, repeat=n_repeat))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Execution time summary:
     mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
     119.0358     119.0219      120.1252     118.4347      0.2873

.. GENERATED FROM PYTHON SOURCE LINES 238-263

.. note::

    Unless the hardware has special support for fast 8-bit instructions, quantized models are
    not expected to be any faster than FP32 models. Without fast 8-bit instructions, TVM does
    quantized convolution in 16 bit, even if the model itself is 8 bit.

    For x86, the best performance can be achieved on CPUs with the AVX512 instruction set.
    In this case, TVM utilizes the fastest available 8-bit instructions for the given target.
    This includes support for the VNNI 8-bit dot product instruction (Cascade Lake or newer).
    For an EC2 C5.12xlarge instance, TVM latency for this tutorial is ~2 ms.

    The Intel conv2d NCHWc schedule on ARM gives better end-to-end latency compared to the
    ARM NCHW conv2d spatial pack schedule for many TFLite networks. ARM winograd performance
    is higher, but it has a higher memory footprint.

    Moreover, the following general tips for CPU performance equally apply:

    * Set the environment variable TVM_NUM_THREADS to the number of physical cores.
    * Choose the best target for your hardware, such as "llvm -mcpu=skylake-avx512" or
      "llvm -mcpu=cascadelake" (more CPUs with AVX512 will come in the future); a short
      sketch of this and the previous tip follows this note.
    * Perform autotuning - see the tutorial *Auto-tuning a convolution network for x86 CPU*.
    * To get the best inference performance on ARM CPU, change the target argument according
      to your device and follow the tutorial *Auto-tuning a convolution network for ARM CPU*.
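The following sketch illustrates the first two tips from the note above. It is not part of
the generated script; the thread count (4) and the ``cascadelake`` CPU name are placeholders
that you should adjust to your own machine.

.. code-block:: python

    import os

    # Tip 1: pin the TVM runtime thread pool to the number of *physical* cores.
    # This takes effect only if set before the first TVM execution in the process
    # (or export it in the shell: `export TVM_NUM_THREADS=4`).
    os.environ["TVM_NUM_THREADS"] = "4"  # placeholder: use your physical core count

    # Tip 2: build for a target that matches your CPU so TVM can pick fast
    # 8-bit instructions (e.g. VNNI on Cascade Lake and newer).
    target = "llvm -mcpu=cascadelake"  # or "llvm -mcpu=skylake-avx512"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build_module.build(mod, target=target, params=params)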
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 1 minutes  51.389 seconds)


.. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: deploy_prequantized_tflite.py <deploy_prequantized_tflite.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: deploy_prequantized_tflite.ipynb <deploy_prequantized_tflite.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_