Dedoc installation

Dedoc can be installed and run either as a web application or as a library. Both options are described below.

Install and run dedoc using docker

This method requires git and docker. It is the more flexible option because it does not depend on the host operating system or on other limitations of the user's environment. Docker itself, however, must be installed and configured properly.
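A small pre-flight check can confirm that both tools are on the PATH before you clone anything. This is only a sketch: the helper name require_cmds is hypothetical and not part of dedoc, and it checks that the commands exist, not that docker is configured correctly.

```shell
# Report every required command that is not found on PATH.
# Returns 0 if all commands are found, 1 if at least one is missing.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `require_cmds git docker` prints each missing tool to stderr and returns a non-zero status if any of them is absent.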

  1. Clone the repository

git clone https://github.com/ispras/dedoc

  2. Go to the dedoc directory

cd dedoc

  3. Build the image and run the application

docker compose up --build

If you need to change some application settings, you may update config.py according to your needs and re-build the image.

If you do not need to change the application configuration, you can use the prebuilt docker image instead.

  1. Pull the image

docker pull dedocproject/dedoc

  2. Run the container

docker run -p 1231:1231 --rm dedocproject/dedoc python3 /dedoc_root/dedoc/main.py
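The container may need some time before it accepts requests on port 1231. The sketch below retries a command until it succeeds. The helper name wait_until is hypothetical, and the usage assumes that curl is installed and that the service answers HTTP requests at the root URL, which this guide does not state.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# $1 = maximum number of attempts, $2 = delay in seconds between attempts,
# remaining arguments = the command to run (its output is discarded).
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until 30 2 curl -fsS http://localhost:1231/` tries up to 30 times, two seconds apart, and returns a non-zero status if the service never responds.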

See Docker Hub for more information about the available dedoc images.

Install dedoc using pip

If you do not want to use docker to run the application, you can run dedoc locally. However, this method is not suitable for every operating system (Ubuntu 20+ is recommended), and the machine may not have enough resources to run it. You need python (python3.9 or python3.10 is recommended) and pip installed.

1. Install necessary packages:

sudo apt-get install -y libreoffice djvulibre-bin unzip unrar 

libreoffice and djvulibre-bin packages are used by converters (doc, odt, rtf to docx; xls, ods to xlsx; ppt, odp to pptx; djvu to pdf). If you don’t need converters, you can skip this step. unzip and unrar packages are used in the process of extracting archives.

2. Install Tesseract OCR 5 framework.

You can follow any tutorial for this purpose, look at how Tesseract is installed for the dedoc container, or use the following commands to build Tesseract OCR 5 from sources:

2.1. Install compilers and libraries required by the Tesseract OCR:

sudo apt-get update
sudo apt-get install -y automake binutils-dev build-essential ca-certificates clang g++ g++-multilib gcc-multilib libcairo2 libffi-dev \
    libgdk-pixbuf2.0-0 libglib2.0-dev libjpeg-dev libleptonica-dev libpango-1.0-0 libpango1.0-dev libpangocairo-1.0-0 libpng-dev libsm6 \
    libtesseract-dev libtool libxext6 make pkg-config poppler-utils pstotext shared-mime-info software-properties-common swig zlib1g-dev

2.2. Build Tesseract from sources:

sudo add-apt-repository -y ppa:alex-p/tesseract-ocr-devel
sudo apt-get update --allow-releaseinfo-change
sudo apt-get install -y tesseract-ocr tesseract-ocr-rus
git clone --depth 1 --branch 5.0.0-beta-20210916 https://github.com/tesseract-ocr/tesseract/
cd tesseract && ./autogen.sh && sudo ./configure && sudo make && sudo make install && sudo ldconfig && cd ..
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata/
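After the build, it can help to confirm that TESSDATA_PREFIX points to a directory that actually contains language models, which Tesseract stores as *.traineddata files. The helper name has_traineddata is hypothetical. This sketch only checks that such files exist, not that they match the installed Tesseract version.

```shell
# Return 0 if the given directory exists and contains at least one
# *.traineddata file (a Tesseract language model), 1 otherwise.
has_traineddata() {
    dir=$1
    [ -d "$dir" ] || return 1
    for f in "$dir"/*.traineddata; do
        # If nothing matches, the pattern stays unexpanded and -e fails.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}
```

A typical check is `has_traineddata "$TESSDATA_PREFIX"`, followed by `tesseract --version` and `tesseract --list-langs` to see which version and languages Tesseract itself reports.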

3. Install the dedoc library via pip.

You need torch~=1.11.0 and torchvision~=0.12.0 installed. If you already have torch and torchvision in your environment:

pip install dedoc 

Or you can install dedoc with torch and torchvision included:

pip install "dedoc[torch]" 
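The requirement torch~=1.11.0 accepts any 1.11.x release but not 1.12 or later, and torchvision~=0.12.0 likewise accepts any 0.12.x release. The sketch below checks whether a version string falls into a given release series. The helper name in_series is hypothetical, and the check is a plain prefix comparison rather than a full implementation of pip's version rules.

```shell
# Return 0 if version $1 belongs to release series $2,
# for example 1.11.3 or 1.11.0+cu113 in series 1.11.
in_series() {
    case "$1" in
        "$2" | "$2".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With torch already installed, `in_series "$(python3 -c 'import torch; print(torch.__version__)')" 1.11` returns 0 when the installed version satisfies the requirement.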

Install and run dedoc from sources

If you want to run dedoc as a service from sources, you can run it locally. However, this method is not suitable for every operating system (Ubuntu 20+ is recommended), and the machine may not have enough resources to run it. You need python (python3.8 or python3.9 is recommended) and pip installed.

  1. Install the necessary packages according to step 1 ("Install necessary packages") of "Install dedoc using pip".

  2. Build Tesseract from sources according to step 2 ("Install Tesseract OCR 5 framework") of "Install dedoc using pip".

  3. We recommend using a python virtual environment (for example, via virtualenvwrapper).

Below are the instructions for installing the package virtualenvwrapper:

sudo pip3 install virtualenv virtualenvwrapper
mkdir ~/.virtualenvs
export WORKON_HOME=~/.virtualenvs
echo "export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.8" >> ~/.bashrc
echo ". /usr/local/bin/virtualenvwrapper.sh" >> ~/.bashrc
source ~/.bashrc
mkvirtualenv dedoc_env

  4. Install the python requirements and launch the dedoc service on the default port 1231:

# clone the dedoc project
git clone https://github.com/ispras/dedoc.git
cd dedoc
# activate your python environment
workon dedoc_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
pip install torch==1.11.0 torchvision==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html
python dedoc/main.py -c ./dedoc/config.py

Install trusted torch (verified version)

You can install a trusted build of the torch library, that is, a version verified by tools developed by the Ivannikov Institute for System Programming of the Russian Academy of Sciences.

First, install two required packages:

sudo apt-get install -y mpich intel-mkl 

Second, install torch and torchvision from the prebuilt wheels:

For python3.8:

pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torch-1.11.0a0+git137096a-cp38-cp38-linux_x86_64.whl
pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torchvision-0.12.0a0%2B9b5a3fe-cp38-cp38-linux_x86_64.whl

For python3.9:

pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torch-1.11.0a0+git137096a-cp39-cp39-linux_x86_64.whl
pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torchvision-0.12.0a0%2B9b5a3fe-cp39-cp39-linux_x86_64.whl
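The two sets of commands above differ only in the Python tag (cp38 or cp39), so the choice can be scripted. The sketch below prints the matching wheel URLs for a given Python version. The helper name trusted_wheel_urls and the variable WHEELS are hypothetical, and the URLs are copied from the commands above without checking that they are still available.

```shell
# Base URL of the wheels, taken from the commands listed above.
WHEELS=https://github.com/ispras/dedockerfiles/raw/master/wheels

# Print the torch and torchvision wheel URLs for a Python version such as 3.8.
# Only 3.8 and 3.9 are listed in this guide; any other version is rejected.
trusted_wheel_urls() {
    case "$1" in
        3.8) tag=cp38 ;;
        3.9) tag=cp39 ;;
        *)
            echo "no trusted wheel listed for python $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$WHEELS/torch-1.11.0a0+git137096a-$tag-$tag-linux_x86_64.whl"
    printf '%s\n' "$WHEELS/torchvision-0.12.0a0%2B9b5a3fe-$tag-$tag-linux_x86_64.whl"
}
```

For example, `trusted_wheel_urls 3.9` prints the two URLs for python3.9, one per line, and each of them can then be passed to pip install.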