tips-OCR
Open source OCR
google-tesseract
https://github.com/tesseract-ocr/tesseract
https://github.com/tesseract-ocr
NHocr
https://ja.osdn.net/projects/nhocr/
INSTALL
zypper install autoconf aclocal libtool autoheader automake
zypper install gcc-c++ libjpeg8-devel libpng16-devel libtiff-devel autoconf
zypper install libicu-devel pango-devel pangomm-devel
zypper install leptonica-devel leptonica-tools
# zypper install leptonica
## zypper install tesseract-ocr tesseract-ocr-devel
zypper install libtesseract3 tesseract-ocr tesseract-ocr-traineddata-english tesseract-ocr-traineddata-japanese
pip3.6 install pyocr   # needed to call Tesseract from Python
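Once pyocr is installed, calling Tesseract from Python might look like the sketch below. The function names `choose_languages` and `ocr_image` are made up for this note, and layout mode 6 (a single uniform block of text) is only an assumed starting point; the pyocr calls themselves (`get_available_tools`, `get_available_languages`, `image_to_string`, `TextBuilder`) are the library's usual entry points.

```python
def choose_languages(available, wanted=("jpn", "eng")):
    """Join the wanted languages that are actually installed, e.g. 'jpn+eng'."""
    picked = [lang for lang in wanted if lang in available]
    if not picked:
        raise ValueError(f"none of {list(wanted)} found in {sorted(available)}")
    return "+".join(picked)


def ocr_image(image_path, wanted=("jpn", "eng")):
    """Run the first OCR tool pyocr finds on one image and return the text."""
    # Imported lazily so choose_languages() works without pyocr/Pillow installed.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("pyocr found no OCR tool; is tesseract on PATH?")
    tool = tools[0]  # Tesseract, if it is the only engine installed

    lang = choose_languages(tool.get_available_languages(), wanted)
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,
        # tesseract_layout maps to Tesseract's --psm option (6 = one text block)
        builder=pyocr.builders.TextBuilder(tesseract_layout=6),
    )
```

Usage would be something like `print(ocr_image("scan.png"))`, where `scan.png` is a placeholder file name. Tesseract accepts several languages joined with `+`, which is why the helper builds a string such as `jpn+eng`.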
# install
$ git clone https://github.com/tesseract-ocr/tesseract.git
$ git clone https://github.com/tesseract-ocr/langdata.git
zypper install automake autoconf libtool
# cd tesseract
$ git checkout 4.1   # use ver 4.1 (the latest is 5.0)
$ git submodule update --init --recursive
# autoconf (fails with an error the first time, so run it again)
$ autoconf
# autoreconf --install
# ./configure --prefix=$HOME/opt
# make
# make check
# make install
# make training
# make training-install
# # wget tesseract-4.0.0.tar.gz
# # tar xvfz tesseract-4.0.0.tar.gz
# # cd tesseract-4.0.0
# # ./autogen.sh
# # ./configure --prefix=$HOME/opt/tess
# # make
# # make install
# # make check
CPPFLAGS="-I$HOME/opt/include -L$HOME/opt/lib64" pip3.7 install tesserocr
# Install the trained language data
# Japanese needs two files: jpn and jpn_vert
cd $HOME/opt/share/tessdata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/jpn.traineddata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/jpn_vert.traineddata
## sudo mv *.traineddata /usr/local/share/tessdata/
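A quick way to confirm that the language files landed where Tesseract will look is sketched below. `missing_traineddata` and `default_tessdata_dir` are hypothetical helper names, and the fallback path simply mirrors the `$HOME/opt/share/tessdata` directory used above. Whether `TESSDATA_PREFIX` should name the tessdata directory itself or its parent depends on the Tesseract version, so treat that part as an assumption to check.

```python
import os
from pathlib import Path


def missing_traineddata(tessdata_dir, langs=("eng", "jpn", "jpn_vert")):
    """Return the languages whose <lang>.traineddata file is absent."""
    root = Path(tessdata_dir)
    return [lang for lang in langs if not (root / f"{lang}.traineddata").is_file()]


def default_tessdata_dir():
    """Assumed location: $TESSDATA_PREFIX if set, else $HOME/opt/share/tessdata."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    return prefix or str(Path.home() / "opt" / "share" / "tessdata")
```

For example, `missing_traineddata(default_tessdata_dir())` returns an empty list when all three files are present, and otherwise names the languages that still need to be downloaded.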
Tune
- https://laplace-daemon.com/training-tesseract/
- https://qiita.com/aki_abekawa/items/418e069038fbdb77c59e
ToDo
- https://qiita.com/atuyosi/items/c0933b5edf605c4a7c19
- https://qiita.com/bohemian916/items/67f22ee7aeac103dd205
- http://hadashi-gensan.hatenablog.com/entry/2013/10/14/170129
- http://hadashi-gensan.hatenablog.com/entry/2014/01/15/135316
- http://a244.hateblo.jp/entry/2015/07/28/060803
- https://ebi-works.com/ocr-python/
- https://www.kkaneko.jp/tools/ubuntu/tesseract_buildout.html
Preprocessing
Binarization
Increase the resolution
Correct the skew of the text lines
Split Bregman (noise removal)
- https://lp-tech.net/articles/CY2Kn
- https://lp-tech.net/articles/tkPFr
- https://qiita.com/MuAuan/items/3962e24ece1860759429
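Of the preprocessing steps listed above, binarization is the easiest to show in a few lines. The sketch below implements Otsu's method in plain Python on a flat list of 8-bit gray values. Otsu is just one common choice of threshold and is not necessarily what the linked articles use. The function names are made up for this note, and in practice a library routine (for example OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag) would normally replace it. Upscaling, deskewing, and Split Bregman denoising are not covered here.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit gray values (0..255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold=None):
    """Map each pixel to 0 (at or below threshold) or 255 (above it)."""
    t = otsu_threshold(pixels) if threshold is None else threshold
    return [255 if p > t else 0 for p in pixels]
```

With a Pillow image, the input list could come from `list(img.convert("L").getdata())`, and the result can be written back with `putdata` before handing the image to Tesseract.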
Notes
Google Cloud Vision API
pip3.7 install --upgrade google-cloud-vision
pip3.7 install --upgrade pymupdf
Create a service account (authentication key)
- https://cloud.google.com/docs/authentication/getting-started?hl=ja
- https://cloud.google.com/docs/authentication/production
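With the two packages installed and a service-account key downloaded, the flow is roughly: render a PDF page to PNG with PyMuPDF, then send the bytes to the Vision API. The sketch below assumes a recent google-cloud-vision (2.x, where `vision.Image` exists) and a PyMuPDF version that has `Page.get_pixmap`; older releases use different names. The three function names and the 300 dpi default are choices made for this note only.

```python
import os


def credentials_configured(env=None):
    """True if GOOGLE_APPLICATION_CREDENTIALS points at an existing key file."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


def pdf_page_to_png(pdf_path, page_index=0, dpi=300):
    """Render one PDF page to PNG bytes with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the check above needs no extras

    zoom = dpi / 72  # PDF user space is 72 units per inch
    with fitz.open(pdf_path) as doc:
        pix = doc[page_index].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")


def ocr_png_bytes(png_bytes):
    """Send image bytes to Cloud Vision and return the detected text."""
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()  # reads the key from the environment
    response = client.document_text_detection(image=vision.Image(content=png_bytes))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```

A typical call would be `ocr_png_bytes(pdf_page_to_png("scan.pdf"))` after exporting `GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json`; both paths are placeholders. `credentials_configured()` is only a local sanity check and does not verify that the key is valid.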