How to Convert Scanned Documents to Text in Java


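Whether you want to minimize storage space, display documents online, or edit them electronically, optical character recognition (OCR) is a good way to digitize printed text. It is especially useful for businesses as a form of data entry for documents such as invoices, bank statements, mail, and electronic receipts.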

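Early versions of OCR technology had to be trained with an image of each character, and the earliest model was in fact created in 1914 to convert scanned text into telegraph or audio code for visually impaired people. As you might guess, OCR has come a long way since the early 1900s: current versions can digitize most fonts with high recognition accuracy across many file formats.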

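This tutorial focuses on an OCR API that converts scanned images of documents to text. Note that this particular API is intended only for scanned documents; if you want to use OCR to convert a photo to text, use our photo-to-text function instead, because it is designed to deskew the image before conversion.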

To begin, we first install the SDK package with Maven by adding a reference to the repository in pom.xml:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

Next, we add a reference to the dependency:

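<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <!-- The version below is an assumption; use the latest release tag listed on JitPack -->
        <version>v3.54</version>
    </dependency>
</dependencies>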

Once installation is complete, we can add the imports to the top of the file and call the image-to-text function with the following code:

// Import classes:
import com.cloudmersive.client.ImageOcrApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.model.ImageToTextResponse;
import java.io.File;

public class ScannedDocumentOcr {
    public static void main(String[] args) {
        ApiClient defaultClient = Configuration.getDefaultApiClient();

        // Configure API key authorization: Apikey
        ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
        apiKeyAuth.setApiKey("YOUR API KEY");
        // Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
        //apiKeyAuth.setApiKeyPrefix("Token");

        ImageOcrApi apiInstance = new ImageOcrApi();
        File imageFile = new File("/path/to/inputfile"); // File | Image file to perform OCR on.  Common file formats such as PNG, JPEG are supported.
        String recognitionMode = "Advanced"; // String | Optional; possible values are 'Basic', which provides basic recognition, is not resilient to page rotation, skew or low quality images, and uses 1-2 API calls; 'Normal', which provides highly fault tolerant OCR recognition and uses 26-30 API calls; and 'Advanced', which provides the highest quality and most fault-tolerant recognition and uses 28-30 API calls.  Default recognition mode is 'Advanced'
        String language = "ENG"; // String | Optional, language of the input document, default is English (ENG).  Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Haitian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
        String preprocessing = "Auto"; // String | Optional, preprocessing mode, default is 'Auto'.  Possible values are None (no preprocessing of the image), and Auto (automatic image enhancement of the image before OCR is applied; this is recommended).

        try {
            // Upload the image and print the recognized text response
            ImageToTextResponse result = apiInstance.imageOcrPost(imageFile, recognitionMode, language, preprocessing);
            System.out.println(result);
        } catch (ApiException e) {
            System.err.println("Exception when calling ImageOcrApi#imageOcrPost");
            e.printStackTrace();
        }
    }
}

The imageOcrPost call returns the text of the image you uploaded. To get accurate results, verify that the following parameters are supplied:

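  • API key: available from the Cloudmersive website; if you register a free account, you receive a personal API key with 800 calls per month.
  • Image file: the image to perform OCR on; common file formats such as PNG and JPEG are supported.
  • Recognition mode: optional; possible values are:
      • Basic: provides basic recognition and is not resilient to page rotation, skew, or low-quality images; uses 1-2 API calls
      • Normal: provides highly fault-tolerant OCR recognition; uses 26-30 API calls
      • Advanced: provides the highest-quality and most fault-tolerant recognition; uses 28-30 API calls. The default recognition mode is 'Advanced'
  • Language: optional; the default is English (ENG). Possible values include ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Haitian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), and ITA (Italian); the comment on the language parameter in the code above lists the complete set of codes.
  • Preprocessing: optional; the default preprocessing mode is 'Auto'. Possible values are:
      • None: no preprocessing of the image
      • Auto: automatic enhancement of the image before OCR is applied (recommended)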
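
Because the free account is capped at 800 calls per month and each recognition mode consumes a different number of calls, it can help to estimate how many images fit in that quota. The sketch below is only arithmetic on the figures quoted above. `OcrCallBudget`, `Mode`, and the method names are hypothetical and not part of the Cloudmersive SDK, and the code assumes each image is submitted as a single request.

```java
/** Hypothetical helper for estimating OCR quota usage; not part of the Cloudmersive SDK. */
final class OcrCallBudget {

    /** Recognition modes with the per-request API call ranges quoted in this article. */
    enum Mode {
        BASIC("Basic", 1, 2),
        NORMAL("Normal", 26, 30),
        ADVANCED("Advanced", 28, 30);

        final String apiValue; // string passed as the recognitionMode parameter
        final int minCalls;    // fewest API calls one request may consume
        final int maxCalls;    // most API calls one request may consume

        Mode(String apiValue, int minCalls, int maxCalls) {
            this.apiValue = apiValue;
            this.minCalls = minCalls;
            this.maxCalls = maxCalls;
        }

        /** Looks up a mode by its API string, ignoring case. */
        static Mode fromApiValue(String value) {
            for (Mode mode : values()) {
                if (mode.apiValue.equalsIgnoreCase(value)) {
                    return mode;
                }
            }
            throw new IllegalArgumentException("Unknown recognition mode: " + value);
        }
    }

    private OcrCallBudget() {}

    /** Images that fit even if every request costs the maximum number of calls. */
    static int worstCaseImages(Mode mode, int monthlyCalls) {
        requireNonNegative(monthlyCalls);
        return monthlyCalls / mode.maxCalls;
    }

    /** Images that fit if every request costs the minimum number of calls. */
    static int bestCaseImages(Mode mode, int monthlyCalls) {
        requireNonNegative(monthlyCalls);
        return monthlyCalls / mode.minCalls;
    }

    private static void requireNonNegative(int monthlyCalls) {
        if (monthlyCalls < 0) {
            throw new IllegalArgumentException("monthlyCalls must not be negative");
        }
    }

    public static void main(String[] args) {
        int freeTierCalls = 800; // monthly quota quoted in this article
        for (Mode mode : Mode.values()) {
            System.out.printf("%s: between %d and %d images per month%n",
                    mode.apiValue,
                    worstCaseImages(mode, freeTierCalls),
                    bestCaseImages(mode, freeTierCalls));
        }
    }
}
```

Under these assumptions, the default 'Advanced' mode covers between 26 and 28 images per month on the free quota, while 'Basic' covers between 400 and 800, so it may be worth choosing the mode per document rather than always using the default.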

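Since every request counts against the quota, the image file parameter can be checked locally before it is uploaded. The following is a minimal sketch of such a check, not something the SDK provides: `OcrInputCheck` and its methods are hypothetical, and the extension list is an assumption that covers only the formats named in this article (the API may accept others).

```java
import java.io.File;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical pre-flight check for OCR input files; not part of the Cloudmersive SDK. */
final class OcrInputCheck {

    // Assumption: only the formats named in this article; the API may accept others.
    private static final Set<String> KNOWN_EXTENSIONS = Set.of("png", "jpg", "jpeg");

    private OcrInputCheck() {}

    /** Returns the lower-cased extension of a file name, or "" if it has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True if the file name ends in one of the extensions assumed above. */
    static boolean hasKnownImageExtension(String fileName) {
        return KNOWN_EXTENSIONS.contains(extensionOf(fileName));
    }

    /** Returns a description of the first problem found, or empty if the file looks usable. */
    static Optional<String> findProblem(File file) {
        if (!file.isFile()) {
            return Optional.of("Not an existing file: " + file.getPath());
        }
        if (!file.canRead()) {
            return Optional.of("File is not readable: " + file.getPath());
        }
        if (file.length() == 0) {
            return Optional.of("File is empty: " + file.getPath());
        }
        if (!hasKnownImageExtension(file.getName())) {
            return Optional.of("Unrecognized image extension: " + file.getName());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Pass the image path as the first argument; the default is a placeholder.
        File imageFile = new File(args.length > 0 ? args[0] : "/path/to/inputfile");
        Optional<String> problem = findProblem(imageFile);
        System.out.println(problem.orElse("Ready for OCR: " + imageFile.getPath()));
    }
}
```

A caller could run `findProblem` on the file and only invoke `imageOcrPost` when the result is empty, which avoids spending API calls on a missing or empty upload.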

Source: http://codingdict.com