Probabilistic retrieval of OCR degraded text using N-grams

  • Information Retrieval II
  • Conference paper
  • First Online: 01 January 2005
  • pp 345–359
Research and Advanced Technology for Digital Libraries (ECDL 1997)
  • S. M. Harding1,
  • W. B. Croft1 &
  • C. Weir2 

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1324)

Included in the following conference series:

  • International Conference on Theory and Practice of Digital Libraries
  • 212 Accesses

  • 42 Citations

Abstract

This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data at degradation levels of 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, is also described; it too performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query terms highlighted.
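
To make the query-expansion idea concrete, the following is a minimal sketch of how character n-grams might be used to find near-matching terms for a query word in an OCR-degraded vocabulary. The abstract does not say how term similarity is measured, so the Dice coefficient used here is an assumption, as are the function names, the 0.5 threshold, and the toy vocabulary. It is an illustration of the general technique, not the authors' implementation.

```python
def char_ngrams(term, sizes=(2, 3)):
    """Return the set of character n-grams of `term` for each size in `sizes`."""
    term = term.lower()
    return {term[i:i + n] for n in sizes for i in range(len(term) - n + 1)}


def dice_similarity(a, b, sizes=(2, 3)):
    """Dice coefficient between the n-gram sets of two terms (0.0 to 1.0).

    Assumption: the paper's actual similarity measure is not given in the
    abstract; Dice is used here only as a common, simple choice.
    """
    grams_a, grams_b = char_ngrams(a, sizes), char_ngrams(b, sizes)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def expand_query_term(term, vocabulary, threshold=0.5, sizes=(2, 3)):
    """Return vocabulary terms whose n-gram overlap with `term` meets
    `threshold`, ordered from most to least similar."""
    scored = [(dice_similarity(term, v, sizes), v) for v in vocabulary]
    return [v for score, v in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical index vocabulary containing OCR-corrupted variants.
vocabulary = ["retrieval", "retrieva1", "rctrieval", "document"]
print(expand_query_term("retrieval", vocabulary))
# → ['retrieval', 'retrieva1', 'rctrieval']
```

Because a single misrecognized character disturbs only the few n-grams that span it, the corrupted variants still share most of their n-grams with the clean term, while unrelated words share none. The same property is what lets direct n-gram queries tolerate degraded text.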

Author information

Authors and Affiliations

  1. CIIR, University of Massachusetts, 01003, Amherst, MA, USA

    S. M. Harding & W. B. Croft

  2. Lockheed Martin C2 Systems, 19355, Frazer, PA, USA

    C. Weir

Editor information

Carol Peters, Costantino Thanos

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Harding, S.M., Croft, W.B., Weir, C. (1997). Probabilistic retrieval of OCR degraded text using N-grams. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026737

  • DOI: https://doi.org/10.1007/BFb0026737

  • Published: 17 June 2005

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63554-3

  • Online ISBN: 978-3-540-69597-4

  • eBook Packages: Springer Book Archive

Keywords

  • Average Precision
  • Document Image
  • Retrieval Performance
  • Query Term
  • Query Expansion

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
