Master's Seminar: Document Recognition and Retrieval

Teachers:

Denis Lalanne & Rolf Ingold

Abstract

Documents play an important role in everyday communication. With the ever-increasing use of the Web, a growing number of documents are published and accessed online. Unfortunately, document structures are rarely taken into account, which considerably weakens users' browsing and searching experience.

A document has many levels of abstraction, conveyed by its various structures: thematic, physical, logical, relational, or even temporal. Most search engines and information retrieval systems do not take this multi-layered structure into account; documents are indexed, at best, according to their thematic structure, or are simply represented as a bag of words. The form of a document, i.e. its layout and logical structures, is underestimated, yet it can carry important clues about how the document is organized, which could drastically improve indexing and retrieval.

We believe that extracting the various document structures will improve (a) document indexing and retrieval and (b) linking with other media. In particular, this seminar will show how documents can be integrated into multimedia and multimodal applications, and how document-based interfaces can improve searching and retrieval in multimedia databases.

Themes

Document visualization
Thematic/Topic segmentation of documents
Document: Information/Content extraction and indexing
Document modeling and formatting
Document physical and logical structure extraction
Document classification

An introductory document will be distributed to participants:
- Rolf Ingold, "Analyse et reconnaissance d'images de documents", Collection Techniques de l'Ingénieur, Traité informatique, vol H-7020.
A folder containing the articles listed below (for each theme) is available in the DIUF students' library (grey binder, on the shelf opposite the office of Karim Hadjar and Maurizio Rigamonti).

When and what

8th April 2004: kick-off, presentation: Rolf Ingold, Denis Lalanne.
At least 4 other afternoons, either Thursday or Friday. Dates will be decided on the 8th.
Various internal speakers will present their work or views:
- (Alphabetically) Ardhendu Behera, Karim Hadjar, Dalila Mekhaldi, Maurizio Rigamonti, etc.

Language

Presentations will be in French, German or English.

Contact

Rolf Ingold (AT) UniFr
Denis Lalanne (AT) UniFr

References

Document visualization

Card, S.K., G.G. Robertson, and W. York. 1996. The WebBook and the WebForager: an information workspace for the World Wide Web, CHI 96, ACM Conference on Human Factors in Computing Systems, ACM Press, New York. 111-117.

Marti Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 59-66, Denver, CO, May 1995.

Perlin, K. and D. Fox. 1993. Pad: an alternative approach to the computer interface, Proceedings of 1993 ACM SIGGRAPH Conference, 57-64.

Thematic/Topic segmentation of documents

Salton G., Singhal A., Buckley C. and Mitra M. "Automatic Text Decomposition Using Text Segments and Text Themes". In Proceedings of the Hypertext '96 Conference, USA.

M. Hearst, TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), 33-64, March 1997.

David M. Blei, Pedro J. Moreno. Topic Segmentation with an Aspect Hidden Markov Model. SIGIR 2001: New Orleans, Louisiana, USA.

Document: Information/Content extraction and indexing

S. Marinai, E. Marino, F. Cesarini and G. Soda. "A General System for the Retrieval of Document Images from Digital Libraries", DIAL2004, International Workshop on Document Image Analysis for Libraries, Palo Alto Research Center (PARC), Palo Alto, CA, USA, 2004.

K. Hadjar, M. Rigamonti, D. Lalanne and R. Ingold. "Xed: a New Tool for eXtracting Hidden Structures from Electronic Documents", DIAL2004, International Workshop on Document Image Analysis for Libraries, Palo Alto Research Center (PARC), Palo Alto, CA, USA, 2004.

X. Lin, S. J. Simske. "Automatic document navigation for digital content remastering", Document Recognition and Retrieval XI, IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, San Jose, CA, USA.

Document modeling and formatting

Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird. "Reflowable Document Images", in "Web Document Analysis, Challenges and Opportunities", World Scientific. Editors Apostolos Antonacopoulos and Jianying Hu.

Steven Bagley, David Brailsford, Matthew Hardy. "Creating reusable, well-structured, PDF as a sequence of Component Object Graphic (COG) elements". DocEng03, The ACM Symposium on Document Engineering, Grenoble, France, 2003.

Lisa Purvis, Steven Harrington, Barry O'Sullivan. "Creating Personalized Documents: An Optimization Approach". DocEng03, The ACM Symposium on Document Engineering, Grenoble, France, 2003.

Julius Mong, David Brailsford. "Some experiments in using SVG as the rendering model for structured and graphically complex Web material". DocEng03, The ACM Symposium on Document Engineering, Grenoble, France, 2003.

Fateh Boulmaiz, Cecile Roisin, Frederic Bes. "Improving formatting documents by coupling formatting systems". DocEng03, The ACM Symposium on Document Engineering, Grenoble, France, 2003.

Document physical and logical structure extraction

O. Altamura, F. Esposito and D. Malerba (2000). Transforming Paper Documents into XML Format with WISDOM++, International Journal of Document Analysis and Recognition, Springer Verlag, 3(2), 175-198.

Andreas Dengel, Bertin Klein: smartFIX: A Requirements-Driven System for Document Analysis and Understanding. Document Analysis Systems 2002: 433-444

C. Shin, D. Doermann, and A. Rosenfeld, "Classification of document pages using structure-based features," International Journal on Document Analysis and Recognition, vol. 3, no. 4, pp. 232-247, 2001.

M. Aiello, C. Monz, Leon Todoran et al., "Document Understanding for a broad class of Documents", IJDAR 2002, Vol 5, pp 1-16

Document classification

J. Hu, R. Kashi, G. Wilfong. "Comparison and Classification of Documents Based on Layout Similarity", Int. journal for Information Retrieval, Vol. 2, No. 2, May 2000, pp 227-243

M. Diligenti, P. Frasconi, M. Gori. "Hidden Tree Markov Models for Document Image Classification", IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol. 25, No. 4, April 2003, pp. 519-523

D. Doermann. The indexing and retrieval of document images: A survey. Technical Report CS-TR3876, University of Maryland, Computer Science Department, February 1998.