Feeding large language models with high-quality multi-modal data from documents

Join us for our upcoming Future Computing Seminar Series

Speaker: Dr. Peter Staar, IBM Research

Date: June 28th, 2023, 14:00 CET

Where: ETZ E7

Abstract:

Since the introduction of ChatGPT, the AI community has focused intensely on expanding the capabilities of foundational models. A key ingredient for well-trained foundational models is access to large quantities of high-quality data. In this talk, we will present how Deep Search [1] helps with that, in particular with regard to the extraction of text, tables and figures from documents in highly technical domains. This extraction requires state-of-the-art AI methods to efficiently and accurately determine the document layout [2] and, in some cases, the structure of its sub-components (e.g., tables [3,4]). We will also provide an outlook on what is next for LLM applications on large document collections.

[1] https://ds4sd.github.io/
[2] DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation (KDD 2022)
[3] TableFormer: Table Structure Understanding with Transformers (CVPR 2022)
[4] Optimized Table Tokenization for Table Structure Recognition (ICDAR 2023)

Speaker Bio:

Peter currently manages the 'AI for Knowledge' group at the IBM Research - Zurich Laboratory. The group focusses on the development of the Deep Search platform, which consists of cloud-native services that ingest large corpora of technical documents and extract the knowledge contained in them. Peter joined the IBM Research - Zurich Laboratory in July 2014 as a post-doctoral researcher. The Belgium-born scientist first came to IBM Research as a summer student in 2006. Prior to joining IBM Research, he was a post-doctoral researcher in Theoretical Physics and PASC (Platform for Advanced Scientific Computing) at the Swiss Federal Institute of Technology (ETH) in Zurich, Switzerland. He earned his PhD in Theoretical Physics and his M.Sc. degree in Physics at ETH Zurich in 2013 and 2009, respectively, and his B.S. degree in Physics (cum laude) from KU Leuven, Belgium. Peter has twice been a finalist for the prestigious ACM Gordon Bell Prize, first in 2013 for his paper entitled 'Taking a Quantum Leap in Time to Solution for Simulations of High-Tc Superconductors' and then in 2015 for his paper entitled 'An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle'. The latter paper won the Gordon Bell Prize. Other significant academic achievements include the 'Best Paper Award' at IPDPS 2016 (for novel, linear-scaling graph analytics) and the 'Applied AI Application Award' at IAAI 2021 (for novel PDF document conversion ML models).