Focus Group: Data Analytics

(Prof. Gemulla)

Our group's research focuses on systems and methods for analyzing and mining large datasets as well as their application in practice. Our research interests include:

  • Data analysis and data mining
  • Text mining and information extraction
  • Optimization
  • Approximation techniques
  • Algorithms for modern hardware

News

People

Former PhD students

Kaustubh Beedkar, Luciano del Corro, Stefan Kain, Faraz Makari Manshadi, Christina Teflioudi, Yanjie Wang

Data and Software

  • Lapse: A parameter server with dynamic parameter allocation
  • LibKGE: A knowledge graph embedding library
  • OPIEC: An open information extraction corpus
  • MinIE: Open information extractor (spiritual successor to ClausIE)
  • DSGDpp: Various parallel algorithms for matrix factorization (including DSGD++)
  • DESQ: Frequent sequence mining with subsequence constraints
  • Rounding rank: Algorithms for computing rounding-rank decompositions
  • CORE: Context-aware open relation extraction with factorization machines
  • FINET: Context-aware fine-grained named entity typing
  • Werdy: Recognition and Disambiguation of Verbs and Verb Phrases with Syntactic and Semantic Pruning
  • ClausIE: Clause-Based Open Information Extraction
  • LEMP: Fast Retrieval of Large Entries in a Matrix Product
  • LASH: Large-Scale Sequence Mining with Hierarchies
  • MG-FSM: Large-Scale Frequent Sequence Mining

Teaching

If you are interested in writing a seminar paper or a Bachelor's or Master's thesis with us, please read the following guidelines.

Current semester (FSS 2023)

Previous semester (HWS 2022)

    Publications

    See also Google Scholar and DBLP.

    2022   A. Kochsiek, F. Niesel, R. Gemulla
    Start Small, Think Big: On Hyperparameter Optimization for Large-Scale Knowledge Graph Embeddings [pdf, resources]
    To appear in ECML-PKDD, 2022
     A. Saxena, A. Kochsiek, R. Gemulla
    Sequence-to-Sequence Knowledge Graph Completion and Question Answering [pdf, resources]
    In ACL, pp. 2814–2828, 2022
     A. Renz-Wieland, R. Gemulla, Z. Kaoudi, V. Markl
    NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access [pdf, source code]
    In SIGMOD, pp. 481–495, 2022
    2021   A. Kochsiek, R. Gemulla
    Parallel Training of Knowledge Graph Embedding Models: A Comparison of Techniques [pdf, resources]
    In PVLDB, 15(3), 2021
     A. Renz-Wieland, T. Drobisch, R. Gemulla, Z. Kaoudi, V. Markl
    Just Move It! Dynamic Parameter Allocation in Action [pdf, demo]
    In PVLDB (demo), 14(12), 2021
    2020   A. Renz-Wieland, R. Gemulla, S. Zeuch, V. Markl
    Dynamic Parameter Allocation in Parameter Servers [pdf, source code]
    In PVLDB, 13(12), pp. 1877–1890, 2020
     S. Broscheit, K. Gashteovski, Y. Wang, R. Gemulla
    Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction [pdf, resources]
    In ACL, 2020
     D. Ruffinelli, S. Broscheit, R. Gemulla
    You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings [pdf, video, resources, OpenReview]
    In ICLR, 2020
     S. Broscheit, D. Ruffinelli, A. Kochsiek, P. Betz, R. Gemulla
    LibKGE – A knowledge graph embedding library for reproducible research [pdf, source]
    In EMNLP (demo), 2020
     K. Gashteovski, R. Gemulla, B. Kotnis, S. Hertling, C. Meilicke
    On Aligning OpenIE Extractions with Knowledge Bases: A Case Study [pdf, slides, resources]
    In Eval4NLP, 2020
    2019   Y. Wang, D. Ruffinelli, R. Gemulla, S. Broscheit, C. Meilicke
    On Evaluating Embedding Models for Knowledge Base Completion [pdf]
    In RepL4NLP workshop, 2019
     K. Beedkar, R. Gemulla, W. Martens
    A Unified Framework for Frequent Sequence Mining with Subsequence Constraints [pdf (journal version), pdf (author version), resources]
    In TODS, 2019
     K. Gashteovski, S. Wanner, S. Hertling, S. Broscheit, R. Gemulla
    OPIEC: An Open Information Extraction Corpus [pdf, poster, resources, OpenReview]
    In AKBC, 2019
     A. Renz-Wieland, M. Bertsch, R. Gemulla
    Scalable Frequent Sequence Mining With Flexible Subsequence Constraints [pdf, poster]
    In ICDE, 2019
    Preprints (2019)   Y. Wang, S. Broscheit, R. Gemulla
    A Relational Tucker Decomposition for Multi-Relational Link Prediction [arXiv]
    2019
    2018   C. Meilicke, M. Fink, Y. Wang, D. Ruffinelli, R. Gemulla, H. Stuckenschmidt
    Fine-grained Evaluation of Rule- and Embedding-based Systems for Knowledge Graph Completion [pdf, resources]
    In ISWC, 2018
     J. Pfeiffer, S. Broscheit, R. Gemulla, M. Göschl
    A Neural Autoencoder Approach for Document Ranking and Query Refinement in Pharmacogenomic Information Retrieval [pdf]
    In BioNLP workshop, 2018
     S. Broscheit, R. Gemulla, M. Keuper
    Learning Distributional Token Representations from Visual Features [pdf]
    In RepL4NLP workshop, 2018
     Y. Wang, R. Gemulla, H. Li
    On Multi-Relational Link Prediction with Bilinear Models [pdf, resources]
    In AAAI, 2018
    2017   K. Gashteovski, R. Gemulla, L. Del Corro
    MinIE: Minimizing Facts in Open Information Extraction [pdf, poster, resources]
    In EMNLP, pp. 2620–2630, 2017
     C. Teflioudi, R. Gemulla
    Exact and Approximate Maximum Inner Product Search with LEMP [pdf (journal version), pdf (author version), resources]
    In TODS, 42(1) Art. 5, 2017
    2016   S. Neumann, R. Gemulla, P. Miettinen
    What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank [pdf, tech report, resources]
    In ICDM, pp. 380–389, 2016
     K. Beedkar, R. Gemulla
    DESQ: Frequent Sequence Mining with Subsequence Constraints [pdf, tech report, resources]
    In ICDM (short paper), pp. 793–798, 2016
    2015   L. Del Corro, A. Abujabal, R. Gemulla, G. Weikum
    FINET: Context-Aware Fine-Grained Named Entity Typing [pdf, slides, resources]
    In EMNLP, pp. 868–878, 2015
     F. Petroni, L. Del Corro, R. Gemulla
    CORE: Context-Aware Open Relation Extraction with Factorization Machines [pdf, slides, resources]
    In EMNLP, pp. 1763–1773, 2015
     K. Beedkar, K. Berberich, R. Gemulla, I. Miliaraki
    Closing the Gap: Sequence Mining at Scale [pdf (journal version), pdf (author version), resources]
    In TODS, 40(2) Art. 8, 2015
     C. Teflioudi, R. Gemulla, O. Mykytiuk
    LEMP: Fast Retrieval of Large Entries in a Matrix Product [pdf, slides, resources]
    In SIGMOD, pp. 107–122, 2015
     K. Beedkar, R. Gemulla
    LASH: Large-Scale Sequence Mining with Hierarchies [pdf, slides, source code]
    In SIGMOD, pp. 491–503, 2015
     R. Gemulla
    A Self-Portrayal of GI Junior Fellow Rainer Gemulla: Data Analysis at Scale [pdf (journal version), pdf (author version)]
    it – Information Technology 57(2), pp. 130–132, 2015
    2014   L. Del Corro, R. Gemulla, G. Weikum
    Werdy: Recognition and Disambiguation of Verbs and Verb Phrases with Syntactic and Semantic Pruning [pdf, resources]
    In EMNLP, pp. 374–385, 2014
     P. Roy, J. Teubner, R. Gemulla
    Low-Latency Handshake Join [pdf]
    In PVLDB, 7(9), pp. 709–720, 2014
     L. Qu, Y. Zhang, R. Wang, L. Jiang, R. Gemulla, G. Weikum
    Senti-LSSVM: Sentiment-Oriented Multi-Relation Extraction with Latent Structural SVM [pdf]
    In TACL, 2, pp. 155–168, 2014
     D. Erdös, R. Gemulla, E. Terzi
    Reconstructing Graphs from Neighborhood Data [pdf (author version), pdf (journal version)]
    In TKDD, 8(4), 2014
    2013   F. Makari, C. Teflioudi, R. Gemulla, P. J. Haas, Y. Sismanis
    Shared-Memory and Shared-Nothing Stochastic Gradient Descent Algorithms for Matrix Completion [pdf (author version), pdf (journal version), source code]
    In KAIS (special issue: best papers of ICDM 2012), pp. 1–31, 2013
     F. Makari, R. Gemulla
    A Distributed Approximation Algorithm for Mixed Packing-Covering Linear Programs [pdf]
    In NIPS 2013 Biglearn workshop (poster), 2013
     F. Makari, B. Awerbuch, R. Gemulla, R. Khandekar, J. Mestre, M. Sozio
    A Distributed Algorithm for Large-Scale Generalized Matching [pdf, slides]
    The analysis of the number of binary search steps (Lemma 2) contains a bug; see our Biglearn paper for a corrected version.
    In PVLDB, 6(9), pp. 613–624, 2013
     I. Miliaraki, K. Berberich, R. Gemulla, S. Zoupanos
    Mind the Gap: Large-Scale Frequent Sequence Mining [pdf, slides, resources]
    In SIGMOD, pp. 797–808, 2013
     L. Del Corro, R. Gemulla
    ClausIE: Clause-Based Open Information Extraction [pdf, slides, resources]
    In WWW, pp. 355–366, 2013
     R. Gemulla, P. J. Haas, W. Lehner
    Non-Uniformity Issues and Workarounds in Bounded-Size Sampling [pdf (author version), pdf (journal version), source code]
    In The VLDB Journal, 22(6), pp. 753–772, 2013
     K. Beedkar, L. Del Corro, R. Gemulla
    Fully Parallel Inference in Markov Logic Networks [pdf]
    In BTW, pp. 205–224, 2013
    2012   D. Erdös, R. Gemulla, E. Terzi
    Reconstructing Graphs from Neighborhood Data [pdf, slides]
    In ICDM, pp. 231–240, 2012
     C. Teflioudi, F. Makari, R. Gemulla
    Distributed Matrix Completion [pdf, slides, source code]
    In ICDM, pp. 655–664, 2012
     L. Qu, R. Gemulla, G. Weikum
    A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts [pdf]
    In EMNLP-CoNLL, pp. 149–159, 2012
    2011   R. Gemulla, P. J. Haas, Y. Sismanis, C. Teflioudi, F. Makari
    Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent [pdf, slides, source code]
    In NIPS 2011 Biglearn workshop, 2011 (best paper award)
     R. Gemulla, E. Nijkamp, P. J. Haas, Y. Sismanis
    Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent [pdf, slides, source code]
    In KDD, pp. 69–77, 2011
     K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.C. Kanne, F. Özcan, E. Shekita
    Jaql: A Scripting Language for Large Scale Semistructured Data Analysis [pdf]
    In PVLDB (industrial track), 4(11), pp. 1272–1283, 2011
     M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, J. McPherson
    CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop [pdf]
    In PVLDB, 4(9), pp. 575–585, 2011
     R. Gemulla, P. J. Haas, E. Nijkamp, Y. Sismanis
    Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent [pdf]
    IBM Research Report RJ10481, March 2011, revised February 2013
     B. Schlegel, R. Gemulla, W. Lehner
    Memory-Efficient Frequent-Itemset Mining [pdf]
    In EDBT, pp. 461–472, 2011
    2010   S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, J. McPherson
    Ricardo: Integrating R and Hadoop [pdf]
    In SIGMOD (industrial track), pp. 987–998, 2010
     B. Schlegel, R. Gemulla, W. Lehner
    Fast Integer Compression using SIMD Instructions [pdf]
    In DAMON, pp. 34–40, 2010
    2009   K. Beyer, R. Gemulla, P. J. Haas, B. Reinwald, Y. Sismanis
    Distinct-Value Synopses for Multiset Operations [pdf, technical perspective by Surajit Chaudhuri]
    In Commun. ACM, 52(10), pp. 87–95, 2009
     B. Schlegel, R. Gemulla, W. Lehner
    k-Ary Search on Modern Processors [pdf, slides]
    In DAMON, pp. 52–60, 2009
    2008   R. Gemulla
    Sampling Algorithms for Evolving Datasets [pdf, summary, slides]
    Ph.D. thesis, Technische Universität Dresden, 2008
    URL for citations: nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644
     R. Gemulla, P. Rösch, W. Lehner
    Linked Bernoulli Synopses: Sampling Along Foreign Keys [pdf, slides]
    In SSDBM, pp. 6–23, 2008
     R. Gemulla, W. Lehner
    Sampling Time-Based Sliding Windows in Bounded Space [pdf, slides]
    As observed by Hu et al., the lower bound of Ω(k log N) stated in Theorem 1 should read Ω(k log(N/k)).
    In SIGMOD, pp. 379–392, 2008
     P. Rösch, R. Gemulla, W. Lehner
    Designing Random Sample Synopses with Outliers [pdf, poster]
    In ICDE (poster), pp. 1400–1402, 2008
    2007   R. Gemulla, W. Lehner, P. J. Haas
    Maintaining Bounded-Size Sample Synopses of Evolving Datasets [pdf]
    The resizing algorithm proposed in this article contains a bug; see my Ph.D. thesis or our 2013 VLDB Journal paper for a corrected version.
    In The VLDB Journal, Special Issue: Best Papers of VLDB 2006, pp. 173–201, 2007
     K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, R. Gemulla
    On Synopses for Distinct-Value Estimation Under Multiset Operations [pdf, slides]
    In SIGMOD, pp. 199–210, 2007
     R. Gemulla, W. Lehner, P. J. Haas
    Maintaining Bernoulli Samples over Evolving Multisets [pdf, slides]
    In PODS, pp. 93–102, 2007
    2006   R. Gemulla, W. Lehner, P. J. Haas
    A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets [pdf, slides]
    In VLDB, pp. 595–606, 2006
     A. Klein, R. Gemulla, P. Rösch, W. Lehner
    Derby/S: A DBMS for Sample-Based Query Answering [pdf, poster1, poster2]
    In SIGMOD (demo), pp. 757–759, 2006
     R. Gemulla, W. Lehner
    Deferred Maintenance of Disk-Based Random Samples [pdf, slides]
    In EDBT, pp. 423–441, 2006