AI-powered Webcrawler for Business Information Extraction

The Supply Chain Due Diligence Act (Lieferkettensorgfaltspflichtengesetz, LkSG) is a German law designed to improve the protection of human rights and the environment in global supply chains. It came into force on 1 January 2023.

Under the act, companies must gather detailed information about all of their suppliers in order to assess each supplier's risk of non-compliance. This information includes data points such as locations, owners, key responsible people, the industries in which a company operates, revenue, number of employees, and the certificates it holds. However, this information is often only available in unstructured form on the companies' websites and has to be processed further before it can be used.

The objective of this project is to design, implement, and deploy an AI-powered webcrawler that navigates business websites hierarchically (from the landing page down through its subpages), identifies relevant data points, and uses NLP techniques to extract key information about the company's operations and structure. The webcrawler should handle a wide variety of websites and deliver accurate, comprehensive results. The end result should be the structured supplier information described in the introduction.
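To illustrate what hierarchical navigation could look like, the following is a minimal sketch of a breadth-first crawl that stays on the company's own host and stops at a fixed depth. It is not a prescribed design: the `crawl` function, the `LinkParser` class, the `max_depth` parameter, and the injected `fetch` callable are all assumptions made for this example, and a real crawler would additionally need politeness rules (robots.txt, rate limiting) and JavaScript rendering.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical callable that maps a URL to its HTML text,
    or None if the page cannot be retrieved. Returns a dict mapping each
    visited URL to a (depth, html) tuple.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = (depth, html)
        if depth == max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic independent of the HTTP client, so it can be exercised against an in-memory dictionary of pages before being pointed at real websites.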

Key Deliverables:

a. AI-powered Webcrawler Prototype: Develop a functional webcrawler that scans company websites and uses NLP technologies to extract the desired information (industry, ISO certificates, top management, branch locations, products, services, customers, and subsidiaries).

b. Data Processing and Visualization: Implement a data processing pipeline that cleans, organizes, and presents the extracted data in a user-friendly format for further analysis and visualization.

c. NLP Processing of the Crawled Data: Utilize advanced NLP techniques to process the crawled data and structure it into a pre-processed format suitable for subsequent steps.

d. Neural Network-based Categorization: Employ neural networks to execute categorizations and enhance the accuracy and efficiency of information classification.
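As a rough illustration of how crawled text might be turned into a structured record, the sketch below extracts ISO certificate mentions with a regular expression and serializes the result as JSON. This is a simple rule-based stand-in, not the NLP or neural network approach the deliverables call for, and the `CompanyRecord` schema, its field names, and the helper functions are hypothetical.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# Matches mentions such as "ISO 9001", "ISO 14001:2015" or "ISO/IEC 27001"
# and captures only the standard number.
ISO_PATTERN = re.compile(r"\bISO(?:/IEC)?\s*(\d{4,5})\b")


@dataclass
class CompanyRecord:
    """Hypothetical target schema; the field names are illustrative only."""
    name: str
    source_url: str
    iso_certificates: list = field(default_factory=list)


def extract_iso_certificates(text):
    """Return the de-duplicated ISO standards in text, ordered by number."""
    numbers = {int(n) for n in ISO_PATTERN.findall(text)}
    return [f"ISO {n}" for n in sorted(numbers)]


def build_record(name, source_url, text):
    """Combine the extracted data points into one structured record."""
    return CompanyRecord(
        name=name,
        source_url=source_url,
        iso_certificates=extract_iso_certificates(text),
    )


def to_json(record):
    """Serialize a record for storage or later visualization."""
    return json.dumps(asdict(record), indent=2)
```

Other data points such as top management or subsidiaries are much harder to capture with fixed patterns, which is where named entity recognition and learned classifiers would come in.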

Required Skills:

This project demands a diverse set of skills, including:

  • Proficiency in a programming language such as Python or JavaScript for web development and data processing tasks.
  • Understanding of NLP tools and their processing capabilities.
  • Basic knowledge of neural networks and their application in information classification.
  • Knowledge of web crawling techniques, data scraping, and handling various web formats (HTML, XML, etc.).
  • Familiarity with databases and data management to store and retrieve extracted information.
  • Ability to work in a team, manage project timelines, and collaborate effectively.