Computer vision and the data-centric AI pipeline

Data-centric AI (DCAI) is an approach to AI that focuses on the quality and relevance of data in order to improve machine learning models and overall system performance.

This paradigm shift is also reaching the field of computer vision. To build the best possible solutions, the CONET Data Analytics and AI team integrates DCAI into its computer vision projects.

In this article, we explain the concept of DCAI, how it differs from the classic, model-centric development process for AI solutions, and what this means for computer vision. Finally, we present a standardized DCAI pipeline.


What is data-centric AI?

Data-centric AI is an approach in artificial intelligence and machine learning (ML) that focuses on improving the quality, relevance and cleanliness of the data used to train AI models. In contrast to model-centric approaches, which aim to develop ever more complex algorithms, DCAI treats the data itself as a key factor in AI performance. By improving the foundation on which AI systems are built, it enables more efficient and more accurate systems.

Why data-centric AI in computer vision?

Data quality also plays a decisive role in computer vision projects. A diverse and carefully prepared data set enables models to generalize effectively and to identify a wide variety of patterns. A data-centric approach is therefore well suited to actively counteracting bias in computer vision systems. Bias here means skewed or unrepresentative data that leads a machine learning system to treat some cases or groups systematically differently from others.

This can lead to algorithmic discrimination, where systems in certain applications, such as facial recognition, produce errors or unfair outcomes. This is particularly the case if the training data is not diverse and comprehensive. Research, for example from the FATE (Fairness, Accountability, Transparency, and Ethics in AI) group at Microsoft Research, emphasizes the importance of combating bias in visual data for fair and ethical AI practices. Bias can be identified and corrected through in-depth data analysis and careful data preparation, which helps make AI applications fair for different population groups.

What exactly can the DCAI approach optimize in computer vision projects?

  • Improving model accuracy: DCAI techniques improve data quality by ensuring that the data set contains a wide variety of images, for example covering different lighting conditions and perspectives. This diversity enables models to interpret image content more accurately and to recognize complex patterns with greater precision, which improves their ability to generalize.
  • Identification and reduction of bias: The selection of data for a (visual) data set can introduce bias, for example if certain company processes or social groups are not sufficiently represented in the data. DCAI can be used to identify and mitigate such bias.
  • Optimized model robustness: Targeted optimization of data preparation makes models more robust. They become more effective in real-world applications and cope better with both diverse data and changes in the data over time.
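
To make the bias point more concrete, the following is a minimal sketch of a representation check. It assumes that each image comes with a metadata record (a plain dictionary here); the function name `representation_report`, the attribute name passed to it and the `min_share` threshold are illustrative assumptions, not part of a specific tool.

```python
from collections import Counter


def representation_report(metadata, attribute, min_share=0.1):
    """Report each attribute value's share of the data set.

    metadata:  list of dicts, one per image (hypothetical format)
    attribute: metadata key to inspect, e.g. a lighting condition
    min_share: assumed threshold below which a value counts as
               underrepresented
    """
    counts = Counter(record[attribute] for record in metadata)
    total = sum(counts.values())
    return {
        value: {
            "share": n / total,
            "underrepresented": n / total < min_share,
        }
        for value, n in counts.items()
    }
```

Called with a list of records and an attribute such as "lighting", the function returns one entry per attribute value. Values flagged as underrepresented are candidates for targeted data collection or augmentation. A suitable threshold depends on the application.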

Approaches to the implementation of data-centric AI in computer vision

Data quality also matters in a model-centric development approach, but under the DCAI principle it becomes the central concern. In the following, we present methods that are typically used in development with a DCAI approach.

  • Error detection and correction: Typical errors in computer vision datasets are incorrect or imprecise annotations. Methods such as cross-validation, consistency checks or the use of pre-trained models help to identify these errors so that they can be corrected. These procedures make a decisive contribution to increasing the quality of training data.
  • Data augmentation: Data augmentation includes the application of various transformation techniques such as rotation, mirroring or brightness adjustment to visual data. These methods generate additional variance in the data set. With data augmentation, the training data set in computer vision projects is expanded to include a more diverse and extensive selection of scenarios, thus increasing the generalization capability of the models.
  • Active learning: Active learning selects the most informative data points for annotation in order to improve the overall performance of the model. Data for which the model is uncertain is considered particularly informative. Frequently used active learning strategies include selective sampling, iterative refinement, uncertainty sampling and query by committee. More detailed information on the active learning process is available in a separate blog post.
  • Curriculum learning: Curriculum learning is based on the principle of ordering training data by difficulty, from simple to complex examples. This resembles the human learning process, in which more complicated tasks are approached step by step. This strategy increases the efficiency of the learning process.
  • Feature engineering and selection: In the context of computer vision, feature engineering refers to the identification and processing of significant features from image or video data to optimize the performance of AI models. Relevant attributes are extracted using techniques such as the Histogram of Oriented Gradients (HOG) or Convolutional Neural Networks (CNNs). These steps are crucial to reduce the dimensionality of the data and thus make the training of AI models more efficient.
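
Several of the methods above can be sketched in a few lines of plain Python. The following is an illustrative sketch, not a production implementation. It assumes grayscale images stored as nested lists of pixel values (0 to 255) and class probabilities that come from some already trained model. All function names, the data layout and the 0.9 confidence threshold are assumptions chosen for the example.

```python
import math


def horizontal_flip(image):
    """Data augmentation: mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def adjust_brightness(image, factor):
    """Data augmentation: scale pixel values, clamped to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def entropy(probs):
    """Shannon entropy of a class distribution (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sampling(predictions, k):
    """Active learning: ids of the k samples with the highest entropy.

    predictions maps a sample id to its predicted class probabilities.
    """
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]),
                    reverse=True)
    return ranked[:k]


def flag_suspect_labels(labels, predictions, threshold=0.9):
    """Error detection: flag samples where the model confidently predicts
    a class other than the annotated one (threshold is an assumption)."""
    suspects = []
    for sid, label in labels.items():
        probs = predictions[sid]
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != label and probs[best] >= threshold:
            suspects.append(sid)
    return suspects


def curriculum_order(difficulty):
    """Curriculum learning: sample ids sorted from easiest to hardest.

    difficulty maps a sample id to a difficulty score (how that score
    is obtained is left open here).
    """
    return sorted(difficulty, key=difficulty.get)
```

In practice, augmentation is usually done with an image library rather than by hand, and flagged labels are sent back to a human annotator for review rather than corrected automatically. The sketch only shows the core idea behind each method.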

A typical data-centric AI pipeline

The integration of data-centric AI into computer vision projects follows a clear process. First, data is collected and carefully selected to cover realistic scenarios. The data is then analyzed and cleansed to prepare it for machine learning. Next, a baseline model is trained on the cleansed data set, followed by tests and evaluations. Continuous monitoring of data quality is essential to ensure the integrity of the AI model and to adapt it to new circumstances or data. In this way, a data-centric approach improves the entire development process and makes it more efficient.
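
The steps described above can be outlined as a small skeleton. This is a sketch under strong simplifying assumptions: the `Sample` record, the cleaning rules, the majority-class "model" and the `min_labeled_share` threshold are hypothetical placeholders that only mark where real collection, cleansing, training and monitoring logic would go.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    image_id: str
    label: Optional[str]  # None means the image has not been annotated


def clean(samples):
    """Cleansing: drop unlabeled samples and duplicate image ids."""
    seen, kept = set(), []
    for s in samples:
        if s.label is None or s.image_id in seen:
            continue
        seen.add(s.image_id)
        kept.append(s)
    return kept


def train_baseline(samples):
    """Toy stand-in for training: always predict the most frequent label."""
    majority = Counter(s.label for s in samples).most_common(1)[0][0]
    return lambda sample: majority


def evaluate(model, samples):
    """Evaluation: accuracy of the model on labeled samples."""
    correct = sum(model(s) == s.label for s in samples)
    return correct / len(samples)


def quality_ok(samples, min_labeled_share=0.9):
    """Monitoring: check the share of incoming samples that carry a label."""
    labeled = sum(s.label is not None for s in samples)
    return labeled / len(samples) >= min_labeled_share


def run_pipeline(raw_train, raw_test):
    """Cleanse both splits, train a baseline, and return it with its accuracy."""
    train, test = clean(raw_train), clean(raw_test)
    model = train_baseline(train)
    return model, evaluate(model, test)
```

In a real project, a failed `quality_ok` check on newly collected data would trigger another round of analysis and cleansing before the model is retrained, which is what makes the process iterative.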

