Data challenges in AI systems:

The data deluge has introduced several challenges in modern digitized societies. The lack of homogeneity that characterizes data production risks increasing the cost of discovery while also complicating quality control and management. To facilitate data interpretation, organization, and consumption by both humans and machines, the FAIR principles (Findable, Accessible, Interoperable, Reusable) for research data were developed. Data concerns the entire information value chain of Artificial Intelligence (AI) systems, from the creation and collection of data to its end use, whether by human agents acting on AI insight or by an automated AI-based decision system. Applying the FAIR principles in AI systems is required to mitigate issues such as data bias in learning algorithms.

The challenges of data in AI systems emerge from the privacy preservation, usage, and processing of collected datasets. It is difficult to strike a balance between the usefulness of the data analytics performed by an AI system and the granularity of the accountability mechanisms that govern data usage and processing. The four key data challenges for AI systems are as follows:

  1. Multiple data sources and input modalities: Real-world data sources span a diverse set of information formats, such as tabular data (age, color, sex), image data (e.g., for object identification), time-series data (commodity prices, weather records), unstructured sequence data (free-form text, forms, etc.), and structured sequence data (customer purchase history, web-browsing history, etc.). Feeding datasets into an AI system requires integrating data from multiple sources and from the input modalities listed above, which presents a challenge of dataset consolidation and standardization. This challenge is multiplied when the datasets must be used to train and test multiple AI algorithms, since that demands dataset compatibility (see the first sketch after this list).
  2. Human as data provider: Current AI systems are trained on vast quantities of data, but that data is produced and fed into the systems by humans, who are fallible and may not have accounted for every disparity when producing the datasets. For example, in the health domain, most AI applications focus on diagnostic tasks, such as detecting tuberculosis in lungs from radiology images. Most datasets produced to train such an algorithm capture the rules for detecting the lung condition of sick patients, because people typically visit a doctor only when they feel sick, so doctors hold records (in this case, x-rays) mainly of sick patients. The algorithm is thus trained on datasets of sick patients, while no dataset of healthy people is fed into it. This inherently builds a bias into the algorithm, as the AI model cannot capture how a healthy person differs. AI models trained on such non-inclusive datasets produce flawed outputs; humans, in short, are not generating good, inclusive datasets. Good-quality dataset collection should be incentivized, and it is important that the humans who collect the datasets are themselves diverse and understand the need for good-quality data collection and how it best supports the requirements of the AI algorithms (see the label-coverage sketch after this list).
  3. The challenge of data governance: Data governance as an overall concept is a challenge for AI systems. Once datasets are standardized, labelled with certain parameters, and fed into an AI algorithm, the definition of the original dataset is lost, because the data is transformed and processed according to the complexity of the algorithmic code. This is especially the case when datasets are pre-processed and used as training data for complex deep-learning models. Large AI language models such as GPT-3, for example, challenge existing data governance regimes, which were designed for simple data collection and data security and for compensating individual donors or data providers in straightforward cases of misuse. Because data is modulated and processed into complex forms before an output is realized, liability for an output cannot be traced directly back to the input, and so a strong liability judgement cannot be ascertained. This liability challenge is of particular concern and presents a risk, since large AI models can correlate behavior patterns and buying patterns with personal identities (see the provenance sketch after this list).
  4. The challenge of data-reuse: Data re-use can be restricted, and checks and balances can be put in place; but it is difficult to place guardrails on datasets that run through a complex deep-learning model, because processing, re-processing, and further processing take place internally within a 'black-box' environment, and the user cannot understand how the output was obtained. Thus, data control and usage for complex AI models that undergo multiple data iterations and reuse is not currently feasible and remains an open research challenge.
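
To make the consolidation problem in item 1 concrete, here is a minimal Python sketch that merges two input modalities, a tabular record and a time series, into a single training table. All names (patient_id, age, sex, heart_rate) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Tabular modality: one row per subject.
tabular = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
    "sex": ["F", "M", "F"],
})

# Time-series modality: many readings per subject. Standardize it by
# aggregating each series into fixed-size summary features.
readings = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "heart_rate": [72, 75, 88, 91, 64],
})
summary = (readings.groupby("patient_id")["heart_rate"]
           .agg(["mean", "max"])
           .add_prefix("heart_rate_")
           .reset_index())

# Consolidation step: join the modalities on a shared key so every
# source lines up against the same unit of analysis before training.
training_table = tabular.merge(summary, on="patient_id")
print(training_table)
```

Other modalities, such as images or free-form text, would be folded in the same way: each is first reduced to features keyed on the shared identifier before the join.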
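
The bias described in item 2 can often be caught before training by auditing label coverage. The following sketch flags classes that fall below an assumed minimum share; the labels and the 20% threshold are hypothetical.

```python
from collections import Counter

labels = ["sick", "sick", "sick", "sick", "sick", "healthy"]
counts = Counter(labels)
total = len(labels)

MIN_SHARE = 0.20  # assumed floor for each class; tune per use case
for cls, n in sorted(counts.items()):
    share = n / total
    flag = "" if share >= MIN_SHARE else "  <-- under-represented"
    print(f"{cls}: {n}/{total} ({share:.0%}){flag}")
```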
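
One partial mitigation for the governance problem in item 3 is to carry provenance alongside the data, so that the definition of the original dataset is not silently lost as the data is transformed. The wrapper below is a hypothetical sketch, not an existing governance API.

```python
from dataclasses import dataclass, field

@dataclass
class TrackedDataset:
    records: list
    provenance: list = field(default_factory=list)

    def transform(self, fn, description: str) -> "TrackedDataset":
        # Each transformation appends to the lineage instead of
        # silently replacing the original definition of the data.
        return TrackedDataset(
            records=[fn(r) for r in self.records],
            provenance=self.provenance + [description],
        )

raw = TrackedDataset(records=[4.0, 9.0, 16.0],
                     provenance=["collected from source A"])
normalized = raw.transform(lambda x: x / 16.0, "normalized to [0, 1]")
print(normalized.records)      # [0.25, 0.5625, 1.0]
print(normalized.provenance)   # full lineage back to the raw dataset
```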

Current excitement about AI systems is driven by the pervasive use of large AI models that are influencing humanity. Most large AI models are pre-trained on large datasets (image- or text-based) drawn from openly available sources such as the Internet. The data challenges in AI systems thus reflect the tension between privacy and knowledge, and the question of how to deliver the best value-for-effort performance in a trustworthy manner to an end user who stands at both ends of the value chain: data is created by humans or about humans, and data is consumed by humans or for humans. As AI models grow larger, the complexity of data will continue to bring new challenges.