Factory of the Future - Open data sources for industrial AI

When starting a new Factory of the Future project aimed at improving the capabilities of a production facility through artificial intelligence, the first question is usually: "Is it feasible?" Artificial intelligence in an industrial context requires a lot of data to train the underlying algorithms. Systems in operation generate data, but these data are often locked inside individual systems, or the databases are not connected. As a result, the company's own data may not be available to the team responsible for bringing AI to the company. Within the constraints of time and budget, the development team then faces the question of how to obtain training data.

It seems we like hearing about mistakes more than celebrating success. Although artificial intelligence technologies have transformed our lives, at home and at work, many recent media reports focus on the failures of smart devices, from disappointing gadgets exhibited at CES to faulty hotel robots. Some of the stories are very funny, but all they tell us is that the technology is still developing and some products are better designed than others.

Why are data sources essential for getting started with AI?

Predictive systems, fully automated systems and knowledge discovery systems all require well-formed data, and data quality determines the operational results that AI systems deliver. If you don't have enough useful data, training results are often mediocre: the model cannot build the abstractions it needs, and the resulting AI system cannot deliver outstanding results. Although some reinforcement learning methods don't need a lot of data, deep learning needs large amounts of labeled data.

Data process flow from training to operation

Step 1:

In a Factory of the Future context, you need to consider the general workflow of creating usable AI. In the first step, you need access to relevant historical data, which may take the form of files or databases containing the required information. In the world of data science, the complete collection of data available in the first step is called a data lake. The data lake contains both unstructured and structured data. 

In general, the data format in the data lake is raw. This means that incoming data from the different sources are not pre-processed. The data come from sensors and historical records. They include not only measured values but also images, video or audio.

Step 2:

In the next step, the data are pre-processed. They are examined and visualized so that subject matter experts can assess their quality. They can then be cleaned and reduced, so that the raw data become more meaningful.
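To make the cleaning and reduction step more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: missing readings are represented as None, outliers are detected with the median absolute deviation, and reduction is done by averaging blocks of consecutive readings. The function name, thresholds and sample values are hypothetical and not part of any specific product.

```python
from statistics import mean, median


def clean_and_reduce(readings, window=3, max_deviation=3.0):
    """Drop missing values and outliers, then downsample by block averaging."""
    # 1. Remove missing readings (assumed to be represented as None).
    present = [r for r in readings if r is not None]
    if not present:
        return []

    # 2. Remove outliers using the median absolute deviation (MAD),
    #    which is less sensitive to extreme values than the standard deviation.
    med = median(present)
    mad = median(abs(r - med) for r in present)
    if mad == 0:
        kept = present
    else:
        kept = [r for r in present if abs(r - med) / mad <= max_deviation]

    # 3. Reduce: average each block of `window` consecutive readings.
    return [mean(kept[i:i + window]) for i in range(0, len(kept), window)]


# Made-up temperature readings with two gaps and one obvious spike.
raw = [20.1, 20.3, None, 19.9, 250.0, 20.2, 20.0, None, 20.4]
reduced = clean_and_reduce(raw)
```

In a real project, the thresholds would be agreed with the subject matter experts who review the visualized data, since they know which extreme values are sensor faults and which are genuine events.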

These data form the basis for developing predictive models. Machine learning algorithms are generally applied to learn from the data. For example, a data scientist may choose neural network models, which must be validated after training on new, previously unseen data to verify that the training worked. The training phase includes several feedback cycles to check whether the training results match the requirements.
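The idea of validating on unseen data can be sketched in a few lines of Python. The example below deliberately uses a simple least-squares line as a stand-in for a neural network, and synthetic data in place of real sensor records. The variable names and the error threshold MAX_ALLOWED_ERROR are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: machine load (x) versus motor temperature (y) with noise.
rng = random.Random(42)
samples = [(x, 2.0 * x + 30.0 + rng.gauss(0, 1.0)) for x in range(100)]
rng.shuffle(samples)

# Hold out 20% of the data that the model never sees during training.
split = int(0.8 * len(samples))
train, held_out = samples[:split], samples[split:]

model = fit_line(*zip(*train))
validation_error = mean_abs_error(model, *zip(*held_out))

# Feedback cycle: compare the validation error with the requirement
# (hypothetical threshold) and retrain with other settings if it fails.
MAX_ALLOWED_ERROR = 2.0
meets_requirement = validation_error <= MAX_ALLOWED_ERROR
```

The important design choice is that the held-out samples are separated before training, so the validation error reflects how the model behaves on data it has not seen.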

Machine learning workflow in industrial applications

Finally, ready-to-use AI components need to be integrated throughout the enterprise. This integration is twofold. On the one hand, there is state-of-the-art AI with embedded devices and hardware solutions that accompany machines on site. On the other, there is integration into enterprise systems in the form of software components. These software modules are intended to be adapted to fit existing operations.

The problem is not too little data, but too much data

But where do you find all this data to train your neural networks? Given that data seems to be the new oil of today's world, you might expect to find data sources outside your own company. In practice, industrial companies tend to keep their assets and data to themselves. Other sectors, particularly IT companies, went through the same phase of keeping data and source code to themselves, and a small number of companies still do.

In recent years, the open source approach has enjoyed unimaginable success. Even highly proprietary companies like Microsoft and Google have embraced open source. Sharing creates new business opportunities and adds value to entire industries. For example, industry associations and consortia are launching data-sharing initiatives. Publicly funded activities and research are another source of free and open data. Organizations such as NASA and CERN provide a wealth of valuable data. These data sets are used for general tasks and for testing new algorithms, and they serve as a reference for algorithmic development. When you search for data on the Internet, you will be overwhelmed by the abundance of data available.

Unfortunately, with these masses of data comes a problem. Artificial intelligence is a hot topic, and everyone is eager to learn more. So it's often difficult to decide which open data is right for your project. Many offerings are unstructured, of poor quality or simply poorly described. AI is used in so many different fields, and for so many different use cases, that many datasets won't meet your needs.

Relevant open data sources for industrial artificial intelligence

When you examine the categories of applied industrial AI, you'll find that you can add AI to many products and services and thereby improve your customers' experience. In manufacturing, for example, self-diagnosing machines improve the overall performance of the facility: self-diagnosis increases their efficiency, reliability and safety and extends machine life. Such machines detect signs of wear on their own tools, such as drills, saw blades, welding tools or pliers.

Automation is the second application area that needs data. Trend researchers call this hyper-automation: it gives the existing automation of industrial processes a further boost and takes over routine tasks previously done by people. Here, data from autonomous driving and intelligent robotics standards are used to train vehicles and autonomous industrial machines individually.

A third area in which AI is applied is knowledge discovery for engineering systems. The aim here is to find the root causes of problems and eliminate risks with the help of AI. Many critical areas provide a wealth of data via sensors and history logs. Here, AI could create real insights beyond anomaly detection and simple failure mode detection. AI could then predict the unexpected. It finds relationships between similar incidents in the past and current sensor readings. This allows problems to be avoided even before they occur.
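One simple way to relate current sensor readings to similar past incidents is a nearest-neighbor lookup. The Python sketch below is an illustration only: the incident records, the three sensor values per incident and the function name are invented for the example, and a real system would normalize each sensor channel before comparing distances.

```python
import math


def nearest_incidents(current, history, k=2):
    """Return the k past incidents whose readings are closest to `current`."""
    def distance(a, b):
        # Euclidean distance between two equally long tuples of readings.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(history, key=lambda inc: distance(current, inc["readings"]))
    return ranked[:k]


# Hypothetical incident log: (vibration, temperature, current draw).
history = [
    {"id": "INC-001", "cause": "bearing wear", "readings": (0.82, 71.0, 4.1)},
    {"id": "INC-002", "cause": "coolant leak", "readings": (0.31, 95.0, 1.2)},
    {"id": "INC-003", "cause": "bearing wear", "readings": (0.79, 69.5, 3.8)},
    {"id": "INC-004", "cause": "loose belt", "readings": (0.45, 60.0, 6.5)},
]

current = (0.80, 70.0, 4.0)
matches = nearest_incidents(current, history)
```

If the closest past incidents share a root cause, as the two bearing wear entries do here, that cause is a reasonable first hypothesis to investigate before a failure occurs.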

What data do you need?

With these application areas in mind, you can search for relevant publicly available data. Because many industrial applications require massive amounts of sensor data, the data are not always available for direct download. Sometimes you need to access them through a specific API, which connects to the existing databases and lets you extract and analyze the data.
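Access through an API often means requesting the data page by page. The following Python sketch shows that pattern with the standard library. The endpoint URL and the offset and limit parameters are purely hypothetical, since every data provider defines its own API, so the provider's documentation should always be checked first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only to illustrate the pattern.
BASE_URL = "https://example.org/api/measurements"


def fetch_page(offset, limit):
    """Fetch one page of records from the (hypothetical) JSON API."""
    url = f"{BASE_URL}?{urlencode({'offset': offset, 'limit': limit})}"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_all(fetch=fetch_page, limit=100):
    """Collect all records by requesting pages until a short page arrives."""
    records, offset = [], 0
    while True:
        page = fetch(offset, limit)
        records.extend(page)
        if len(page) < limit:
            # A page shorter than `limit` means there is nothing left.
            return records
        offset += limit
```

Passing the page-fetching function as a parameter makes it possible to try out the paging logic with a local stand-in before connecting to a real service.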


One example of available sensor data is the predictive maintenance dataset for turbofan engines supplied by NASA. It contains sensor data for 100 engines of the same model. The dataset comprises four sets of engine data generated with the C-MAPSS aircraft engine simulator, in which the engines were run under different operating conditions and failure modes.
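A common first step with run-to-failure data of this kind is to label every record with its remaining useful life (RUL). The Python sketch below assumes that each row is whitespace-separated and starts with an engine unit number and a cycle counter, followed by operational settings and sensor values, and that each engine in the training data runs until failure. The sample rows are made up and shortened to two value columns, so check the dataset's own documentation for the actual layout.

```python
from collections import defaultdict


def parse_rows(text):
    """Parse rows of the form: unit id, cycle, then settings and sensor values."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append({
            "unit": int(fields[0]),
            "cycle": int(fields[1]),
            "values": [float(v) for v in fields[2:]],
        })
    return rows


def add_rul(rows):
    """Label each row with RUL = last recorded cycle of its unit - its cycle."""
    last_cycle = defaultdict(int)
    for row in rows:
        last_cycle[row["unit"]] = max(last_cycle[row["unit"]], row["cycle"])
    for row in rows:
        row["rul"] = last_cycle[row["unit"]] - row["cycle"]
    return rows


# Made-up sample: two engines, shortened to two value columns per row.
sample = """
1 1 0.0023 518.67
1 2 0.0019 518.70
1 3 0.0021 519.02
2 1 0.0020 518.55
2 2 0.0024 518.91
"""

labelled = add_rul(parse_rows(sample))
```

The resulting rul field can then serve as the target value when training a model that predicts how many cycles an engine has left.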

This data on turbofan engines comes from NASA's Prognostics Center of Excellence (PCoE), which makes even more open data sets available, contributed by various universities, agencies and companies. These time series data help to create prognostic algorithms because they show the transition from a nominal state to a failed state. Many different industrial tasks are covered: you'll find milling data and bearing tests, as well as data on electronics and batteries.


More open and freely available repositories can be found in the UK. The UK's national oil and gas data repository, NDR, provides 130 terabytes of offshore data covering over 12,500 wells, 5,000 seismic surveys and 3,000 pipelines. This data is freely accessible to all. Repositories like the NDR are not exclusive to the UK: national data repositories of this kind exist in many countries and provide open data as part of an open government approach.

Valuable government data is not limited to the oil and gas industry. The British Geological Survey also provides numerous data sets. It offers real-time seismograms and historical data from its more than 100 seismographic stations across the UK, as well as over 525 additional datasets on various geological topics.

The main search engines for open data

The best way to find open data sources for your AI project is through dedicated search engines, catalogs and aggregators. With the help of these tools, you'll be able to quickly find a suitable dataset. They will guide you through the jungle of available open data sources. As with a classic search engine, you enter a term corresponding to what you're looking for, and the search engine shows you matching datasets.

Google Dataset Search

Google Dataset Search provides an impressive overview of freely available datasets. Once you've performed your search, the results don't just give you the link to the repository. They also give you direct information on the data formats provided and how the data can be accessed. This recently published tool includes around 25 million publicly accessible datasets.


The research data repository registry offers a full-text search of its linked repositories. It has a nice graphical exploration tool under "search by subject" to find open data. But for engineering sciences, there are only a few results. What's more, this search engine doesn't take you directly to the data. It simply sends you to the repositories, where you continue your search.


In addition to this, it's worth taking a look at the well-known Kaggle platform, as it organizes industry-related competitions from time to time. There's also a new platform, called Unearthed, dedicated to solving data science challenges in the context of Industry 4.0 and Factory of the Future concepts.

With these starting points, you'll quickly find the right open data. Open data helps you get started straight away with your industrial AI project, without having to wait for changes to your operational sensors and business configuration.
