Big Data adoption issues and considerations

Information presented on this page may be incomplete or outdated. You can find the most up-to-date version by reading my book Engineering of Big Data Processing

In this part we cover the following topics:


What do we have to think about when adopting Big Data?

The first step has been taken -- we have made the decision to adopt Big Data. This is not the end but the beginning of the journey: there are many things to consider and account for when adopting Big Data.

  • Prerequisites for organization
    Big Data is not a Swiss Army knife that works just out of the box. For data analysis and analytics to offer real value, an enterprise needs to have the right and adequate management schemes in place.

    • Appropriate processes and sufficient skills are needed for those who will be responsible for implementing, adopting, populating and using Big Data solutions.
    • The quality of the data targeted for processing by Big Data solutions needs to be assessed.
    • We have to think one step ahead to ensure that any future changes related to Big Data will not turn the enterprise upside down.

    As we can read in [W6], analysis of the examples in the Catalogue of Catastrophe reveals the most common mistakes (not specific to Big Data, but entirely possible in this field too). Given their frequency of occurrence, these mistakes can be considered classic. The following list outlines the most common themes:

    • The underestimation of complexity, cost and/or schedule.
    • Failure to establish appropriate control over requirements and/or scope.
    • Lack of communication.
    • Failure to engage stakeholders.
    • Failure to address culture change issues.
    • Lack of oversight / poor project management.
    • Poor quality workmanship.
    • Lack of risk management.
    • Failure to understand or address system performance requirements.
    • Poorly planned / managed transitions.

    "As the world’s largest commercial aircraft, the Airbus A380 is a feat of engineering. With two full decks, a wingspan wider than a football pitch and space for up to 850 passengers (in high density mode), the A380 is the most complex aircraft flying today.

    Originally scheduled for delivery in 2006, the aircraft’s entry into service was delayed by almost 2 years and the project was several billion dollars over budget ($6.1B in additional costs due to project delays).

    At the heart of the problems were difficulties integrating the complex wiring system needed to operate the aircraft with the metal airframe through which the wiring needed to thread. 530 km of wires, cables and wiring harnesses weave their way throughout the airframe. With more than 100,000 wires and 40,300 connectors performing 1,150 separate functions, the Airbus A380 has the most complex electrical system Airbus had ever designed. As the first prototype (registration F-WWOW) was being built in Toulouse, France, engineers began to realize that they had a problem. Wires and their harnesses had been manufactured to specification, but during installation the wires turned out to be too short. Even though the cables were at times just a few centimetres too short, in an aircraft you can’t simply tug a cable to make it fit. As construction of the prototype continued, Airbus management slowly came to the realization that the issue was not an isolated problem and that short wires were a pervasive issue throughout the design.

    Internal reviews identified that the heart of the problem was the fact that the different design groups working on the project had used different Computer Aided Design (CAD) software to create the engineering drawings. The development of the aircraft was a collaboration between 16 sites spread across 4 different countries. German and Spanish designers had used one version of the software (CATIA version 4), while British and French teams had upgraded to version 5. In theory, the fact that the design centers were sharing their drawings meant that the electrical system designed in Germany would be compatible with the airframe components designed in France. Part way through the project the design centres also started integrating their diagrams into a single 3D Digital Mock-up Unit (DMU) that should further have validated compatibility. Unfortunately, the construction of F-WWOW demonstrated that theory and practice are not always the same thing.

    In part the problem was that CATIA version 5 was not a simple evolution of version 4; it was a complete rewrite. Reports indicate that the calculations used to establish bend radii for wires as they wove through the airframe were inconsistent across the different versions of the software and that this inconsistency resulted in the problem. Stripping out the wiring from the prototype, redesigning the wiring, making new harnesses and then rethreading the wiring into the airframe became a monumental task. Taking months to complete, the work delayed the project multiple times as hundreds of engineers tried to overcome the problems. At one point more than 1,100 German engineers were camped out at the Toulouse production facility trying to rectify the problems."

    It is really strange that the CATIA V4/V5 issue was not detected early in the A380 development program and was allowed to persist until it was discovered that (significant) component designs could not be matched up between Airbus factories.

    "The root of the problem can be traced back to a single decision: the decision to proceed with the project despite the fact that two CAD systems were in use. That decision resulted in design inconsistencies, mismatched calculations and configuration management failures."

    [W5]

    Another good description of this problem is a Polish translation of http://www.flug-revue.rotor.com/FRHeft/FRHeft06/FRH0612/FR0612b.htm. Unfortunately the original version is inaccessible, and double translation carries well-known risks (I hope that you know the famous example of "The spirit is willing, but the flesh is weak" translated into Russian and then back into English as "The vodka is good, but the meat is rotten" [Wikipedia: Literal translation]), so the English rendering below should be read with that caveat in mind.

    It is not the aircraft but the production cycle itself that contains a critical flaw. It concerns the weakest link in the whole production process, the one that has so far kept us from deciding to start full-scale production. The problem is related to the design of the wiring harnesses intended for use in the nose and tail sections of the fuselage. We are talking about a network of over 530 km of wires, attached to 100,000 wire sections containing 40,300 connectors and measuring 350 km in length for one single aircraft. The wiring of the A380 is twice as complex as in the case of the A340-600.

    The analysis we have carried out over the past weeks suggests that the situation is in fact much worse than we originally thought. There were inconsistencies in the design software used to prepare the structure and layout of the wiring harnesses. What is more, we did not have enough time to modify the wires in any way while they were still at the design stage. Now we face the task of aligning the design application and the corresponding databases with each other. This will certainly take us some time.

    The problems probably result from the use of different versions of the 3D program CATIA. While the Airbus factories in Germany and Spain used the old, proven version 4, the plants in France and Great Britain opted for the newer version 5 and used it in the design process.

    The differences between these two versions are large: while the older version was coded in the Fortran programming language and ran in a UNIX environment on relatively powerful computers, the newer, fifth version, which is ostensibly better adapted to Windows and can also run on personal computers, uses the C++ programming language.

    Apart from a similar interface, the two programs are completely different and also produce different data types. The effect was a problem with transferring data from version 4 to version 5. Attempts of this kind resulted in the loss of notes added by engineers and hints concerning the details of shapes or curves. In turn, data concerning, for example, clamps, shape or size were frozen and could not be edited in order to adjust them. Apart from that, wires are among the most difficult elements to design in CATIA, because they require constant switching between two- and three-dimensional modes.

    [W8]


    Another example of failure is camouflage uniforms, inappropriate for the Afghan environment, designed and bought for... the Afghan army.

    With the goal of unifying the army’s battle dress, a project was initiated to identify a camouflage pattern that could be used for the uniforms.

    Although the US Military has a range of camouflage patterns that could be used at no cost, the officials involved simply surfed the Internet looking for alternatives.

    According to reports, the officials "ran across" a "forest" pattern offered by a private company [and picked it]. Because the pattern was privately owned, uniforms made with it would require a license fee to be paid to the rights holder.

    Despite that issue, the pattern was shown to the Afghan Defense Minister, who "liked what he saw". As a result, the decision to pay for the proprietary pattern was made. Apparently no efforts were made to engage the actual troops or their commanders, and no effort was made to test the camouflage to determine its suitability to the Afghan geography. Given that only 2% of Afghanistan is forest, the use of a forest pattern does seem like a poor choice.

    In an interview with USA Today, the author of the report is quoted as having said: "My concern is what if the minister of defense liked purple, or pink? Are we going to buy pink uniforms for soldiers and not ask questions? That’s insane. This is just simply stupid on its face. We wasted $28 million of taxpayers’ money in the name of fashion, because the defense minister thought that that pattern was pretty. So if he thought pink or chartreuse was it, would we have done that?"

    Pink Uniforms

  • Data sources considerations
    Some data are free; others are commercial (we have to pay for them once or periodically) or use some other distribution model. Consider whether any updates of your data are planned and whether they will be compatible with the data gathered previously. Compatibility of past, present and future data is a very important factor which plays a crucial role in Big Data utilization (see the A380 problems in the previous section); a simple compatibility guard is sketched below.
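
    One simple guard is to refuse an update whose records no longer carry the fields the rest of the pipeline depends on. Below is a minimal Python sketch of that idea; the field names and the in-memory STORE are hypothetical stand-ins for your schema and storage layer:

        EXPECTED_FIELDS = {"id", "timestamp", "value"}  # fields our pipeline relies on
        STORE = []                                      # stand-in for the real data store

        def is_backward_compatible(records):
            """True if every record still carries all the fields we depend on."""
            return all(EXPECTED_FIELDS <= set(r) for r in records)

        def ingest(records):
            if not is_backward_compatible(records):
                # Refuse the update instead of silently corrupting downstream analyses.
                raise ValueError("update incompatible with previously gathered data")
            STORE.extend(records)

        ingest([{"id": 1, "timestamp": "2019-01-01", "value": 42.0}])  # accepted
        # ingest([{"id": 2, "ts": "2019-01-02", "reading": 7.5}])      # would raise

    A real system would also check field types and semantics (e.g. with a schema registry), but even this naive field-set test fails loudly instead of letting a vendor's renamed field silently break analytics weeks later.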

    Think about compatibility.

  • Data tracking
    When working with data, their authenticity and quality become very important factors. Because data is in different states at different stages of the analytics lifecycle -- it is transmitted (data-in-motion), processed (data-in-use) or stored (data-at-rest) -- we have to keep all information about the source of the data and how it has been processed so far. In other words, we have to save as metadata the whole data history at every state, capturing what has been done with the data from the first moment we got it. Original data should never be deleted or changed in an irreversible way: no matter what you think now, a day may come when you will need it in the form you first saw it. One possible implementation of such an append-only history is sketched below.
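
    As an illustration, here is a minimal append-only history in Python: the bytes we first received are kept verbatim, every transformation produces a new version, and each step is logged with a content hash so the lineage can be demonstrated later. Class and field names are only a suggestion:

        import hashlib, json, time

        def sha256(data: bytes) -> str:
            return hashlib.sha256(data).hexdigest()

        class TrackedDataset:
            """Append-only history: the original payload stays untouched;
            every transformation adds a new version plus a log entry."""

            def __init__(self, original: bytes, source: str):
                self.versions = [original]  # version 0 = the data as first received
                self.history = [{"step": "acquired", "source": source,
                                 "sha256": sha256(original), "time": time.time()}]

            def apply(self, step_name: str, transform):
                new = transform(self.versions[-1])  # never modifies earlier versions
                self.versions.append(new)
                self.history.append({"step": step_name, "sha256": sha256(new),
                                     "time": time.time()})
                return new

        ds = TrackedDataset(b" 12,7,42 \n", source="http://example.com/feed.csv")
        ds.apply("strip_whitespace", bytes.strip)
        print(json.dumps(ds.history, indent=2))  # full lineage; version 0 is intact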

    Never delete or change data in an irreversible way.

  • Data provenance
    Both data source and data tracking considerations can be described with one term: data provenance. As we can read in [W10], before jumping into any exploratory data analysis, we should know as much as possible about the provenance of the data we are analyzing. We need to understand how the data was collected and how it was processed. Are there any past transformations of the data that could affect our analysis? We should be able to answer the following questions about our dataset:

    • How was it collected?
    • Was it properly sampled?
    • Was the dataset transformed in any way?
    • Are there any known problems with the dataset?

    If we don’t understand where the data is coming from, we will have a hard time drawing any meaningful conclusions from the dataset. We are also at risk of making serious analysis mistakes. It is worth recording the answers next to the dataset itself, as sketched below.
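
    A simple way to persist those answers is a "sidecar" metadata file bound to the exact content of the dataset by a checksum. A Python sketch -- the file name and the answer fields are invented for illustration:

        import hashlib, json, pathlib

        def write_provenance(data_path: str, **answers):
            """Store provenance answers next to the data file,
            tied to its exact content via a checksum."""
            data = pathlib.Path(data_path).read_bytes()
            record = {"sha256": hashlib.sha256(data).hexdigest(), **answers}
            pathlib.Path(data_path + ".provenance.json").write_text(
                json.dumps(record, indent=2))

        pathlib.Path("survey.csv").write_text("age,answer\n34,yes\n")  # demo data
        write_provenance(
            "survey.csv",
            collected="web survey, March 2019, self-selected respondents",
            properly_sampled=False,  # self-selection bias!
            transformations=["deduplicated", "ages bucketed into 5-year bins"],
            known_problems=["question 7 was ambiguous"],
        )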

    Ability to prove provenance information is crucial to evaluate the value of the analytic result.


Businesses pay greater attention to data quality and monetization, but the situation does not look too optimistic for now. As much as 33% of data is considered ROT data (redundant, obsolete or trivial), and only 20% of companies monitor the quality of their data, manage it and take care of its improvement.

The observed phenomenon of the democratization of analytics does not mean that enterprises are convinced of the high quality of the information they have. According to the SAS study:

  • 12% of respondents admitted that they have current data,
  • 9% of respondents are convinced that they have correct data,
  • 6% of respondents say that they have complete data.

According to SAS experts, the lack of trust in the quality of information on the one hand, and the belief that decisions made on its basis will be accurate and will have a positive impact on the business on the other, can significantly slow down the operations of such enterprises.

The degree of trust in the data depends on its source:

  • 63% -- internal information is the most appreciated,
  • 57% -- data from sensors / Internet of Things devices,
  • 55% -- controller data,
  • 37% -- information provided by customers,
  • 29% -- data from sellers / partners,
  • 23% -- publicly available information,
  • 12% -- data provided by business competitors.

According to the Capgemini report "Big & Fast Data: The Rise of Insight-Driven Business", only 27% of enterprises so far have considered their implemented projects to be successful. Why? One of the reasons was that companies processed outdated information. Meanwhile, many decisions must be made in real time, based on the most up-to-date, high-quality data.

In 2019 and in subsequent years, business will pay much more attention to the quality of data than before. Companies are gradually growing out of the conviction that only the amount of data they collect counts -- relevance is becoming crucial. They see that it is better to have less data but be sure that, for example, it is the latest data, reflecting current market needs. Conscious companies enrich their systems with 3rd-party data, and big data providers are increasingly asked about the age of data and so-called data quality. The data must be updated regularly, because parameters present in it, such as a user's "shopping intentions", are often temporary rather than permanent. Therefore data provided in the real-time model will be very valuable.

Sebastian Kuniszewski, "Tylko jedna na pięć firm dba o jakość danych", IT Professional 4/2019, pp. 68-69

  • Privacy
    Performing analytics on datasets can reveal confidential information about organizations or individuals. Even analyzing datasets that contain seemingly insignificant and unrelated facts can reveal secrets when the datasets are analyzed jointly, as the toy example below shows. The question is whether we are allowed to work with such confidential information and whether we can assure an adequate protection level. An accidentally disclosed fact may become a source of big social, political or financial problems.
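
    A toy illustration with made-up data: neither dataset below is sensitive on its own, but joined on quasi-identifiers (here: ZIP code and birth year) they re-identify a person in an "anonymized" medical table:

        # Two "harmless" datasets: a public register and an anonymized
        # medical dataset. Joined on quasi-identifiers they reveal a secret.
        voters = [
            {"name": "A. Kowalski", "zip": "00-950", "born": 1975},
            {"name": "B. Nowak",    "zip": "31-154", "born": 1982},
        ]
        medical = [  # names removed, so "anonymous"...
            {"zip": "00-950", "born": 1975, "diagnosis": "diabetes"},
        ]

        for v in voters:
            for m in medical:
                if (v["zip"], v["born"]) == (m["zip"], m["born"]):
                    print(f'{v["name"]} probably has {m["diagnosis"]}')

    This is the classic linkage attack; one defence is to treat quasi-identifiers as sensitive as well, generalizing or suppressing them before publication.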

    Keep private what should be private.

  • Security
    Some of the components of Big Data solutions lack the robustness of traditional enterprise solution environments when it comes to access control and data security. A good example are NoSQL databases, which generally do not provide robust built-in security mechanisms. Instead they rely on simple HTTP-based APIs where data is exchanged in plaintext, making the data prone to network-based attacks.

    Also, because the data lifecycle is more complicated -- more people work with the data at different business stages, and they have different levels of IT skills and awareness of what should be done and how it should be done -- correct security policies with proper authentication and authorization mechanisms are indispensable; one way of retrofitting authentication and encryption onto a plaintext HTTP API is sketched below.
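
    One common mitigation is to place an authenticating, TLS-terminating gateway in front of the plaintext HTTP API instead of exposing the store directly. A minimal Python sketch using only the standard library; the backend address, the credentials and the certificate files are hypothetical placeholders, and a production setup would rather use a hardened reverse proxy with per-user authorization:

        import base64, ssl, urllib.request
        from http.server import BaseHTTPRequestHandler, HTTPServer

        BACKEND = "http://127.0.0.1:5984"  # hypothetical NoSQL HTTP endpoint
        TOKEN = base64.b64encode(b"analyst:s3cret").decode()  # demo credentials

        class Gateway(BaseHTTPRequestHandler):
            def do_GET(self):  # only GET shown; other verbs work the same way
                if self.headers.get("Authorization") != "Basic " + TOKEN:
                    self.send_response(401)
                    self.send_header("WWW-Authenticate", 'Basic realm="data"')
                    self.end_headers()
                    return
                with urllib.request.urlopen(BACKEND + self.path) as resp:
                    body = resp.read()  # no error handling in this sketch
                self.send_response(200)
                self.end_headers()
                self.wfile.write(body)

        httpd = HTTPServer(("0.0.0.0", 8443), Gateway)
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain("cert.pem", "key.pem")  # TLS ends at the gateway
        httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
        httpd.serve_forever()  # the store itself stays off the network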

    Know who touches the data, why, and with what effect.

  • Law
    Different law (but also culture and social relations) in different countries may cause problems when it comes to data usage. Let's say that we have two data sources: one located in the same place as the main branch of the company, and a second in another country or even in a cloud. If we are going to process, in the main branch of the company, data located in another country, which law should be considered?

    Law? What law?

  • Realtime processing
    Although we all need results as fast as possible, many open source Big Data solutions and tools are batch-oriented. We can achieve near-realtime results by processing transactional data as it arrives and combining it with previously summarized batch-processed data, as the sketch below illustrates.
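
    This batch-plus-stream combination is the idea behind the so-called lambda architecture. A toy Python sketch of the query-time merge; all names and numbers are invented:

        # Batch layer: recomputed e.g. nightly over the full history.
        batch_counts = {"page_a": 10_000, "page_b": 7_500}

        # Speed layer: incremented as transactions arrive.
        realtime_counts = {"page_a": 12, "page_c": 3}

        def query(page: str) -> int:
            """Serving layer: near-realtime total = batch view + fresh increments."""
            return batch_counts.get(page, 0) + realtime_counts.get(page, 0)

        print(query("page_a"))  # 10012 -- batch result enriched with fresh events
        print(query("page_c"))  # 3     -- so far seen only by the speed layer
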
  • Performance
    Due to the volumes of data that some Big Data solutions are required to process, performance is often a concern.
  • Governance
    Any Big Data based solution collects data, accesses data, processes data and generates data, all of which become assets of the business. In such a case a governance framework is required to ensure that the data and the solution environment itself are regulated, standardized and evolved in a controlled manner.