Every day, terms creep into our language that we use frequently and casually, often without really knowing what they mean. SQL, Big Data, MapReduce, NoSQL and Machine Learning are some of them.
In this article we explain the meaning of each one from a historical perspective, so that you can take part in the everyday conversations in which these topics come up.
Relational databases and the SQL language
Before the 1960s, information was stored on computers in files that contained data about each of a company's stakeholders: customers, suppliers, etc. To answer a question such as "which customers do we have in Madrid?", you had to open these files one by one, or write a specific program that reviewed their contents and selected that information.
From then on, the first programs appeared that allowed a company's data to be stored and queried in a structured way. They were called database management systems.
At first, the information was organized in the form of a network, but in the 1970s tables began to be used. Each row represented, for example, a customer of the company, and each column one piece of that customer's data (identity document, name, address, etc.). This was the relational model, and the SQL language was used to retrieve information from it.
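As a minimal sketch of the idea, the question about customers in Madrid can be answered with a single SQL query instead of opening files one by one. The example uses Python's built-in sqlite3 module; the table name, column names and sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; the table, columns and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id_document TEXT PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("001", "Ana", "Madrid"),
        ("002", "Luis", "Sevilla"),
        ("003", "Marta", "Madrid"),
    ],
)


def customers_in(city):
    """Return the names of the customers located in the given city."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]


madrid_customers = customers_in("Madrid")
print(madrid_customers)  # ['Ana', 'Marta']
```

Each row is a customer and each column one of their attributes, which is exactly the table layout described above.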
The relational model is still used today, since it has properties that are necessary in certain cases. For example, in bank databases it is essential to be certain that there are no inconsistencies in the information (data with different values in different places, movements that are reflected in the source account but not in the destination account, etc.) and that operations are carried out completely, without stopping at an intermediate step (atomicity).
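The atomicity requirement can be sketched with a small bank transfer. This is only an illustration using Python's built-in sqlite3 module: the accounts table, the account names and the transfer function are hypothetical. The destination is credited first on purpose, so that a failed debit shows the whole operation being undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint forbids negative balances (illustrative schema).
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("source", 100), ("destination", 0)]
)
conn.commit()


def transfer(amount):
    """Move money from 'source' to 'destination' as one atomic operation."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? "
                "WHERE name = 'destination'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE name = 'source'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False


def balances():
    """Return the current balance of every account as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok_small = transfer(30)   # succeeds: both updates are applied
ok_large = transfer(500)  # fails: neither update is kept
print(ok_small, ok_large, balances())
```

After the failed transfer, the balances are the same as before it started: the movement is never visible in one account and missing from the other.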
The origins of Big Data: NoSQL databases
In recent years, many people have moved their daily activities to the digital world. This generates huge amounts of data that companies and organizations store and process in order to obtain useful information that can give them a competitive advantage. This is what is known as Big Data.
Processing this data has created three new needs that the relational model does not fully meet:
- Volume: more efficient and flexible ways of storing information are needed.
- Variability: these must allow table structures to be modified on the fly, while the system is running.
- Speed: they have to respond within seconds to processing that can involve terabytes of data.
This has favored the emergence of NoSQL databases.
MapReduce: what is this programming model?
Another factor that has made these solutions feasible is MapReduce, a model for processing data in parallel on distributed machines.
Broadly speaking, it is a programming model that divides a problem into parts, obtains a partial result from each one (the map step) and combines those results (the reduce step) to reach a global solution to the original problem.
MapReduce cannot always be applied, but it fits many of the problems related to Big Data.
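The classic illustration of this divide, compute and combine idea is counting words. The sketch below runs in a single Python process, so it only mimics the structure of MapReduce: in a real distributed system each chunk would be mapped on a different machine. The function names and sample text are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the partial values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run the three phases in sequence over all chunks."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


# Each string stands for the part of the input handled by one worker.
chunks = ["big data big", "data sql", "sql big"]
counts = word_count(chunks)
print(counts)  # {'big': 3, 'data': 2, 'sql': 2}
```

Because every chunk is mapped independently of the others, the map step is the part that can be spread over many machines.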
It is the turn of the dnRDBS
In the late 2000s, distributed non-relational database systems (dnRDBS) emerged. Some of them still have a very significant market share today:
- MongoDB: it is a document database. Each element of the database is a document with a free structure, which can differ from that of the documents stored previously.
- Cassandra: it is a key-value database. The information is stored in a way similar to a dictionary.
- Apache HBase: it is a columnar database. The information is stored in tables, and the number of columns can vary from one record to another.
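To make the three models more concrete, the sketch below imitates each one with plain Python data structures. It does not use the client libraries or APIs of MongoDB, Cassandra or HBase; all keys, field names and values are invented for illustration.

```python
# Document model: each document can have its own structure.
documents = [
    {"_id": 1, "name": "Ana", "city": "Madrid"},
    {"_id": 2, "name": "Luis", "emails": ["luis@example.com"], "vip": True},
]

# Key-value model: a value is stored and retrieved by its key,
# much like a dictionary.
key_value = {
    "customer:1": "Ana",
    "customer:2": "Luis",
}

# Columnar model: every row has a key, and the set of columns
# can vary from one row to the next.
wide_rows = {
    "row1": {"name": "Ana", "city": "Madrid", "phone": "555-0101"},
    "row2": {"name": "Luis"},
}


def fields_per_document(docs):
    """Return the sorted field names of each document."""
    return [sorted(doc) for doc in docs]


def columns_per_row(rows):
    """Return how many columns each row actually stores."""
    return {key: len(columns) for key, columns in rows.items()}


print(fields_per_document(documents))
print(key_value["customer:2"])
print(columns_per_row(wide_rows))
```

In all three cases the structure is decided per element rather than fixed in advance for a whole table, which is the main contrast with the relational model.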
Big Data and the importance of scalability
Another very interesting issue related to Big Data is scalability. For example, on Black Friday an online store can multiply its business volume and, therefore, its servers receive many more requests than usual. To cope with this, the infrastructure that supports the store must be scalable, that is, it must have more machines in operation at that moment than on days with normal traffic.
It is also possible that a peak in demand cannot be predicted and, consequently, the number of machines providing the service must grow or shrink dynamically according to the demand at each moment. This is a feature offered by cloud infrastructures such as Google Cloud Platform, Microsoft Azure, AWS or IBM Cloud, among others.
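The idea of growing and shrinking with demand can be sketched as a simple sizing rule. This is a hypothetical illustration, not the autoscaling API of any cloud provider: the function name, the capacity of 100 requests per second per machine and the limits on the number of machines are all assumptions.

```python
import math


def machines_needed(requests_per_second, capacity_per_machine,
                    min_machines=1, max_machines=20):
    """Return how many machines should be running for the current demand."""
    required = math.ceil(requests_per_second / capacity_per_machine)
    # Never drop below the minimum or exceed the maximum allowed.
    return max(min_machines, min(required, max_machines))


# Simulated demand over time, in requests per second (made-up figures).
demand = [80, 120, 950, 2400, 300]
fleet = [machines_needed(d, capacity_per_machine=100) for d in demand]
print(fleet)  # [1, 2, 10, 20, 3]
```

The fleet grows as the peak arrives, is capped at the configured maximum, and shrinks again once traffic returns to normal.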
SQL is still necessary
Despite the expansion of cloud computing platforms that allow NoSQL databases to be used for Big Data, relational data models (and the SQL language) continue to be used in applications and systems for several reasons:
- There are circumstances in which relational databases offer a better answer.
- Many NoSQL databases use dialects of the SQL language to operate.
- In the first stage of a Big Data project, it is necessary to understand the input data and preprocess it so that it is in a format that the chosen Machine Learning algorithm can process easily. In many cases, SQL is used for this phase.
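As an illustration of the preprocessing mentioned in the last point, the sketch below uses SQL to clean raw records and aggregate them into one row of features per customer. It relies on Python's built-in sqlite3 module; the orders table, its columns and the sample data are assumptions, and a real project would feed the result to whichever Machine Learning library it uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ana", " Madrid", 20.0),    # stray space, mixed case
        ("ana", "madrid ", 30.0),
        ("luis", "Sevilla", None),   # missing amount
        ("luis", "sevilla", 10.0),
    ],
)

# Discard incomplete rows, normalize the city text and build
# one row of numeric features per customer.
features = conn.execute(
    """
    SELECT customer,
           LOWER(TRIM(city)) AS city,
           COUNT(*)          AS n_orders,
           SUM(amount)       AS total_amount
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY customer, LOWER(TRIM(city))
    ORDER BY customer
    """
).fetchall()

print(features)  # [('ana', 'madrid', 2, 50.0), ('luis', 'sevilla', 1, 10.0)]
```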
Further proof that this language remains relevant is the high demand for professionals who know it. A job search with this requirement easily turns up around 1,500 offers on Infojobs and, according to the TicJob portal, SQL is the third most requested technology. It should also not be overlooked that these are well-paid jobs, with salaries that are easily around 30,000 euros per year.