Habits for Faster Quality
Today’s applications are powered by data, just like machines are powered by energy.
For example, a cab-booking application relies on data such as your location, map information, driver availability, dynamic pricing, and payment details.
A relational database models data in terms of entities and relationships. Here are some examples of entities and the relationships between them:
- A customer is identified by a CustomerID.
- A customer orders a ride.
- A ride is identified by an OrderID.
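As a rough sketch, these entities and the relationship between them could be expressed as two tables. The example below uses Python's built-in `sqlite3` module with an in-memory database. The table names `customer` and `ride`, the `CustomerName` column, and the sample OrderID are assumptions made for illustration; only CustomerID and OrderID come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical schema: one table per entity, linked by a foreign key.
conn.executescript("""
CREATE TABLE customer (
    CustomerID   INTEGER PRIMARY KEY,   -- identifies a customer
    CustomerName TEXT NOT NULL
);
CREATE TABLE ride (
    OrderID    INTEGER PRIMARY KEY,     -- identifies a ride
    CustomerID INTEGER NOT NULL REFERENCES customer(CustomerID)  -- customer orders a ride
);
""")

conn.execute("INSERT INTO customer VALUES (300, 'Arijit')")
conn.execute("INSERT INTO ride VALUES (1, 300)")  # OrderID 1 is a made-up value

# A ride ordered by a customer that does not exist should be rejected.
try:
    conn.execute("INSERT INTO ride VALUES (2, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rides = conn.execute("SELECT OrderID, CustomerID FROM ride").fetchall()
print(rides, rejected)
```

The foreign key is what encodes the "orders" relationship: every ride must point at an existing customer.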
If you aren’t familiar with SQL, you can quickly get a flavor of it using the w3schools SQL interpreter.
Consider the following data:
CustomerID | CustomerName | … | Pin code | City | Country |
---|---|---|---|---|---|
300 | Arijit | … | 110005 | Delhi | India |
301 | John | … | 02111 | Boston | USA |
302 | Mary | … | 02115 | Boston | USA |
303 | Sonu | … | 560001 | Bangalore | India |
304 | Vijay | … | 560002 | Bangalore | India |
305 | Latha | … | 110001 | Delhi | India |
Observe that there is some redundancy here. For example, ‘India’ is repeated in the row of every customer in India, even though both the city and the country can be inferred from the pin code.
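Normalization is the process of removing such redundancy. Here is a normalized representation of the above data, split into three tables: customers, pin code prefixes mapped to cities, and cities mapped to countries.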
CustomerID | CustomerName | … | Pin code |
---|---|---|---|
300 | Arijit | … | 110005 |
301 | John | … | 02111 |
302 | Mary | … | 02115 |
303 | Sonu | … | 560001 |
304 | Vijay | … | 560002 |
305 | Latha | … | 110001 |
Pin code prefix | City |
---|---|
1100 | Delhi |
5600 | Bangalore |
0211 | Boston |
City | Country |
---|---|
Delhi | India |
Bangalore | India |
Boston | USA |
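The split above can be sketched in plain Python. This is only an illustration: it assumes that the first four characters of a pin code determine the city, as the example tables suggest, and it leaves out the columns elided with "…". The variable names are made up for the example.

```python
# Flat rows from the example: (CustomerID, CustomerName, pin code, City, Country).
# Pin codes are kept as strings so that leading zeros (as in 02111) survive.
flat_rows = [
    (300, "Arijit", "110005", "Delhi", "India"),
    (301, "John", "02111", "Boston", "USA"),
    (302, "Mary", "02115", "Boston", "USA"),
    (303, "Sonu", "560001", "Bangalore", "India"),
    (304, "Vijay", "560002", "Bangalore", "India"),
    (305, "Latha", "110001", "Delhi", "India"),
]

PREFIX_LEN = 4  # assumption taken from the example tables

# Table 1: customers keep only their own attributes and the pin code.
customers = [(cid, name, pin) for cid, name, pin, _city, _country in flat_rows]

# Table 2: each pin code prefix is stored once, with its city.
prefix_to_city = {
    pin[:PREFIX_LEN]: city for _cid, _name, pin, city, _country in flat_rows
}

# Table 3: each city is stored once, with its country.
city_to_country = {
    city: country for _cid, _name, _pin, city, country in flat_rows
}

print(customers)
print(prefix_to_city)
print(city_to_country)
```

After the split, each country name is stored once per city rather than once per customer.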
There are situations where redundancy is actually desirable. An example is healthcare data: every medical image needs to be accompanied by all of the patient’s details for effective diagnosis.
Similarly, machine learning algorithms typically need all the relevant data together to build effective models.
Normalized data can be denormalized using ‘join’ operations. The w3schools tutorial mentioned above is a good place to learn about joins.
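As a minimal sketch of such a join, the example below loads the three normalized tables into an in-memory SQLite database through Python's `sqlite3` module and joins them back into the original flat form. The table and column names (`customer`, `pin_prefix_city`, `city_country`, `PinCode`, `PinPrefix`) are assumptions made for illustration, as is matching a customer to a city on the first four characters of the pin code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Pin codes are stored as TEXT so that leading zeros are preserved.
conn.executescript("""
CREATE TABLE customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, PinCode TEXT);
CREATE TABLE pin_prefix_city (PinPrefix TEXT PRIMARY KEY, City TEXT);
CREATE TABLE city_country (City TEXT PRIMARY KEY, Country TEXT);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (300, "Arijit", "110005"), (301, "John", "02111"), (302, "Mary", "02115"),
    (303, "Sonu", "560001"), (304, "Vijay", "560002"), (305, "Latha", "110001"),
])
conn.executemany("INSERT INTO pin_prefix_city VALUES (?, ?)", [
    ("1100", "Delhi"), ("5600", "Bangalore"), ("0211", "Boston"),
])
conn.executemany("INSERT INTO city_country VALUES (?, ?)", [
    ("Delhi", "India"), ("Bangalore", "India"), ("Boston", "USA"),
])

# Two joins: customer -> city (via pin code prefix), then city -> country.
denormalized = conn.execute("""
SELECT c.CustomerID, c.CustomerName, c.PinCode, p.City, cc.Country
FROM customer AS c
JOIN pin_prefix_city AS p ON substr(c.PinCode, 1, 4) = p.PinPrefix
JOIN city_country AS cc ON cc.City = p.City
ORDER BY c.CustomerID
""").fetchall()

for row in denormalized:
    print(row)
```

Each result row carries the city and country again, which is the redundant form shown in the first table.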