Top SQL Queries for Data Scientists

SQL (Structured Query Language) is one of the core tools for data manipulation and analysis. Data scientists rely on SQL queries to select, modify, and analyse large data sets efficiently, and well-written queries improve the quality of the findings drawn from that data.

This article covers the SQL queries every data scientist should be comfortable with, from filtering, aggregation, and joins to window functions, CTEs, and data modification.

Table of Contents

Basic SQL Queries
Aggregation and Grouping
Advanced Filtering Techniques
Joins and Unions
Advanced SQL Functions
Window Functions
Common Table Expressions (CTEs)
Data Modification Queries
Conclusion

Basic SQL Queries

Retrieving Data with SELECT

The SELECT statement is fundamental for retrieving data from a database. For example, to retrieve all columns from a table named employees:

SELECT * FROM employees;

Filtering Data with WHERE

The WHERE clause allows you to filter data based on specific conditions. To find employees in the ‘Sales’ department:

SELECT * FROM employees WHERE department = 'Sales';

Sorting Data with ORDER BY

The ORDER BY clause sorts the result set. To sort employees by their salary in descending order:

SELECT * FROM employees ORDER BY salary DESC;

Limiting Results with LIMIT

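The LIMIT clause restricts the number of rows returned. It is supported by MySQL, PostgreSQL, and SQLite; SQL Server uses TOP and Oracle uses FETCH FIRST instead. To get the five highest-paid employees: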

SELECT * FROM employees ORDER BY salary DESC LIMIT 5;

Aggregation and Grouping

Using Aggregate Functions

Aggregate functions perform calculations on multiple rows. For example, to get the total salary expense:

SELECT SUM(salary) FROM employees;

Grouping Data with GROUP BY

The GROUP BY clause groups rows that have the same values. To find the average salary by department:

SELECT department, AVG(salary) FROM employees GROUP BY department;

Filtering Groups with HAVING

The HAVING clause filters groups based on aggregate conditions. To find departments with an average salary above 50,000:

SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000;
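
WHERE and HAVING are often combined, and the order in which they apply matters. The following is a minimal sketch, assuming PostgreSQL or SQLite syntax and a hypothetical table emp_having_demo with made-up rows; it is meant to illustrate the two filters, not to prescribe a schema.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_having_demo (
    name       TEXT,
    department TEXT,
    salary     INTEGER
);

INSERT INTO emp_having_demo (name, department, salary) VALUES
    ('Ana',  'Sales',       40000),
    ('Ben',  'Sales',       52000),
    ('Cara', 'Engineering', 80000),
    ('Dev',  'Engineering', 60000),
    ('Elle', 'Support',     30000);

-- WHERE removes individual rows before grouping;
-- HAVING removes whole groups after the aggregates are computed.
SELECT department,
       COUNT(*)    AS headcount,
       AVG(salary) AS avg_salary
FROM emp_having_demo
WHERE salary >= 35000          -- row-level filter: drops Elle
GROUP BY department
HAVING AVG(salary) > 50000     -- group-level filter: drops Sales (average 46000)
ORDER BY department;
-- Only Engineering remains, with a headcount of 2 and an average of 70000.
```

A condition that does not involve an aggregate usually belongs in WHERE, since it reduces the rows the database has to group.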

Advanced Filtering Techniques

Using Subqueries in WHERE Clause

Subqueries can be used within a WHERE clause to filter data. To find employees who earn more than the average salary:

SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

Correlated Subqueries

A correlated subquery refers to the outer query. To find employees who have the highest salary in their department:

SELECT * FROM employees e1 WHERE salary = (SELECT MAX(salary) FROM employees e2 WHERE e1.department = e2.department);

Using CASE Statements for Conditional Logic

The CASE statement allows for conditional logic. To categorize employees based on their salary:

SELECT name, salary,
    CASE
        WHEN salary > 70000 THEN 'High'
        WHEN salary BETWEEN 50000 AND 70000 THEN 'Medium'
        ELSE 'Low'
    END AS salary_category
FROM employees;
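
CASE can also sit inside an aggregate function to count how many rows fall into each category per group. This is a minimal sketch, assuming PostgreSQL or SQLite syntax and a hypothetical table emp_case_demo; the salary thresholds are the ones used in the query above.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_case_demo (
    name       TEXT,
    department TEXT,
    salary     INTEGER
);

INSERT INTO emp_case_demo (name, department, salary) VALUES
    ('Ana',  'Sales',       40000),
    ('Ben',  'Sales',       52000),
    ('Cara', 'Engineering', 80000),
    ('Dev',  'Engineering', 60000);

-- Each CASE yields 1 for rows in its band and 0 otherwise,
-- so SUM counts the rows per band within each department.
SELECT department,
       SUM(CASE WHEN salary > 70000 THEN 1 ELSE 0 END)                  AS high_count,
       SUM(CASE WHEN salary BETWEEN 50000 AND 70000 THEN 1 ELSE 0 END)  AS medium_count,
       SUM(CASE WHEN salary < 50000 THEN 1 ELSE 0 END)                  AS low_count
FROM emp_case_demo
GROUP BY department
ORDER BY department;
-- Engineering: high 1, medium 1, low 0
-- Sales:       high 0, medium 1, low 1
```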

Joins and Unions

Understanding Different Types of Joins

Joins combine rows from two or more tables. An INNER JOIN returns only matching rows:

SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

A LEFT JOIN returns all rows from the left table and the matching rows from the right table. Left-table rows with no match are still returned, with NULL in the right table's columns:

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;

Combining Results with UNION and UNION ALL

The UNION operator combines the result sets of two queries, removing duplicates:

SELECT name FROM employees
UNION
SELECT name FROM contractors;

The UNION ALL operator includes duplicates:

SELECT name FROM employees
UNION ALL
SELECT name FROM contractors;
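
The difference is easiest to see when the same value appears in both inputs. Below is a minimal sketch, assuming PostgreSQL or SQLite syntax; the two demo tables and their rows are hypothetical.

```sql
-- Hypothetical sample data; table names are assumptions.
CREATE TEMP TABLE emp_union_demo (name TEXT);
CREATE TEMP TABLE contractor_union_demo (name TEXT);

INSERT INTO emp_union_demo (name) VALUES ('Ana'), ('Ben');
INSERT INTO contractor_union_demo (name) VALUES ('Ben'), ('Cara');

-- UNION removes the duplicate: 3 rows (Ana, Ben, Cara).
SELECT name FROM emp_union_demo
UNION
SELECT name FROM contractor_union_demo
ORDER BY name;

-- UNION ALL keeps every row: 4 rows (Ana, Ben, Ben, Cara).
SELECT name FROM emp_union_demo
UNION ALL
SELECT name FROM contractor_union_demo
ORDER BY name;
```

Because UNION ALL skips the de-duplication step, it is usually the cheaper choice when duplicates are impossible or acceptable.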

Handling NULL Values in Joins

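NULL values can affect join results. In a LEFT JOIN, employees with no matching department get NULL in the columns that come from departments, and filtering those rows out with IS NOT NULL would reduce the result to that of an INNER JOIN. To keep every employee and show a placeholder where the department is missing:

SELECT e.name, COALESCE(d.department_name, 'Unassigned') AS department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;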
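
A LEFT JOIN can also isolate the rows that found no match: unmatched rows carry NULL in every right-table column, so testing the right table's key for NULL keeps only them. This is a minimal sketch, assuming PostgreSQL or SQLite syntax and two hypothetical demo tables.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE dept_null_demo (
    id              INTEGER PRIMARY KEY,
    department_name TEXT
);
CREATE TEMP TABLE emp_null_demo (
    name          TEXT,
    department_id INTEGER
);

INSERT INTO dept_null_demo (id, department_name) VALUES
    (1, 'Sales'),
    (2, 'Engineering');

INSERT INTO emp_null_demo (name, department_id) VALUES
    ('Ana',  1),
    ('Ben',  2),
    ('Cara', NULL),   -- no department recorded
    ('Dev',  99);     -- points at a department that does not exist

-- Anti-join: keep only employees whose join found no department.
-- A NULL department_id never satisfies the ON condition, so Cara is
-- unmatched just like Dev.
SELECT e.name
FROM emp_null_demo e
LEFT JOIN dept_null_demo d ON e.department_id = d.id
WHERE d.id IS NULL
ORDER BY e.name;
-- Returns Cara and Dev.
```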

Advanced SQL Functions

String Functions

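String functions manipulate text data. CONCAT is available in MySQL, PostgreSQL, and SQL Server; standard SQL also provides the || operator for concatenation. For example, to concatenate first and last names:

SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM employees;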

Date and Time Functions

Date functions handle date and time data. NOW() works in MySQL and PostgreSQL; CURRENT_TIMESTAMP is the standard, more portable equivalent. To get the current date and time:

SELECT NOW();

Numeric Functions

Numeric functions perform operations on numbers. To round salaries to the nearest thousand:

SELECT ROUND(salary, -3) FROM employees;

Window Functions

Window functions perform a calculation across a set of rows related to the current row without collapsing them into a single output row. To assign a row number to each employee:

SELECT name, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num FROM employees;

Using ROW_NUMBER, RANK, and DENSE_RANK

These functions number rows according to the ORDER BY inside OVER. ROW_NUMBER gives every row a unique sequential number, even when values tie:

SELECT name, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num FROM employees;

RANK gives tied rows the same rank and leaves a gap after them:

SELECT name, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;

DENSE_RANK also gives tied rows the same rank but leaves no gaps:

SELECT name, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_salary_rank FROM employees;
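
Putting the three functions side by side on data with a tie makes the difference concrete. This is a minimal sketch, assuming PostgreSQL or SQLite (3.25 or later) syntax and a hypothetical table emp_rank_demo; name is added as a tiebreaker for ROW_NUMBER so its output is deterministic.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_rank_demo (
    name   TEXT,
    salary INTEGER
);

INSERT INTO emp_rank_demo (name, salary) VALUES
    ('Ana',  90000),
    ('Ben',  80000),
    ('Cara', 80000),   -- ties with Ben
    ('Dev',  70000);

SELECT name,
       salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num,
       RANK()       OVER (ORDER BY salary DESC)       AS salary_rank,
       DENSE_RANK() OVER (ORDER BY salary DESC)       AS dense_salary_rank
FROM emp_rank_demo
ORDER BY salary DESC, name;
-- name | salary | row_num | salary_rank | dense_salary_rank
-- Ana  | 90000  | 1       | 1           | 1
-- Ben  | 80000  | 2       | 2           | 2
-- Cara | 80000  | 3       | 2           | 2
-- Dev  | 70000  | 4       | 4           | 3
```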

Aggregating Data with OVER Clause

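The OVER clause defines the window for aggregate functions. Adding ORDER BY inside OVER makes the aggregate cumulative, so SUM returns a running total. Employees with equal salaries are peers under the default frame and receive the same total. To calculate a running total of salaries: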

SELECT name, salary, SUM(salary) OVER (ORDER BY salary) AS running_total FROM employees;
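
A running total can also restart for each group by adding PARTITION BY, and an explicit ROWS frame makes it advance one row at a time even when salaries tie. This is a minimal sketch, assuming PostgreSQL or SQLite (3.25 or later) syntax and a hypothetical table emp_window_demo.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_window_demo (
    name       TEXT,
    department TEXT,
    salary     INTEGER
);

INSERT INTO emp_window_demo (name, department, salary) VALUES
    ('Ana',  'Sales',       40000),
    ('Ben',  'Sales',       40000),   -- ties with Ana
    ('Fay',  'Sales',       52000),
    ('Cara', 'Engineering', 80000),
    ('Dev',  'Engineering', 60000);

SELECT department,
       name,
       salary,
       SUM(salary) OVER (
           PARTITION BY department            -- restart the total per department
           ORDER BY salary, name              -- name breaks salary ties
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS dept_running_total
FROM emp_window_demo
ORDER BY department, salary, name;
-- Engineering: Dev 60000, Cara 140000
-- Sales:       Ana 40000, Ben 80000, Fay 132000
```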

Common Table Expressions (CTEs)

Basics of CTEs

A CTE defines a named, temporary result set that exists only for the statement it belongs to. To define and use a CTE:

WITH HighSalaryEmployees AS (
    SELECT * FROM employees WHERE salary > 70000
)
SELECT * FROM HighSalaryEmployees;

Recursive CTEs for Hierarchical Data

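Recursive CTEs handle hierarchical data. The first SELECT is the anchor (employees with no manager), and the second SELECT joins back to the CTE to add each next level of reports. PostgreSQL, MySQL 8.0 and later, and SQLite require the RECURSIVE keyword; SQL Server omits it. To list an employee hierarchy:

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT id, name, manager_id FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id FROM employees e
    INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.id
)
SELECT * FROM EmployeeHierarchy;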
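
A common extension is to carry a depth counter through the recursion so each row records its level in the hierarchy. This is a minimal sketch, assuming PostgreSQL or SQLite syntax; the table emp_tree_demo and the depth column are hypothetical additions for illustration.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_tree_demo (
    id         INTEGER PRIMARY KEY,
    name       TEXT,
    manager_id INTEGER
);

INSERT INTO emp_tree_demo (id, name, manager_id) VALUES
    (1, 'Ana',  NULL),   -- top of the hierarchy
    (2, 'Ben',  1),
    (3, 'Cara', 1),
    (4, 'Dev',  2);

WITH RECURSIVE hierarchy AS (
    -- Anchor member: employees with no manager start at depth 0.
    SELECT id, name, manager_id, 0 AS depth
    FROM emp_tree_demo
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: each report sits one level below its manager.
    SELECT e.id, e.name, e.manager_id, h.depth + 1
    FROM emp_tree_demo e
    INNER JOIN hierarchy h ON e.manager_id = h.id
)
SELECT id, name, depth
FROM hierarchy
ORDER BY depth, id;
-- 1 Ana 0
-- 2 Ben 1
-- 3 Cara 1
-- 4 Dev 2
```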

Using CTEs for Complex Queries

CTEs simplify complex queries by giving an intermediate result a name. To calculate the total salary cost and average salary for each department:

WITH DepartmentSalaries AS (
    SELECT department, SUM(salary) AS total_salary, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT * FROM DepartmentSalaries;
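
The benefit of naming the aggregate shows up once it is joined back to the detail rows. This is a minimal sketch, assuming PostgreSQL or SQLite syntax and a hypothetical table emp_cte_demo, that compares each salary with its department average.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_cte_demo (
    name       TEXT,
    department TEXT,
    salary     INTEGER
);

INSERT INTO emp_cte_demo (name, department, salary) VALUES
    ('Ana',  'Sales',       40000),
    ('Ben',  'Sales',       52000),
    ('Cara', 'Engineering', 80000),
    ('Dev',  'Engineering', 60000);

WITH department_salaries AS (
    -- One row per department with its average salary.
    SELECT department, AVG(salary) AS avg_salary
    FROM emp_cte_demo
    GROUP BY department
)
SELECT e.name,
       e.department,
       e.salary,
       e.salary - ds.avg_salary AS diff_from_dept_avg
FROM emp_cte_demo e
INNER JOIN department_salaries ds ON e.department = ds.department
ORDER BY e.department, e.name;
-- Cara is 10000 above and Dev 10000 below the Engineering average of 70000;
-- Ana is 6000 below and Ben 6000 above the Sales average of 46000.
```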

Data Modification Queries

Inserting Data with INSERT

The INSERT statement adds new rows to a table. To insert a new employee:

INSERT INTO employees (name, department, salary) VALUES ('John Doe', 'Sales', 60000);

Updating Data with UPDATE

The UPDATE statement modifies existing data. To give all employees in ‘Sales’ a 10% raise:

UPDATE employees SET salary = salary * 1.10 WHERE department = 'Sales';

Deleting Data with DELETE

The DELETE statement removes rows from a table. To delete employees with a salary below 30000:

DELETE FROM employees WHERE salary < 30000;
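
Because a DELETE cannot be undone once committed, it can help to preview the affected rows and run the statement inside a transaction. This is a minimal sketch of that habit, assuming PostgreSQL or SQLite syntax and a hypothetical table emp_delete_demo.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_delete_demo (
    name   TEXT,
    salary INTEGER
);

INSERT INTO emp_delete_demo (name, salary) VALUES
    ('Ana',  25000),
    ('Ben',  45000),
    ('Cara', 28000);

BEGIN;

-- Preview with the same WHERE clause the DELETE will use (Ana and Cara).
SELECT name, salary FROM emp_delete_demo WHERE salary < 30000;

DELETE FROM emp_delete_demo WHERE salary < 30000;

-- Inspect what is left (only Ben) before making the change permanent.
SELECT name, salary FROM emp_delete_demo;

-- ROLLBACK undoes the delete; use COMMIT instead to keep it.
ROLLBACK;
```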

Merging Data with MERGE (Upserts)

The MERGE statement combines insert and update operations in a single statement (an upsert). SQL Server and PostgreSQL 15 and later support the form shown here; MySQL and SQLite do not provide MERGE. To insert or update employee records from a staging table named new_employees:

MERGE INTO employees AS target
USING new_employees AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name, salary = source.salary
WHEN NOT MATCHED THEN
    INSERT (id, name, salary) VALUES (source.id, source.name, source.salary);
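
Where MERGE is not available, PostgreSQL and SQLite (3.24 or later) offer INSERT ... ON CONFLICT, which is a different statement that achieves a similar upsert. This is a minimal sketch of that alternative, assuming a hypothetical table emp_upsert_demo whose primary key is id.

```sql
-- Hypothetical sample data; table and column names are assumptions.
CREATE TEMP TABLE emp_upsert_demo (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    salary INTEGER
);

INSERT INTO emp_upsert_demo (id, name, salary) VALUES (1, 'Ana', 50000);

-- id 1 already exists, so that row is updated; id 2 is new, so it is inserted.
-- "excluded" refers to the row that the INSERT tried to add.
INSERT INTO emp_upsert_demo (id, name, salary) VALUES
    (1, 'Ana', 55000),
    (2, 'Ben', 48000)
ON CONFLICT (id) DO UPDATE
SET name   = excluded.name,
    salary = excluded.salary;

SELECT id, name, salary FROM emp_upsert_demo ORDER BY id;
-- 1 Ana 55000
-- 2 Ben 48000
```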

Conclusion

SQL is an essential part of a data scientist’s toolkit because it supports efficient data extraction, manipulation, and analysis. Knowing both basic and advanced SQL lets a data scientist work with varied data sets and extract useful information from them. SELECT, WHERE, and JOIN cover everyday data retrieval, while window functions, CTEs, and pivoting support more involved calculations and reports. Applied together, these queries make a data scientist’s work faster, the analysis of complex data more accurate, and decisions better informed.

Top SQL Queries for Data Scientists – FAQs

What is SQL, and why does it matter to data scientists?

SQL, or Structured Query Language, is the language used to query and modify data in relational databases. It matters to data scientists because it lets them retrieve and reshape exactly the data they need quickly.

Which basic SQL statements should data scientists know?

The fundamental ones are SELECT for retrieving data, WHERE for filtering it, ORDER BY for sorting the results, and JOIN for combining data from two or more tables.

What are advanced SQL queries and how do they differ from basic ones?

Advanced SQL queries include window functions, which calculate across related rows; common table expressions (CTEs), which define named temporary result sets; and pivoting, which turns rows into columns. They are more complex than basic queries and support more sophisticated analysis.

How do window functions work in SQL?

Window functions compute a value for each row from a set of rows related to it, such as a running total, a rank, or a moving average. That set is defined by the OVER clause, which can partition the data and order the rows within each partition.

What does CTE stand for, and under what circumstances should it be used?

CTE stands for common table expression. It is a temporary, named result set defined in a WITH clause and referenced by the statement that follows. Use it to break a complex query into simpler steps, which improves readability and lets the same intermediate result be referenced more than once within the query.

Referred: https://www.geeksforgeeks.org
