70+ SQL Interview Questions: Your Comprehensive Guide to Acing Your Interview

Getting ready for a SQL interview can be simple with the right resources. SQL, or Structured Query Language, is required in most data and tech job listings. It’s a must-have skill for professionals in data analytics, software development, and database management.

This guide provides a comprehensive list of SQL interview questions and answers across different experience levels to help you prepare effectively and boost your confidence.

Importance of SQL in Tech Interviews

SQL is ubiquitous in the tech industry. Whether you are aiming to be a software engineer, data analyst, or database administrator, you will likely encounter SQL questions in technical interviews. In fact, SQL has consistently ranked among the top three most-used programming languages for many years, which reflects its broad adoption.

Employers highly value SQL proficiency – for example, almost all data analyst roles require SQL knowledge, and it’s often considered the most important skill to land a data job. Interviewers include SQL questions to ensure candidates can work with databases, retrieve data efficiently, and understand data relationships.

Being well-versed in SQL demonstrates your ability to handle real-world data problems. From startups to tech giants, companies like Google, Amazon, and Facebook rely on SQL for data storage and analysis. It’s no surprise that SQL interview questions are common in technical screenings. A solid grasp of SQL not only helps you clear interviews but also prepares you for the day-to-day demands of modern tech roles.

Why SQL Skills Are Essential in Modern Tech Roles

Beyond interviews, SQL skills are critical in today’s data-driven workplaces. Almost every application or service uses a database in the backend – from web apps storing user information to analytics platforms crunching business data.

Being able to write efficient SQL queries means you can extract insights from data and contribute to informed decision-making. Modern tech roles emphasize data literacy, and SQL provides a universal way to communicate with relational databases.

Even with the rise of big data and NoSQL technologies, SQL remains highly relevant. Many big data tools (like Hive or SparkSQL) use SQL-like syntax, and cloud data warehouses (Redshift, BigQuery, etc.) also use SQL.

This means if you know SQL, you can adapt to various data platforms easily. As data volumes grow, companies need professionals who can manage and query data efficiently. SQL skills ensure you can handle tasks like generating reports, joining data from multiple sources, optimizing queries for performance, and maintaining data integrity.

Now, let’s dive into the actual interview questions. We will start with basic SQL interview questions (ideal for freshers or those just learning SQL), then move to intermediate questions, advanced topics for experienced candidates, and finally some practical SQL query examples.

Basic SQL Interview Questions and Answers (for Freshers)

These basic SQL interview questions cover fundamental concepts that beginners or fresh graduates are expected to know. If you are a fresher, focus on understanding these thoroughly.

1. What is a database?

A database is an organized collection of data stored and accessed electronically. Databases typically store information in tables (rows and columns) and are managed by a Database Management System (DBMS). In simple terms, a database allows you to store, retrieve, update, and manage data in a structured way.

2. What is SQL?

SQL (Structured Query Language) is a standard programming language used for managing and manipulating relational databases. SQL allows you to create databases and tables, insert or update data, delete data, and query data using simple, declarative statements. In other words, SQL is the language through which you communicate with a relational database to perform CRUD operations (Create, Read, Update, Delete).

3. What is a DBMS, and how is an RDBMS different from a DBMS?

A DBMS (Database Management System) is software that interacts with end-users, applications, and the database to capture and analyze data. It provides an interface to perform database operations. A RDBMS (Relational Database Management System) is a type of DBMS that stores data in a structured format using rows and columns (tables). The key difference is that an RDBMS uses relations (tables) and enforces relationships between tables through keys, whereas a general DBMS might not support relations and could store data in other formats (like hierarchical or file-based storage). In summary, all RDBMS are DBMS, but not all DBMS are relational – RDBMS organize data into tables and support the SQL language for querying.

4. Can you name some popular RDBMS examples?

Common examples of RDBMS software include MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, and SQLite. All of these use SQL as the query language (with some dialect differences) and are widely used in industry. For instance, MySQL is popular for web applications, SQL Server is used in many enterprise environments, and SQLite is a lightweight database often used in mobile applications. (Notably, the term SQL Server usually refers to Microsoft SQL Server.)

5. What are the different types of SQL commands?

SQL commands are commonly classified into categories based on their purpose:

  • DDL (Data Definition Language): Commands that define or modify database structure. Examples: CREATE, ALTER, DROP, TRUNCATE. (For example, CREATE TABLE to create a new table.)
  • DML (Data Manipulation Language): Commands for data manipulation. Examples: SELECT, INSERT, UPDATE, DELETE. These allow you to retrieve and modify data in tables.
  • DCL (Data Control Language): Commands related to permissions. Examples: GRANT and REVOKE – they control access to data in the database.
  • TCL (Transaction Control Language): Commands that deal with transaction management. Examples: COMMIT, ROLLBACK, SAVEPOINT – these ensure data integrity by grouping SQL statements into transactions.
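For illustration, here is a minimal example of each category in action (Products and report_user are hypothetical names):

-- DDL: define or change structure
CREATE TABLE Products (ProductID INT, Name VARCHAR(100));

-- DML: work with the data itself
INSERT INTO Products (ProductID, Name) VALUES (1, 'Laptop');

-- DCL: control access
GRANT SELECT ON Products TO report_user;

-- TCL: manage the transaction
COMMIT;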

6. What is a table in SQL?

In SQL, a table is a structured collection of data organized into rows and columns. Each table represents an entity or concept (for example, a table named Employees might store data about employees). Columns (fields) define the attributes of the entity (e.g., Name, Age, Salary), and each row (record) in the table is a single entry with values for those attributes. Think of a table like a spreadsheet: the columns are headers, and each row is a line of data under those headers.

7. What is the difference between CHAR and VARCHAR data types?

Both CHAR and VARCHAR are used to store string/text data, but they behave differently:

  • CHAR(n): A fixed-length character type. If you define a column as CHAR(10), it always reserves space for 10 characters per entry, padding shorter values with trailing spaces. For example, storing “SQL” in a CHAR(10) occupies the full 10 characters (with 7 trailing spaces).
  • VARCHAR(n): A variable-length character type. If a column is VARCHAR(50), it can store up to 50 characters, but it uses only as much space as the actual text plus a small length overhead. For example, “SQL” in a VARCHAR(50) uses storage for just 3 characters (plus the overhead).

In essence, CHAR is best for fixed-size data (like country codes of 2 characters), while VARCHAR is more space-efficient for varying-length data.
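As a quick sketch of both types in a table definition (column names are hypothetical):

CREATE TABLE Customers (
    CountryCode CHAR(2),       -- always stored as exactly 2 characters, e.g., 'US'
    FullName    VARCHAR(100)   -- up to 100 characters, stored compactly
);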

8. What is a primary key in SQL?

A primary key is a column (or combination of columns) in a table that uniquely identifies each row in that table. Primary keys have two main properties: they must be unique for each row, and they cannot contain NULL. By defining a primary key, you enforce entity integrity – no two rows can have the same primary key value, and every row must have a value for the primary key. For example, a table of Employees might use an EmployeeID number as the primary key so that each employee record can be fetched or updated uniquely by that ID.
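Using the Employees example above, a primary key might be declared like this (column types are assumptions):

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,      -- unique and NOT NULL for every row
    Name       VARCHAR(100) NOT NULL
);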

9. What is a foreign key in SQL?

A foreign key is a column (or set of columns) in one table that refers to the primary key of another table. A foreign key creates a relationship between two tables, enforcing referential integrity. This means you cannot have a value in the foreign key column that doesn’t exist in the referenced primary key column of the other table. For example, if you have an Orders table with a foreign key CustomerID referencing Customers(CustomerID), the database will ensure every CustomerID in Orders corresponds to a valid record in the Customers table. Foreign keys help maintain consistency across related tables by preventing “orphan” records.
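The Orders example might be sketched like this (column types are assumptions):

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT,
    -- every CustomerID inserted here must exist in Customers(CustomerID)
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);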

10. What is a unique key in SQL?

A unique key (unique constraint) is a constraint that ensures all values in a column (or a combination of columns) are distinct across the table. Like a primary key, a unique key prevents duplicate values in the constrained column. The differences are that a table can have multiple unique keys (but only one primary key), and unique keys may allow a single NULL (depending on the RDBMS) whereas primary keys do not allow NULL at all. Unique keys are useful for columns that need uniqueness but are not the primary means of identifying a row. For example, an Employees table might have a primary key EmployeeID, but you could also enforce a unique constraint on the Email column so that no two employees share the same email address.

11. What is a composite key?

A composite key is a key composed of more than one column used to uniquely identify a row in a table. In other words, the combination of those columns is unique for each row, even if individually the columns might not be unique. A composite key can be a primary key or a unique key. For example, consider a table Enrollments with columns StudentID and CourseID. You might create a composite primary key on (StudentID, CourseID) to ensure that the same student cannot enroll in the same course more than once. Neither StudentID nor CourseID alone might be unique in the table, but together they form a unique pair.
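The Enrollments example could be declared like this (a sketch; column types are assumed):

CREATE TABLE Enrollments (
    StudentID INT,
    CourseID  INT,
    -- the combination must be unique, even though each column alone may repeat
    PRIMARY KEY (StudentID, CourseID)
);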

12. What is a JOIN in SQL and what are its types?

A JOIN in SQL is used to combine rows from two or more tables based on a related column between them. By using joins, you can query data from multiple tables as if they were one. The common types of SQL joins are:

  • INNER JOIN: Returns only the rows where there is a match in both joined tables. (This is the default join type if just JOIN is specified.)
  • LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, and the matching rows from the right table. If there’s no match, the result will have NULLs for columns of the right table.
  • RIGHT JOIN (RIGHT OUTER JOIN): Opposite of left join; returns all rows from the right table, and matching rows from the left table (NULLs for left-table columns when no match).
  • FULL JOIN (FULL OUTER JOIN): Returns all rows when there is a match in either left or right table. Non-matching rows from either side will show NULLs for the other side.
  • CROSS JOIN: Returns the Cartesian product of the two tables (every row of the first table paired with every row of the second table). CROSS JOIN doesn’t require a matching column and is rarely used in practice (it can generate a very large number of rows).

For example, an INNER JOIN between a Customers table and Orders table on the CustomerID will yield only those customer-order pairs where the Customer exists and has made an order.
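That Customers/Orders join might be written like this (column names are assumptions):

SELECT C.Name, O.OrderID
FROM Customers C
INNER JOIN Orders O
  ON C.CustomerID = O.CustomerID;   -- only customers who have at least one order appear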

13. What is the difference between INNER JOIN and OUTER JOIN?

An INNER JOIN returns only those rows that have matching values in both tables being joined. If a row in one table has no corresponding match in the other table, it will not appear in the result.

An OUTER JOIN will return the matched rows plus any unmatched rows from one or both tables, depending on whether it’s a LEFT, RIGHT, or FULL outer join:

  • In a LEFT OUTER JOIN, all rows from the left table appear in the result; unmatched right-table values are filled with NULL.
  • In a RIGHT OUTER JOIN, all rows from the right table appear; unmatched left-table values are NULL.
  • In a FULL OUTER JOIN, all rows from both tables appear. Unmatched entries on either side will have NULLs for the other side.

In summary, inner joins yield the intersection of the tables (only where they overlap), while outer joins yield the union of the tables (all rows, with NULLs where there’s no overlap).

14. What is a self join?

A self join is when a table is joined with itself. This is useful when the table contains hierarchical or relational data in a single table. In a self join, you treat one instance of the table as the “left” table and another instance as the “right” table. Usually, you’ll use table aliases to differentiate the two roles.

For example, consider an Employees table that has a column ManagerID which is a foreign key referencing EmployeeID in the same table (each employee can have another employee as a manager). To list employees along with their manager’s name, you would self join the Employees table to itself, matching Employees.ManagerID (in the left/child alias) to Employees.EmployeeID (in the right/manager alias). The self join allows you to retrieve related data within one table.
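A sketch of that query, using e for the employee and m for the manager:

SELECT e.Name AS Employee, m.Name AS Manager
FROM Employees e
JOIN Employees m
  ON e.ManagerID = m.EmployeeID;   -- match each employee's ManagerID to a manager's EmployeeID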

15. What are constraints in SQL?

Constraints are rules applied to table columns to enforce data integrity and consistency. They restrict the type of data that can go into a table. Common SQL constraints include:

  • NOT NULL: Ensures a column cannot have NULL value (every row must have a value for this column).
  • UNIQUE: Ensures all values in a column (or a set of columns) are distinct across the table.
  • PRIMARY KEY: Uniquely identifies each row in a table (combines NOT NULL and UNIQUE). There can be only one primary key per table.
  • FOREIGN KEY: Ensures referential integrity by requiring that a value in one table (child) corresponds to an existing value in another table (parent). It links two tables.
  • CHECK: Ensures that all values in a column satisfy a specific condition. For example, CHECK (age >= 0) on an Age column ensures no negative ages can be inserted.
  • DEFAULT: Specifies a default value for a column when none is provided. For instance, DEFAULT 0 could be set for a quantity column so that if not specified, it defaults to 0.

Using constraints guarantees that the data in the database adheres to the business rules and expected formats (for example, a CHECK constraint can ensure a date is in the future, or a score is within 0 and 100).
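Several of these constraints can appear together in one table definition (a sketch; names and types are hypothetical):

CREATE TABLE Players (
    PlayerID INT PRIMARY KEY,              -- implies NOT NULL and UNIQUE
    Email    VARCHAR(255) UNIQUE,          -- no two players share an email
    Age      INT NOT NULL CHECK (Age >= 0),
    Score    INT DEFAULT 0 CHECK (Score BETWEEN 0 AND 100),
    TeamID   INT REFERENCES Teams(TeamID)  -- foreign key to a parent table
);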

16. What does a NULL value mean in SQL?

A NULL value in SQL represents the absence of a value or an unknown value in a column. It is not the same as zero or an empty string – it literally means no data. Key points about NULL:

  • Any comparison with NULL using the usual = or != will result in UNKNOWN, not true or false. That’s why to check for NULL you must use IS NULL or IS NOT NULL.
  • NULLs are ignored by aggregate functions like COUNT(column) (which counts only non-null values) but COUNT(*) will count rows regardless of NULLs.
  • A column defined with a NOT NULL constraint cannot have NULL values.

For example, if a person’s middle name is not provided, the MiddleName column could be NULL for that record. In a condition, WHERE MiddleName = 'John' will not catch rows where MiddleName is NULL (because NULL doesn’t equal anything, not even NULL).
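A quick illustration of NULL handling (People and MiddleName are hypothetical names):

-- = never matches NULL; use IS NULL instead
SELECT Name FROM People WHERE MiddleName IS NULL;

-- COUNT(*) counts all rows; COUNT(MiddleName) skips NULLs
SELECT COUNT(*) AS AllRows, COUNT(MiddleName) AS WithMiddleName
FROM People;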

17. What is the difference between DELETE, TRUNCATE and DROP?

These SQL commands are all used to remove data, but at different levels and with different scope:

  • DELETE: DELETE FROM table WHERE condition removes rows from a table that satisfy the condition. You can delete all rows by omitting the WHERE clause (e.g., DELETE FROM table;). DELETE is a DML operation, so it logs individual row deletions and can be rolled back if within a transaction. After a DELETE, the table and its structure remain, and you can also filter which rows to delete.
  • TRUNCATE: TRUNCATE TABLE table quickly removes all rows from a table. It is a DDL operation (in many databases) that deallocates the data pages, so it’s faster for bulk removal and uses fewer system resources. You cannot specify a WHERE clause – it’s all or nothing. Truncate operations are generally not logged row-by-row (making them faster) and in some systems cannot be rolled back (unless used within transactional context on systems that allow it). The table structure remains (just like with DELETE).
  • DROP: DROP TABLE table deletes the entire table structure from the database, including all its rows, schema, and dependencies. After a DROP, the table no longer exists in the database (you would have to re-create it if needed). Similarly, DROP DATABASE dbname would remove an entire database. Drop is a DDL operation and cannot be rolled back once executed (in most systems, except within some transaction-supporting DDL contexts).

In summary: use DELETE to remove specific rows (or all rows) but keep the table, TRUNCATE to quickly remove all rows and reset the table (keeping structure), and DROP to completely delete the table (or other object) from the database.
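Side by side, the three commands look like this (Orders is a hypothetical table):

DELETE FROM Orders WHERE Status = 'Cancelled';  -- remove matching rows only
TRUNCATE TABLE Orders;                          -- remove all rows, keep the table
DROP TABLE Orders;                              -- remove the table itself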

18. What is the WHERE clause used for?

The WHERE clause in an SQL query is used to filter rows that meet a specified condition. It follows the FROM (and JOIN) part of a SELECT, and it restricts which rows are returned (for SELECT) or affected (for UPDATE or DELETE). Only rows for which the WHERE condition evaluates to true are processed.

For example:

SELECT *
FROM Employees
WHERE Department = 'Sales' AND Salary > 50000;

This query will retrieve only those employees who are in the Sales department and have a salary greater than 50,000. Rows that do not meet both conditions are filtered out. In summary, the WHERE clause is essential for querying a subset of data based on conditions (it’s like a filter).

19. How do you retrieve unique values from a column?

To select only distinct (unique) values from a column in SQL, you use the DISTINCT keyword in your SELECT query. SELECT DISTINCT eliminates duplicate rows from the result set. It can be applied to one or multiple columns.

Example:

SELECT DISTINCT Country
FROM Customers;

This query will return each country that appears in the Customers table only once, removing any duplicates. If multiple columns are specified after DISTINCT, the combination of those columns is considered (distinct pairs, etc.).

Keep in mind that DISTINCT applies to the entire row’s combination of selected columns. It’s commonly used to find unique values or combinations in a dataset.

20. How do you sort the result set of a query?

You use the ORDER BY clause to sort the results of a query. The ORDER BY clause comes at the end of a SELECT statement. You can specify one or more columns to sort by, and you can sort each in ascending (ASC) or descending (DESC) order. By default, ORDER BY sorts in ascending order (ASC).

Example:

SELECT Name, Age
FROM Students
ORDER BY Age DESC;

This will list students sorted by Age in descending order (from oldest to youngest). You could add a second sort key, e.g., ORDER BY Age DESC, Name ASC to sort by age (desc) and then name (asc) for students with the same age.

21. How can you perform pattern matching in SQL?

You perform pattern matching in SQL using the LIKE operator (often in a WHERE clause) along with wildcard characters:

  • The % wildcard matches any sequence of characters (including an empty sequence).
  • The _ wildcard matches exactly one character.

For example:

SELECT * 
FROM Employees
WHERE Name LIKE 'J%';

This finds all employees whose Name starts with “J” (e.g., “John”, “Jane”, “Jacob”). Some more examples:

  • WHERE Name LIKE '%son' finds values ending in “son” (such as “Johnson”).
  • WHERE Name LIKE '_r%' finds values where the second letter is ‘r’ (e.g., “Grace”, “Fred”).

Pattern matching is case-sensitive in some databases (like PostgreSQL) by default and not in others (like MySQL, unless using a case-sensitive collation). If you need case-insensitive matching in a case-sensitive DB, functions like LOWER() on both sides can be used.

22. What are aggregate functions? Can you name a few?

Aggregate functions perform a calculation on a set of values and return a single value. They are often used in combination with grouping (the GROUP BY clause) to summarize data. Common aggregate functions include:

  • COUNT(): Counts the number of values (or rows). For example, COUNT(*) counts all rows, whereas COUNT(column) counts non-NULL values in that column.
  • SUM(): Computes the sum of numeric values.
  • AVG(): Calculates the average of numeric values.
  • MIN(): Finds the minimum value in a set.
  • MAX(): Finds the maximum value in a set.

For example, to get the total and average salary from an Employees table:

SELECT SUM(Salary) AS TotalSalary, AVG(Salary) AS AvgSalary
FROM Employees;

This will return a single row with the total salary of all employees and the average salary. Aggregate functions skip NULL values (except COUNT(*) which counts rows regardless of NULLs).

23. What is the GROUP BY clause used for?

The GROUP BY clause in SQL is used in conjunction with aggregate functions to group result rows by one or more columns. It allows you to perform aggregations (like SUM, COUNT, etc.) on subsets of the data that share common values.

In practice, GROUP BY takes the result set and assembles rows with the same values in specified column(s) into groups. Then aggregate functions can be applied to each group independently.

For example:

SELECT Department, AVG(Salary) 
FROM Employees
GROUP BY Department;

This query groups the employees by Department and then calculates the average salary for each department. The result would show one row per department with its average salary.

Any column in the SELECT that is not inside an aggregate must be listed in the GROUP BY clause. Essentially, GROUP BY defines the granularity of the aggregation (e.g., per department, per month, etc.).

24. What is the difference between the WHERE and HAVING clauses?

Both WHERE and HAVING are filtering clauses, but they apply at different stages of query processing:

  • WHERE filters rows before any grouping or aggregation occurs. It selects which rows from the table are considered in the query or sent to the next step (like grouping). The WHERE clause cannot refer to aggregate functions (since at that stage, aggregates haven’t been computed yet).
  • HAVING filters groups after the GROUP BY and aggregation step. It is like a WHERE for aggregated data. You use HAVING to apply conditions on aggregate results or on groups.

In short, use WHERE to filter individual rows, and use HAVING to filter aggregated groups.

For example, consider:

SELECT Department, COUNT(*) as EmpCount
FROM Employees
GROUP BY Department
HAVING COUNT(*) > 10;

Here, the WHERE clause is not used, but the HAVING clause ensures that only departments with more than 10 employees appear in the result. If we had a WHERE, it could have been used to restrict which employees are counted (e.g., WHERE Status = 'Active' to count only active employees). Whereas HAVING is used to restrict which department groups make it to the final output (those with EmpCount > 10 in this case).

25. What is normalization in databases?

Normalization is the process of organizing data in a database to minimize redundancy (duplicate data) and ensure data integrity. It usually involves dividing a database into multiple related tables and defining relationships (using foreign keys) between them. By doing so, each piece of data is stored only once, and updates or deletions of data occur in one place.

Figure. Data Normalization

There are several normal forms (1NF, 2NF, 3NF, BCNF, etc.), each with specific rules, but generally:

  • 1NF (First Normal Form): Ensure each column has atomic (indivisible) values and each record is unique.
  • 2NF (Second Normal Form): Be in 1NF and every non-key column is fully dependent on the table’s primary key.
  • 3NF (Third Normal Form): Be in 2NF and all columns are directly dependent on the primary key (no transitive dependencies).

To summarize, normalization means creating separate tables for different entities and linking them via keys. For example, instead of one big table with departments repeated in every employee row, you’d have an Employees table and a Departments table, and a DepartmentID foreign key to link them. This reduces duplication of department info and makes the data more consistent.
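That employee/department split might be sketched like this:

CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DeptName     VARCHAR(100)        -- department info stored exactly once
);

CREATE TABLE Employees (
    EmployeeID   INT PRIMARY KEY,
    Name         VARCHAR(100),
    DepartmentID INT REFERENCES Departments(DepartmentID)  -- link, instead of repeating DeptName
);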

26. What is denormalization?

Denormalization is the deliberate process of introducing redundancy into a database for performance optimization. It’s essentially the opposite of normalization. While normalization is about reducing redundancy and improving integrity, denormalization accepts some redundancy to reduce the number of joins or complex queries needed, thereby speeding up read operations in certain scenarios.

For example, in a normalized design you might have to join 3–4 tables to get a report. If that report is needed frequently and the join is expensive, you might denormalize by creating a single table or a materialized view that consolidates relevant data, even if it duplicates some data. This reduces the need for joins. The trade-off is that updates become more complex (you have to update redundant data in multiple places) and there’s a risk of data becoming inconsistent.

Figure. Normalization vs. Denormalization

27. What are the differences between SQL and NoSQL databases?

“SQL” databases typically refers to relational databases (RDBMS) that use SQL, whereas NoSQL refers to a broad category of non-relational databases. Key differences include:

  • Data Model: SQL databases use a fixed schema defined by tables (rows and columns). They are best for structured data. NoSQL databases can be document stores, key-value stores, wide-column stores, or graph databases, handling unstructured or semi-structured data. They often allow flexible or dynamic schemas.
  • Schema: SQL DBs require a predefined schema – data must conform to the structure. NoSQL DBs often allow schema-less data or schema-on-read, meaning each record (e.g., document) can have different fields.
  • Query Language: Relational DBs use SQL for queries. NoSQL DBs may use JSON-based queries, proprietary query APIs, or no declared query language at all (in case of simple key-value stores).
  • Joins: SQL databases support JOIN operations to combine data. Many NoSQL databases do not support joins (data is often aggregated in a single document or denormalized form).
  • Transactions: SQL databases generally support ACID transactions (ensuring consistency for complex, multiple-step operations). Some NoSQL solutions prioritize availability and partition tolerance (see CAP theorem) and may sacrifice strict ACID compliance, though many offer at least atomic updates within a document.
  • Scalability: SQL databases are traditionally scaled vertically (bigger hardware), though modern ones can do read-replicas and sharding. NoSQL databases are often designed to scale horizontally (distribute data across multiple servers) easily, which is great for big data and high throughput use cases.

SQL databases (e.g., MySQL, PostgreSQL) are great for structured data and complex queries with transactions, whereas NoSQL databases (like MongoDB, Cassandra) are chosen for large volumes of data, flexible schemas, or distributed architectures.

28. How do you insert a new record in a SQL table?

You insert new records using the INSERT statement. There are two common syntaxes:

  • Inserting a single row by specifying values:

INSERT INTO TableName (Column1, Column2, Column3)
VALUES (Value1, Value2, Value3);

This inserts one new row with the given values corresponding to the specified columns.

  • Inserting multiple rows in one statement (supported by many RDBMS):

INSERT INTO TableName (Column1, Column2)
VALUES 
  (Value1a, Value2a),
  (Value1b, Value2b),
  ...;

Example: Suppose we have a Customers(Name, Age, Country) table. To insert a new customer:

INSERT INTO Customers (Name, Age, Country)
VALUES ('Alice', 30, 'USA');

This will add a new row for Alice. If you omit the column list, you must provide values for all columns in the table in the correct order. Also, if a column has a DEFAULT value or is auto-incrementing, you can omit it and the database will fill it in.

29. How do you update an existing record in SQL?

You use the UPDATE statement along with a WHERE clause to modify existing records. The basic syntax is:

UPDATE TableName
SET Column1 = new_value1,
    Column2 = new_value2,
    ...
WHERE condition;

Without a WHERE clause, all rows in the table will be updated (so be careful!).

Example: To give a 10% salary raise to an employee with ID 5:

UPDATE Employees
SET Salary = Salary * 1.10
WHERE EmployeeID = 5;

This finds the row(s) where EmployeeID is 5 and updates the Salary for those row(s). If the WHERE clause matches multiple rows, all those rows will be updated. If it matches none, zero rows are affected.

Always ensure your WHERE clause is specific to avoid accidental mass-updates. You can check what will be updated first by doing a SELECT with the same condition.

30. What is a NATURAL JOIN in SQL?

A NATURAL JOIN is a type of join that automatically joins two tables based on all columns with the same name in both tables. Essentially, the database figures out the matching columns by name and implicitly creates the join condition. It’s a shorthand – you do not explicitly write the ON clause.

For example:

SELECT *
FROM Students NATURAL JOIN Enrollments;

If both Students and Enrollments tables have a column named StudentID, the natural join will join on Students.StudentID = Enrollments.StudentID automatically. It will also not duplicate the join column in the result.

However, natural joins can be risky in practice:

  • If column names change or new matching names are added, the join behavior changes.
  • It might unintentionally join on a column you did not intend to, if two tables have a same-named column that isn’t actually a logical relation.

Natural joins are not supported by all databases and are generally avoided in production code in favor of explicit JOIN ... ON ... clauses for clarity and control.

Intermediate SQL Interview Questions

In this section, we move to intermediate SQL interview questions, which focus on more complex SQL queries, joins across multiple tables, subqueries, and basic performance considerations. Interviewers at this level expect you to know how to write correct SQL for various scenarios and to understand how relational databases work under the hood (indexes, transactions, etc.).

Let’s explore some common questions in this category.

31. What is a view in SQL?

A view is a virtual table in SQL, defined by a query. It does not store data physically (except in the case of materialized views, covered later) but presents data from one or more tables through a SELECT query. Once created, a view can be used in queries just like a regular table.

For example, you can create a view:

CREATE VIEW HighSalaryEmployees AS
SELECT Name, Department, Salary
FROM Employees
WHERE Salary > 100000;

Now HighSalaryEmployees can be queried as if it were a table, e.g., SELECT * FROM HighSalaryEmployees;. This will internally run the view’s query.

Views serve several purposes:

  • Simplify complex queries: You can select from a view that joins multiple tables instead of writing the complex join repeatedly.
  • Security: You can grant users access to specific views rather than whole tables, so they only see a subset of the data.
  • Abstraction: They provide a level of abstraction; underlying table structures can change while the view remains the same for the end user.

Note: a standard view always reflects the current data from underlying tables (it’s like a saved query). Also, some views are updatable (you can insert/update through them) if they meet certain criteria (single source table, all necessary primary key columns present, etc.), but others are read-only.

32. What is a stored procedure in SQL?

A stored procedure is a precompiled set of one or more SQL statements that are stored on the database server. You can call (execute) the procedure by name, optionally passing parameters, rather than sending multiple individual SQL statements from your application. Stored procedures are like functions or subroutines in programming that live in the database.

Key points about stored procedures:

  • They can contain complex logic, including variables, control flow (IF, loops), etc., depending on the SQL dialect (PL/SQL for Oracle, T-SQL for SQL Server, etc.).
  • They can accept parameters (IN, OUT, INOUT) and return results or output parameters. Some procedures return a result set like a query, others just perform actions.
  • Because they are precompiled and reside on the server, execution can be faster (the database already optimized the query plan) and network traffic is reduced (you send just a call instead of potentially large SQL text or multiple calls).
  • They help enforce business logic at the database level. For example, a “TransferFunds” procedure might ensure money is deducted from one account and added to another in one transaction.

Example (SQL Server T-SQL syntax):

CREATE PROCEDURE GiveRaise @EmpID INT, @Percent DECIMAL(5,2)
AS
BEGIN
   UPDATE Employees
   SET Salary = Salary * (1 + @Percent/100.0)
   WHERE EmployeeID = @EmpID;
END;

This stored procedure GiveRaise can be executed with an ID and percentage to give a raise to a particular employee.

33. What is a trigger in SQL?

A trigger is a special kind of stored program that automatically executes (or “fires”) in response to certain events on a table or view. Those events can be INSERT, UPDATE, or DELETE operations. Triggers can be used to enforce complex business rules, maintain audit logs, derive column values, etc.

For example, a trigger can be set on the Orders table so that whenever a new order is inserted, the trigger fires and automatically updates the inventory in the Products table.

Key aspects of triggers:

  • Triggers can be defined to fire BEFORE or AFTER the event. For instance, a BEFORE INSERT trigger might modify the values being inserted (or validate them), while an AFTER INSERT trigger might insert into an audit table.
  • In many SQL dialects, within a trigger, you have access to pseudo-tables like INSERTED (new values) and DELETED (old values) for the affected rows.
  • Triggers execute as part of the transaction of the triggering SQL statement, so if they cause an error or rollback, the entire transaction (including the original statement) can be rolled back.

Example (MySQL syntax for an AFTER INSERT trigger):

CREATE TRIGGER after_order_insert
AFTER INSERT ON Orders
FOR EACH ROW
BEGIN
   UPDATE Products
   SET Stock = Stock - NEW.Quantity
   WHERE ProductID = NEW.ProductID;
END;

This trigger adjusts the stock after a new order is inserted by subtracting the ordered quantity. (Note: ensure the logic handles cases like stock going negative, etc.)

While powerful, triggers should be used cautiously as they can make debugging harder (hidden side-effects) and potentially impact performance if doing heavy work for each row.

34. What is an alias in SQL and why is it used?

An alias is an alternate name given to a table or column in a SQL query, usually for the duration of that query. You create an alias using the AS keyword (although AS is optional for table aliases in many databases). Aliases are essentially pointers or nicknames; they don’t change the actual names in the database, but they make queries more readable or the SQL easier to write.

Uses of aliases:

  • Column alias: To rename a column in the result set or to make computed columns readable.
  • Table alias: To shorten table names in complex queries, especially when performing self joins or multiple joins.

Example of a column alias:

SELECT FirstName || ' ' || LastName AS FullName
FROM Customers;

Here, FullName is an alias for the concatenated result of FirstName and LastName.

Example of a table alias (joining a table to itself):

SELECT e.Name AS Employee, m.Name AS Manager
FROM Employees AS e
JOIN Employees AS m
  ON e.ManagerID = m.EmployeeID;

In this self-join, we alias the Employees table as e and m to refer to it twice in the query – one representing the employee, another representing the manager. Using aliases simplifies references (we can write e.Name instead of Employees.Name and clearly distinguish e vs. m).

Aliases improve readability and are required whenever you have ambiguous references (like two tables with the same column names in a join). They only exist for the query’s scope.

35. What is a subquery in SQL?

A subquery is a query nested inside another SQL query. It’s also known as an inner query or nested query. Subqueries are used to perform an additional lookup or computation needed for the main (outer) query. They can appear in SELECT, FROM, or WHERE clauses (and others like HAVING) depending on how you use them:

  • Subquery in SELECT: e.g., computing a value for each row.
  • Subquery in FROM (a derived table): treating the subquery’s result set as a table to join or select from.
  • Subquery in WHERE: often used to filter results based on some condition involving another table.

For example, a common subquery usage:

SELECT Name
FROM Employees
WHERE Salary > (SELECT AVG(Salary) FROM Employees);

This finds all employees whose salary is greater than the average salary of all employees. The inner query (SELECT AVG(Salary) FROM Employees) computes a single value that the outer query uses for comparison.

Subqueries can return:

  • A single scalar value (like the example above).
  • A row or set of rows (to be used with IN, EXISTS, or as a derived table).

It’s important to use subqueries appropriately for clarity. In some cases, a JOIN can be used instead of a subquery for the same result, which might be more efficient. However, subqueries are very handy for filtering based on aggregated data or when the logic is easier to express in nested form.

36. What is the difference between UNION and UNION ALL?

Both UNION and UNION ALL are used to combine the results of two (or more) SELECT queries. The difference lies in how they handle duplicates:

  • UNION: Combines the result sets and eliminates duplicate rows, so only unique rows appear in the final result (the database may sort or hash the combined rows to detect duplicates).
  • UNION ALL: Combines the result sets including duplicates. It does not remove duplicates, so if the same row appears in both queries, it will appear twice in the result.

Because UNION does the extra work of removing duplicates (which may involve sorting or hashing the results), it can be slightly slower than UNION ALL, especially on large data sets.

Example:

SELECT City FROM Customers
UNION
SELECT City FROM Suppliers;

This gives the distinct list of cities in which either customers or suppliers are located.

If we used UNION ALL instead, and if some cities had both customers and suppliers, those city names would appear twice (or more) in the combined list.

In summary: use UNION when you need a set of distinct records. Use UNION ALL when you want to preserve all occurrences (or when you are certain the two result sets have no overlap, and you want to avoid the overhead of duplicate elimination).

37. What is the difference between clustered and non-clustered indexes?

This concept is specific to some databases (particularly SQL Server and other systems that differentiate index types). In general:

  • A clustered index determines the physical order of data in a table. A table can have at most one clustered index (because you can’t order the data in more than one way at the same time). The clustered index’s key values are stored in sorted order on disk, and the table’s rows are stored in that same order. Essentially, the table is the clustered index (sometimes called an index-organized table in Oracle). Retrieving data via the clustered index is fast because the data is already sorted and adjacent. By default, a primary key often creates a clustered index.
  • A non-clustered index is a separate structure that references the table’s data. It’s like a sorted list of key values with pointers (row identifiers or primary key values) to the actual table rows. A table can have multiple non-clustered indexes (on different columns or combinations). Non-clustered indexes do not affect the physical order of the data. Think of it like an index in a book: it’s separate from the main content, but helps you find topics quickly.

Analogy: Clustered index = phone book organized by last name (the book itself is sorted by last name). Non-clustered index = an index at the end of a textbook that points to pages.

Performance implications:

  • A clustered index is efficient for range queries and sorting on the indexed column(s) because the data is sequential. However, if you insert data out of order, the database may need to rearrange (potential page splits), which can have a cost.
  • A non-clustered index is great for lookups on the indexed column, but retrieving the full row may require an extra step (a lookup using the pointer) unless the index “covers” the query (contains all needed columns). Non-clustered indexes are additional structures, so they take extra space and maintenance on writes.
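In SQL Server syntax, the two index types might be created like this (table and column names are hypothetical):

-- Physically orders the table's rows by EmployeeID (only one allowed per table)
CREATE CLUSTERED INDEX IX_Employees_ID
ON Employees (EmployeeID);

-- A separate lookup structure pointing back to the rows (many allowed)
CREATE NONCLUSTERED INDEX IX_Employees_LastName
ON Employees (LastName);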

38. What is a transaction in SQL, and what are COMMIT and ROLLBACK?

A transaction in SQL is a sequence of one or more SQL operations treated as a single logical unit of work. A transaction has the ACID properties (Atomicity, Consistency, Isolation, Durability) – meaning all operations succeed or fail as one, maintain database consistency, are isolated from other transactions, and once committed, changes are permanent.

Key points:

  • BEGIN (or START) TRANSACTION – marks the start of a transaction (some systems implicitly begin a transaction for each statement unless auto-commit is on).
  • COMMIT – saves all changes made in the transaction to the database permanently. Until a transaction is committed, none of its changes are visible to other transactions (under the appropriate isolation level) and are not guaranteed to persist (for example, if the database crashes).
  • ROLLBACK – undoes all changes made in the current transaction, reverting the database to the state before the transaction began. This is used to abort a transaction if something goes wrong (like an error occurs or a condition fails).

For example, consider transferring money between two bank accounts:

BEGIN TRANSACTION;

UPDATE Accounts SET Balance = Balance - 500 WHERE AccountID = 1;
UPDATE Accounts SET Balance = Balance + 500 WHERE AccountID = 2;

IF (SELECT Balance FROM Accounts WHERE AccountID = 1) < 0
BEGIN
    ROLLBACK;
    PRINT 'Transfer failed, insufficient funds.';
END
ELSE
BEGIN
    COMMIT;
    PRINT 'Transfer successful.';
END

In this example (T-SQL syntax), we deduct and add funds as one transaction. We then check a condition (no overdraft allowed); if it is violated, we roll back (so neither account is updated). If all is well, we commit, making both updates permanent.

39. How do you optimize the performance of a SQL query?

Optimizing SQL queries can involve many strategies. Here are some key techniques and considerations for query optimization:

  • Indexes: Ensure appropriate indexes exist on columns used in WHERE clauses, join conditions (ON clauses), and in ORDER BY or GROUP BY if sorting/grouping large data sets. Indexes can drastically speed up lookups. However, avoid redundant or overly many indexes as they slow down writes.
  • Avoid SELECT *: Only retrieve the columns you need (SELECT specific_columns). This reduces I/O and network transfer, especially if the table has wide rows or if you only need a few fields.
  • Joins vs Subqueries: Sometimes an equivalent operation can be written with either a join or a subquery. Generally, JOINs are preferable for set operations, and the database optimizer often handles them well. Use subqueries carefully, and consider the execution plan – for example, an IN (subquery) might be optimized as a join by the DB, but in some cases an EXISTS might perform better than an IN for subqueries.
  • WHERE clause filtering: Write selective WHERE clauses to filter data early. Ensure calculations in WHERE clauses are minimized – for instance, move arithmetic to the other side of a comparison to allow index use (e.g., use WHERE date >= '2025-01-01' instead of WHERE YEAR(date) = 2025 which might disable index usage).
  • Avoiding Wildcards at Start of LIKE: If you use LIKE '%something%' the leading wildcard prevents use of an index. If possible, avoid leading % so that an index on that column can be used.
  • Query structure: Break complex queries into simpler parts or use temporary tables / common table expressions (CTEs) if it helps the optimizer or improves clarity. However, be mindful that too many temp results can also be costly.
  • Use EXPLAIN plan: Most databases offer an EXPLAIN or EXPLAIN PLAN command to see how the query will execute (index usage, join algorithms, etc.). Analyze the plan for bottlenecks – e.g., a full table scan on a huge table might indicate the need for an index.
  • Limit the result set: If you only need the first N rows, use LIMIT (MySQL/Postgres) or TOP N (SQL Server) to avoid scanning unnecessary rows.
  • Denormalization or caching: In some cases, especially for read-heavy workloads, consider caching results of expensive queries, or denormalizing certain data to reduce the number of joins (at the cost of update complexity).
  • Avoid unnecessary ORDER BY or DISTINCT: Sorting (ORDER BY) and eliminating duplicates (DISTINCT) are resource-intensive. Use them only when needed. If you know your data is inherently unique or you don’t care about order, skip these.
  • Batch operations: For many insert/update/delete operations, doing them in batches (or via a single SQL statement with proper WHERE) is faster than row-by-row processing, because it reduces overhead and allows the DB to optimize the set operation.
  • Proper use of transactions: Group related operations in one transaction to reduce commit overhead, but also avoid keeping transactions open too long, which can hold locks and affect concurrency.

In essence, query optimization is about making your queries do the least amount of necessary work: retrieve less data, touch fewer rows, use indexes to avoid full scans, and leverage the database’s strengths (set-based operations) over procedural row-by-row processing.
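Two of these ideas in a hedged sketch (MySQL-style syntax; Orders and OrderDate are hypothetical names, with OrderDate assumed to be indexed):

-- Non-sargable: the function on the column can prevent index use
SELECT OrderID FROM Orders WHERE YEAR(OrderDate) = 2025;

-- Sargable rewrite: a range condition lets the index on OrderDate be used
SELECT OrderID FROM Orders
WHERE OrderDate >= '2025-01-01' AND OrderDate < '2026-01-01';

-- Inspect the chosen execution plan
EXPLAIN SELECT OrderID FROM Orders
WHERE OrderDate >= '2025-01-01' AND OrderDate < '2026-01-01';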

40. What is a Common Table Expression (CTE) in SQL?

A Common Table Expression (CTE) is a temporary named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It is defined using a WITH clause and lasts only for the duration of the query. CTEs improve query readability and organization, especially for complex queries, recursive queries, or when you need to reuse a subquery result.

Basic syntax:

WITH CTE_Name (optional_column_list) AS (
    -- inner query
    SELECT ... 
)
SELECT *
FROM CTE_Name
WHERE ...;

For example:

WITH HighEarners AS (
    SELECT EmployeeID, Name, Salary
    FROM Employees
    WHERE Salary > 100000
)
SELECT COUNT(*) 
FROM HighEarners;

Here, HighEarners is a CTE that first selects employees with salary > 100k. The main query then uses that CTE to count how many such employees exist. This could also be done with a subquery, but the CTE makes it a bit more readable by giving it a name.

CTEs are particularly useful for:

  • Breaking down complex queries: You can build the query step by step.
  • Recursive queries: CTEs can be self-referencing (recursive CTEs) to work with hierarchical data (for example, an organizational chart). Recursive CTEs repeatedly execute and accumulate results until a termination condition is met (see the sketch at the end of this answer).
  • Reusing a subquery result: If you need to use the same derived result multiple times in a query, you can select from the CTE multiple times instead of duplicating a subquery, which may help the optimizer or at least simplify the SQL writing.

Remember, a CTE does not persist or index data; it’s just a way to structure the query. In terms of performance, a non-recursive CTE often doesn’t differ from a subquery or view – it’s primarily for readability and maintainability.
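As a sketch of the recursive case, this walks an org chart downward from the employees who have no manager (PostgreSQL/MySQL write WITH RECURSIVE; SQL Server omits the RECURSIVE keyword):

WITH RECURSIVE OrgChart AS (
    -- anchor member: the top of the hierarchy
    SELECT EmployeeID, Name, ManagerID, 1 AS Level
    FROM Employees
    WHERE ManagerID IS NULL

    UNION ALL

    -- recursive member: employees whose manager is already in OrgChart
    SELECT e.EmployeeID, e.Name, e.ManagerID, o.Level + 1
    FROM Employees e
    JOIN OrgChart o ON e.ManagerID = o.EmployeeID
)
SELECT * FROM OrgChart;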

41. What is the difference between a stored procedure and a function in SQL?

Both stored procedures and functions are programmable routines stored in the database, but there are differences in their usage and capabilities (the specifics can vary by SQL dialect, but generally):

  • Return Values: A function (often called User-Defined Function or UDF) must return a value (either a scalar value or a table result, depending on the type). A stored procedure does not have to return a value; it can just execute code. In some databases (like SQL Server), a stored procedure can only return integers via RETURN (as a status code) but can also return result sets and output parameters, whereas a function returns a single value or table.
  • Usage in SQL: Functions can be used inline in a SQL statement (e.g., SELECT dbo.GetOrderTotal(1)) and can be part of an expression. Stored procedures, on the other hand, cannot be called directly in a SELECT or used in expressions. You execute a procedure using EXEC or the CALL syntax, separately from a SELECT.
  • Side Effects: Functions are generally expected to be deterministic (given same input, produce same output) and have no side effects (don’t change state). Many SQL engines enforce that a function cannot modify database state (no INSERT/UPDATE/DELETE inside most functions). Stored procedures, however, are designed to perform operations and can change database state, call DDL, etc. In SQL Server, for example, a function cannot perform an INSERT/UPDATE on a table (except maybe table-valued functions building a result), whereas a procedure can.
  • Parameters: Both can take parameters. Stored procedures can have input and output parameters. Functions typically only have input parameters (except table-valued functions which sort of output a table).
  • Transaction Control: In some systems, functions cannot start or commit transactions (they run within an existing context if any), whereas stored procedures can manage transactions.

Use a function when you need to compute and return a value (or table) that can be used inline in SQL (like in a select or join). Use a stored procedure when you need to perform an operation (or a series of operations) possibly affecting data or when you need to return multiple pieces of data via output parameters or result sets, and you’ll call it on its own.

For example, you might have a function CalculateTax(amount) that returns a numeric tax value and can be used in a SELECT query. Conversely, you might have a stored procedure ApplyYearEndUpdates() that updates many tables and doesn’t return a value, just prints or logs what it did.
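The CalculateTax idea might look like this in T-SQL (a sketch; the 8% rate is an arbitrary assumption):

CREATE FUNCTION dbo.CalculateTax (@Amount DECIMAL(10,2))
RETURNS DECIMAL(10,2)
AS
BEGIN
    RETURN @Amount * 0.08;   -- hypothetical flat 8% tax
END;

-- Unlike a stored procedure, it can be used inline in a query:
SELECT OrderID, dbo.CalculateTax(TotalAmount) AS Tax
FROM Orders;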

42. How can you implement conditional logic (if-else) in a SQL query?

In SQL, conditional logic can be implemented using the CASE expression (in the context of queries) or using control-of-flow statements like IF/ELSE inside stored procedures and scripts.

Within a query (a SELECT list, WHERE, ORDER BY, etc.), use a CASE expression. It works like a simplified if-else to output a value based on conditions:

SELECT 
    EmployeeID,
    Salary,
    CASE 
       WHEN Salary >= 100000 THEN 'High Salary'
       WHEN Salary >= 50000  THEN 'Medium Salary'
       ELSE 'Low Salary'
    END AS SalaryCategory
FROM Employees;

This will categorize each employee’s salary into High/Medium/Low based on the conditions. CASE goes through each WHEN condition in order and returns the corresponding THEN value for the first true condition (or the ELSE if none are true). The CASE expression returns a single value and thus can be used as part of a SELECT list, in a WHERE clause, ORDER BY, etc.

If we are talking about procedural logic (for example, in a stored procedure or a PL/pgSQL function), you have programming constructs:

IF condition 
BEGIN
   -- SQL statements
END
ELSE 
BEGIN
   -- alternate SQL statements
END

This is T-SQL (SQL Server) style. Other RDBMS have similar constructs (PL/SQL in Oracle, PL/pgSQL in PostgreSQL, etc.). These are used in the body of procedural code, not in a plain SELECT.

So, for typical query usage, remember that CASE is your tool for inline conditional logic. It’s very powerful: you can use it to implement conditional aggregation, conditional sorting (e.g., custom order), pivoting logic, etc.

43. Write a SQL query to find the second highest salary from a table.

Finding the second highest (or Nth highest) salary can be done in a few ways. One common approach is to use a subquery to find the highest and then find the next one below it:

Method 1: Subquery approach (max of less than max):

SELECT MAX(Salary) AS SecondHighestSalary
FROM Employees
WHERE Salary < (SELECT MAX(Salary) FROM Employees);

Explanation: The inner subquery finds the maximum salary overall. The outer query finds the maximum salary that is less than that overall max, which equates to the second highest.

Method 2: Order and limit (if the SQL dialect supports it, like MySQL/Postgres):

SELECT Salary
FROM Employees
ORDER BY Salary DESC
LIMIT 1 OFFSET 1;

This sorts salaries in descending order and skips the first one (OFFSET 1 skips the top salary) then returns the next (LIMIT 1 gives one result). The result is the second highest salary. (In SQL Server, you’d use ORDER BY Salary DESC OFFSET 1 ROW FETCH NEXT 1 ROW ONLY, or nest TOP in a subquery.)

Method 3: Self-join or dense ranking method:

SELECT MIN(e1.Salary) AS SecondHighestSalary
FROM Employees e1
JOIN Employees e2 
  ON e1.Salary < e2.Salary
GROUP BY e1.Salary
HAVING COUNT(DISTINCT e2.Salary) = 1;

This is more complex: it joins the table to itself where e1’s salary is less than e2’s salary, and then groups by e1.Salary counting how many distinct salaries are higher. The HAVING ensures we only get the salary that has exactly one distinct salary higher than it (meaning it’s the 2nd highest). Then we select the minimum of those (which is the second highest).

Usually, the first method (MAX subquery) is straightforward and works in standard SQL. If there could be ties and you specifically want the second distinct highest, the above still works because < MAX(salary) inherently targets distinct highest.

44. Write a SQL query to find duplicate records in a table.

To find duplicates, you need to identify the values that appear more than once. Assuming by “duplicate records” we mean rows where some key or combination of columns repeats, the general approach is:

Method 1: Using GROUP BY and HAVING:

SELECT Column1, Column2, ..., ColumnN, COUNT(*) as DupCount
FROM YourTable
GROUP BY Column1, Column2, ..., ColumnN
HAVING COUNT(*) > 1;

Replace Column1...ColumnN with the columns that define a "record" for which duplicates are to be checked. If the entire row should be identical to count as duplicate, use all columns in the GROUP BY. The query will list each set of column values that appear more than once, along with how many times they appear (DupCount).

For example, if we have a table Employees(Name, Department, Age) and we want to find duplicate Name+Department pairs (perhaps the same person listed in the same department multiple times):

SELECT Name, Department, COUNT(*) as DupCount
FROM Employees
GROUP BY Name, Department
HAVING COUNT(*) > 1;

This would output something like ("Alice", "HR", 2) if Alice in HR appears twice.

Method 2: Self-Join approach:
This is less efficient but conceptually straightforward:

SELECT t1.*
FROM YourTable t1
JOIN YourTable t2
  ON t1.Column1 = t2.Column1 
 AND t1.Column2 = t2.Column2 
 ... (all columns matching)
 AND t1.PrimaryKeyColumn <> t2.PrimaryKeyColumn;

This joins the table with itself on all columns that define a duplicate, ensuring the primary key (or unique identifier) is different. This will give you pairs of duplicates. Typically, the GROUP BY method is preferred.

Once you have identified duplicates via one of the above, you might refine how you display them (maybe just the duplicated key values, or maybe full rows using a JOIN back to the main table).

45. Write a SQL query to count the number of employees in each department.

To count employees per department, you would use a GROUP BY on the Department column and an aggregate COUNT on the Employees.

Assuming a table Employees with at least columns EmployeeID and Department (or maybe DeptID), the query is:

SELECT Department, COUNT(*) AS NumEmployees
FROM Employees
GROUP BY Department;

This will produce output like:

Department    NumEmployees
-----------   ------------
Sales         10
HR            5
Engineering   20

Each row of the result gives a department name and the number of employees in that department.

A few notes:

  • We used COUNT(*) which counts rows. It’s common to use the primary key or some not-null column in COUNT, like COUNT(EmployeeID), which should give the same result as COUNT(*) as long as EmployeeID is never NULL (primary keys aren’t NULL).
  • If there were any departments with zero employees, they would not appear in the result at all, because the query only looks at departments that exist in the Employees table. (If you had a separate Departments table and wanted to include departments with zero employees, you’d do an OUTER JOIN with that table – see the sketch after these notes.)
  • The result can be sorted if needed: e.g., add ORDER BY NumEmployees DESC to list departments by employee count.
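A sketch of that outer-join variant (assuming a Departments(DeptID, DeptName) table and a DeptID column on Employees):

SELECT D.DeptName, COUNT(E.EmployeeID) AS NumEmployees
FROM Departments D
LEFT JOIN Employees E
  ON E.DeptID = D.DeptID
GROUP BY D.DeptName;

Note that COUNT(E.EmployeeID) counts only matched employee rows, so an empty department shows 0; COUNT(*) would incorrectly report 1, because the department row itself survives the LEFT JOIN.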

46. Write a SQL query to display employee names along with their department names (assuming two tables: Employees and Departments).

We have two tables:

  • Employees (with columns like EmployeeID, Name, DeptID, etc.)
  • Departments (with columns DeptID, DeptName, etc.)

We want to join these so we can show each employee’s name alongside the department name they belong to.

The SQL query would be:

SELECT E.Name AS EmployeeName, D.DeptName AS DepartmentName
FROM Employees E
JOIN Departments D
  ON E.DeptID = D.DeptID;
Explanation:
  • We are using a JOIN (specifically an INNER JOIN) between Employees (aliased as E) and Departments (aliased as D).
  • The join condition is E.DeptID = D.DeptID, which links each employee to the department with the same DeptID.
  • We select the employee’s Name and the department’s DeptName, and optionally alias them for clarity in output.

If we also wanted to include employees who have no department assigned (maybe DeptID is NULL or doesn’t match), we could use a LEFT JOIN instead of the inner join. With a LEFT JOIN, employees without a valid department still show up, with DepartmentName as NULL.
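A minimal sketch of that variant:

SELECT E.Name AS EmployeeName, D.DeptName AS DepartmentName
FROM Employees E
LEFT JOIN Departments D
  ON E.DeptID = D.DeptID;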

But as the question is stated, a straightforward inner join is likely expected.

47. Write a SQL query to find all employees that do not have a matching department (i.e., employees whose department does not exist in the Departments table).

This suggests we want employees with an invalid DeptID (no corresponding DeptID in Departments). We can achieve this using an OUTER JOIN or a NOT EXISTS/NOT IN subquery.

Method 1: LEFT JOIN and check for NULL:

SELECT E.EmployeeID, E.Name, E.DeptID
FROM Employees E
LEFT JOIN Departments D
  ON E.DeptID = D.DeptID
WHERE D.DeptID IS NULL;
Here we left-join departments. For employees whose DeptID doesn’t find a match in Departments, the D.DeptID (and other D columns) will be NULL. The WHERE filters those, meaning those employees did not have a matching department.

Method 2: NOT EXISTS subquery:

SELECT EmployeeID, Name, DeptID
FROM Employees E
WHERE NOT EXISTS (
    SELECT 1 
    FROM Departments D 
    WHERE D.DeptID = E.DeptID
);
This returns employees for which no department with the same DeptID exists.

Method 3: NOT IN subquery:

SELECT EmployeeID, Name, DeptID
FROM Employees
WHERE DeptID NOT IN (SELECT DeptID FROM Departments);
This assumes the subquery never returns NULL: if Departments.DeptID can be NULL, value NOT IN (...) evaluates to UNKNOWN for every row, and the query returns no rows at all. That is why NOT EXISTS is generally the safer form.

The result will list employees (likely by their ID or name, etc.) who are referencing a department that isn’t in the Departments table. Essentially, these are “orphan” records in referential integrity terms.

48. Write a SQL query to find all employees with a salary greater than the average salary.

To do this, we need to compare each employee’s salary to the average salary of all employees. We can calculate the average salary using a subquery or a CTE, and then filter those above that.

Using a subquery in the WHERE clause:

SELECT Name, Salary
FROM Employees
WHERE Salary > (
    SELECT AVG(Salary) 
    FROM Employees
);
This works as follows: the subquery (SELECT AVG(Salary) FROM Employees) computes the average salary across all employees. The outer query then selects those employees whose Salary is greater than that average.

Each row is compared against the same single value (the overall average). The DBMS will likely compute the subquery once since it’s not correlated to the outer query (some may even optimize it as a join or compute during planning).

Alternatively, using a CTE for clarity:

WITH AvgSal AS (
    SELECT AVG(Salary) AS AvgSalary
    FROM Employees
)
SELECT E.Name, E.Salary
FROM Employees E, AvgSal
WHERE E.Salary > AvgSal.AvgSalary;
This achieves the same result but may be more readable in some contexts.

The output will be the list of employees who earn above the overall average salary. This is a classic query that demonstrates use of subqueries.

49. How is a JOIN different from a UNION in SQL?

JOIN and UNION are fundamentally different operations in SQL:

  • A JOIN combines columns from two (or more) tables based on a related key between them, thereby creating wider rows. It is a horizontal combination of data. For each pair of rows that meet the join condition (or all pairs in case of cross join), the result includes a single row that contains columns from both tables.
  • A UNION combines the results of two (or more) SELECT queries vertically, stacking rows on top of each other. It requires that the two queries have the same number of columns with compatible data types, and it produces a result set that is the concatenation of the two. So, if Table A and Table B have the same structure, UNION will produce a result set with as many rows as in A plus B (minus duplicates if using UNION without ALL).

In other words:

  • JOIN = adding more columns (from multiple tables into one result row).
  • UNION = adding more rows (from multiple query results into one result set).

Example difference:
Suppose you have a Customers table and an Orders table.

  • A JOIN (Customers JOIN Orders) might produce a row per order, with customer info attached to each. You get columns like CustomerName, OrderID, OrderDate, etc. in each row.
  • A UNION might be used if you had two tables with similar info (maybe 2024_Orders and 2025_Orders tables, for instance) and you UNION them to get a combined list of orders for both years.

Also:

  • UNION inherently removes duplicates (unless UNION ALL is used), whereas a join’s result duplicates depend on the data relationships (it can duplicate data if one side has multiple matches).
  • You typically use JOIN when you need information from multiple tables in one result (through relationships). Use UNION when you have two similar datasets you want to view together (e.g., combining query results from two periods or two categories).

To summarize succinctly: JOIN merges data across tables by matching rows, expanding columns; UNION appends data from tables by matching columns, expanding rows.

50. What is the difference between the IN and EXISTS clauses in SQL?

Both IN and EXISTS are used in WHERE clauses to filter based on a subquery, but they work in slightly different ways:

  • IN: WHERE column IN (subquery) – The subquery returns a list of values, and the outer query checks if the column’s value is equal to any of those. Essentially, the subquery produces a result set (often a single column), and the outer query’s column is compared against each value in that set. It’s like a set membership check. Example: SELECT * FROM Employees WHERE DeptID IN (SELECT DeptID FROM Departments WHERE Location = 'NY'); This finds employees whose department is among departments located in NY. The subquery returns a list of DeptIDs, and the outer query checks membership.
  • EXISTS: WHERE EXISTS (subquery) – The subquery is typically correlated (refers to the outer query’s table). EXISTS checks if the subquery returns at least one row. It doesn’t care about the specific values returned, just whether the subquery result is non-empty for a given outer row. Example (same logic as above using EXISTS):

SELECT *
FROM Employees E
WHERE EXISTS (
    SELECT 1
    FROM Departments D
    WHERE D.DeptID = E.DeptID
      AND D.Location = 'NY'
);

    For each employee E, the subquery checks if there exists at least one Department D such that D.DeptID = E.DeptID and D.Location = 'NY'. If yes, that employee qualifies.

Differences/Considerations:

  • Performance: EXISTS can be more efficient in some databases for correlated subqueries, especially if the subquery can stop searching on the first found match. IN typically evaluates the entire subquery and might internally sort or hash it (especially if it’s a large list). However, modern SQL optimizers often transform IN and EXISTS into similar plans if possible. For very large subquery results, EXISTS (with a join condition) can be more memory-efficient.
  • NULL semantics: If the subquery of an IN yields even a single NULL, and the outer value is not equal to any other non-null values, the result of value IN (list_with_null) becomes UNKNOWN (thus false in WHERE). For example, 5 IN (1,2,NULL) is neither true nor false (which effectively means false for the WHERE filter). EXISTS doesn’t directly involve matching values, so it’s not affected by null in that way.
  • Usage: IN is succinct for comparing a value against a list of literal values or an easy subquery. EXISTS is more flexible with correlated conditions. Some queries are easier to express with one or the other.

Often you can rewrite an IN to a join or an EXISTS and vice versa. For not-in, using NOT EXISTS is usually safer than NOT IN if the subquery could produce a NULL.

51. What are the various normal forms in database design (1NF, 2NF, 3NF)?

The process of normalization involves several normal forms, each with a specific rule that helps eliminate redundancy and dependency issues. The main normal forms are:

  • First Normal Form (1NF):
    • Each column must contain only atomic (indivisible) values (no repeating groups or arrays in a single column).
    • Each row is unique (there’s a primary key to identify each row).
    • Essentially, 1NF eliminates multi-valued attributes. For example, you shouldn’t have a column that holds multiple phone numbers; instead, have a separate row for each phone number or a separate table.
  • Second Normal Form (2NF):
    • Meets all requirements of 1NF.
    • Every non-key column must be fully functionally dependent on the entire primary key. This applies mostly to tables with composite primary keys.
    • In other words, no partial dependency: a column should not depend on part of a composite key. If it does, that column belongs in another table.
    • Example: If you have a composite primary key (A, B) and a column C that depends only on A (not on B), then 2NF is violated. You’d move C to a table where A is the primary key.
  • Third Normal Form (3NF):
    • Meets all requirements of 2NF.
    • No transitive dependencies: Non-key columns should not depend on other non-key columns.
    • This means every non-key attribute should depend only on the primary key, not on any other non-key attribute.
    • Example: If you have a table Employees(EmployeeID PK, DeptID, DeptName), here DeptName is dependent on DeptID (which is not the primary key in this Employees table). This is a transitive dependency via DeptID. To be in 3NF, DeptName should be moved to a Departments table keyed by DeptID.
  • Boyce-Codd Normal Form (BCNF):
    • A stronger version of 3NF. It requires that for every functional dependency X -> Y, X should be a superkey. In simpler terms, every determinant must be a candidate key. BCNF fixes certain edge cases not handled by 3NF (usually when there are multiple candidate keys).
  • Fourth Normal Form (4NF):
    • Deals with multi-valued dependencies. 4NF says there should be no non-trivial multi-valued dependencies other than a candidate key. This basically means if two independent multi-valued facts exist in one table, split them into separate tables.
  • Fifth Normal Form (5NF or Project-Join NF):
    • Deals with cases where information can be reconstructed from smaller pieces but is stored in a larger table – it’s about decomposing tables to eliminate redundancy that is not covered by lower NFs. It’s less commonly discussed in typical interviews.

In practice, most database designs aim for 3NF (or BCNF) as a good balance of normalization. Normal forms beyond 3NF are more theoretical and less commonly encountered, except in specialized designs.

For example, summarizing 1NF, 2NF, 3NF:

  • 1NF: Eliminate repeating groups (make a separate table for each set of related data).
  • 2NF: Eliminate partial dependencies (no column should depend on part of a composite primary key).
  • 3NF: Eliminate transitive dependencies (no column should depend on non-key columns).

Normalization ensures each fact is stored in one place. Higher normal forms reduce redundancy but at the cost of potentially more joins in queries. Designers sometimes denormalize for performance after reaching 3NF if needed.
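As a concrete sketch of the 3NF fix from the DeptName example above (hypothetical DDL):

CREATE TABLE Departments (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(100) NOT NULL
);

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL,
    DeptID     INT REFERENCES Departments(DeptID)
);

DeptName now depends only on the key of Departments; Employees carries just the DeptID foreign key, so no transitive dependency remains.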

52. What is Boyce-Codd Normal Form (BCNF)?

Boyce-Codd Normal Form (BCNF) is an advanced version of the third normal form (3NF). A table is in BCNF if it is in 3NF and, additionally, for every functional dependency X -> Y (a determinant X functionally determines Y) in the table, X is a superkey of the table.

In simpler terms:

  • BCNF addresses certain anomalies not handled by 3NF, especially in situations where a table has two or more candidate keys.
  • Any dependency in the table should be on the whole key, and nothing but the key. If any attribute (or set of attributes) on the left side of a dependency is not a superkey, it violates BCNF.

BCNF example scenario:
Consider a table with columns (Student, Course, Professor) and the following dependencies:

  • Student and Course together determine Professor (Student + Course -> Professor). Each student has exactly one professor per course.
  • Professor determines Course (Professor -> Course). Each professor teaches only one course.

The candidate keys here are (Student, Course) and (Student, Professor). The table is in 3NF (Course is a prime attribute, since it is part of a candidate key), but Professor -> Course violates BCNF because Professor by itself is not a superkey.

To fix this, you’d split into two tables:

  • Table1(Professor, Course) – (Professor is the primary key and determines the course)
  • Table2(Student, Professor) – (records which professor each student has; the course follows via Table1)

Now each dependency’s left side is a key in its table.

To summarize: BCNF is slightly stricter than 3NF. Most well-designed 3NF tables are also in BCNF. But if a table contains a functional dependency whose determinant is not a candidate key (like Professor -> Course in the example), that table fails BCNF and should be decomposed.

BCNF ensures no redundancy due to functional dependencies. It eliminates all anomalies that 3NF might not handle. In practice, whenever possible, design tables to satisfy BCNF to ensure a robust schema.

53. What is the difference between database partitioning and sharding?

Both partitioning and sharding involve breaking up a database into pieces, but they usually refer to different contexts:

  • Partitioning typically means dividing a single database’s data into multiple segments (partitions) for manageability and performance, but usually on the same server or instance. Each partition is often based on some range or list of values of a key. Partitioning can be:
    • Horizontal partitioning: splitting rows (e.g., by range of date or ID). Each partition has the same table schema, but a subset of rows.
    • Vertical partitioning: splitting columns (rarely called partitioning per se, more like splitting a table’s columns into multiple tables).
      Partitioning is often done to improve query performance (by scanning only relevant partitions) or to make maintenance (like archiving old data) easier. Many RDBMS support table partitioning natively.
  • Sharding usually refers to distributing data across multiple servers or database instances. It’s a form of horizontal partitioning, but the emphasis is on distribution across different physical nodes. Each shard is effectively an independent database (with its own hardware/resources) holding a subset of the data. The application (or a middleware) directs queries to the appropriate shard based on some shard key (like user ID, geographic region, etc.). Sharding is often used for scaling out when the data volume or load is too high for a single server. Each shard is like a partition, but managed by a different database server.

Key differences:

  • Partitioning (in classic sense) is often within one database system (one server/instance), whereas sharding spreads across multiple systems.
  • Partitioning is usually transparent to the application (the DB engine handles it internally). Sharding often requires application awareness (or a special proxy) to route queries to the correct shard.
  • Partitioning primarily addresses performance and manageability in a single database. Sharding addresses scalability and high volume by adding more machines.
  • In partitioning, joins across partitions are still on one DB and possible (though might be slower). In sharding, joins across shards are complex because they involve multiple databases (the application might have to handle that logic or avoid cross-shard joins entirely).

Example:
Imagine a very large user table.

  • If partitioned by the first letter of username (on one server), you have partitions A-F, G-L, etc., all on one DB instance.
  • If sharded, you might have user A-F on server1, G-L on server2, etc. Each server hosts that subset of users.

Sharding is common in NoSQL and large scale web apps for user distribution. Partitioning is a feature in many SQL databases (e.g., partition tables by date to speed up queries on recent data).

54. Can you update or insert data through a SQL view? Explain.

Whether you can UPDATE or INSERT through a view depends on the SQL database and the nature of the view.

In general:

  • Simple views that map directly to a single table (and include all primary key columns) are often updatable. This means you can do UPDATE view_name ... or INSERT INTO view_name ..., and it will affect the underlying base table.
  • Complex views that involve multiple tables (joins), aggregations, DISTINCT, GROUP BY, or calculated columns are usually not updatable by default. The database wouldn’t know how to map an insert or update unambiguously to the underlying tables.

For example, in many systems:

  • A view defined as SELECT * FROM Employees WHERE Department='Sales' – you can update through this view (for allowed columns). If you insert through it, the new row goes into Employees; if the view is created WITH CHECK OPTION, the database additionally enforces that inserted or updated rows satisfy the view’s WHERE clause (Department = 'Sales').
  • A view defined as a join of Employees and Departments to show employee names and department names – that is typically not updatable because an insertion or update could involve two tables. Some RDBMS might allow updates on such a view for columns that clearly belong to one underlying table (called updatable join view with certain constraints like one side being a key-preserved table), but not all.
  • A view with an aggregate like SELECT Department, COUNT(*) as EmpCount FROM Employees GROUP BY Department cannot be directly updated or inserted because it doesn’t represent individual rows of a base table.

Many databases have rules or guidelines:

  • In SQL Server and Oracle, a view is updatable only under specific rules – essentially, each row in the view must map to exactly one row in a single base table (for join views, Oracle requires the modified table to be “key-preserved”).
  • Some systems allow INSTEAD OF triggers on a view. This means you can define triggers to fire when an insert/update/delete is attempted on the view, and those triggers can perform the appropriate operations on the underlying tables, effectively making the view updatable via the trigger’s logic.
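For instance, an INSTEAD OF trigger could make a join view accept inserts (a minimal sketch in SQL Server syntax; the view, trigger, and column names are hypothetical):

CREATE TRIGGER trg_EmployeeDirectory_Insert
ON EmployeeDirectory  -- a view joining Employees and Departments
INSTEAD OF INSERT
AS
BEGIN
    -- Route the insert to the Employees base table, resolving the
    -- department name supplied through the view back to its DeptID.
    INSERT INTO Employees (Name, DeptID)
    SELECT i.Name, D.DeptID
    FROM inserted i
    JOIN Departments D ON D.DeptName = i.DepartmentName;
END;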

So, the answer:
Yes, you can update or insert through a view if the view is updatable. Typically, this requires the view to be directly based on a single table (no joins, no aggregates, no distinct, etc.), including key columns. If those conditions are met, modifications through the view get passed to the base table.

If the view is not inherently updatable (like multi-table or aggregate views), you cannot directly modify it, unless you use special mechanisms like instead-of triggers or updatable materialized views provided by the DBMS.

In summary, simple views (selecting certain columns/rows from one table) are usually updatable. Complex views (joins, etc.) are not directly updatable.

55. In what order are the clauses of an SQL SELECT statement executed?

SQL’s logical order of query processing (for SELECT statements) is not the same as the written order. The typical logical execution order is:

  1. FROM – Determine the source tables and perform any joins, producing a working result set. (Includes ON filter for joins at this stage.)
  2. WHERE – Apply row filters to the result from FROM (and JOIN). Rows that do not meet the conditions are removed.
  3. GROUP BY – Group the remaining rows by the specified columns (if grouping is used). This creates summary groups.
  4. HAVING – Apply filters to groups (works on the results of grouping/aggregation). Groups that don’t meet the condition are removed.
  5. SELECT – Compute any expressions, including aggregates, and choose the columns for the final output. (Note: SELECT is logically after HAVING, meaning you can only SELECT aggregate results that have been calculated. Also, aliases defined in SELECT can’t be used in earlier clauses like WHERE because SELECT comes later in logical order.)
  6. DISTINCT – (If used) Remove duplicate rows from the result set (it could be considered part of the SELECT stage in effect).
  7. ORDER BY – Sort the final results by specified column(s).
  8. LIMIT/OFFSET (or FETCH FIRST N in SQL:2008 standard) – If used, limit the number of rows returned (applied after ordering).

The written query might be:

SELECT ... 
FROM ... 
JOIN ... ON ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...;
But behind the scenes, the database processes in the order described above (FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY).

Understanding this order explains why:

  • You can’t use an alias defined in SELECT in the WHERE clause (because SELECT happens later) – see the sketch after this list.
  • You can’t use aggregate functions in WHERE (you must use HAVING) because aggregation occurs after WHERE filters rows.
  • ORDER BY can use SELECT aliases (since it’s after SELECT logically).
  • The WHERE filters individual rows before grouping, while HAVING filters groups after grouping.
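A minimal sketch of the alias rule (assuming an Employees table with Salary and Bonus columns):

-- Fails in most databases: TotalComp does not exist yet when WHERE runs
SELECT Salary + Bonus AS TotalComp
FROM Employees
WHERE TotalComp > 100000;

-- Works: repeat the expression in WHERE; the alias is fine in ORDER BY
SELECT Salary + Bonus AS TotalComp
FROM Employees
WHERE Salary + Bonus > 100000
ORDER BY TotalComp DESC;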

So, the execution order (in simpler terms):
FROM (including JOINs) -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.

Some add that JOIN is conceptually part of FROM; DISTINCT as part of SELECT processing. But the key sequence above is generally accepted for SQL query logical processing.

(Also note, SQL’s actual physical execution might differ due to optimization, but logically the results reflect that order.)

Advanced SQL Interview Questions (Experienced)

For senior or experienced candidates, SQL interview questions often delve into deeper knowledge of database systems, performance tuning, complex queries, and architectural decisions. This section covers advanced topics such as transactions (ACID properties), indexing strategies, query plans, security, and more. These questions test not just your ability to write SQL, but your understanding of how databases work and how to design for efficiency and correctness.

65. Explain the ACID properties of a transaction.

ACID is an acronym that stands for Atomicity, Consistency, Isolation, Durability, which are the key properties that guarantee reliable processing of database transactions:

  • Atomicity: A transaction is “all or nothing.” This means that either all operations in the transaction succeed, or none do. If any part of the transaction fails, the database state is left unchanged (as if the transaction never happened). The system ensures that partial transactions are not saved.
  • Consistency: A transaction should transition the database from one consistent state to another consistent state. This means that a transaction enforces all defined rules (constraints, triggers) of the database. If a transaction is completed (committed), it must obey integrity constraints (like foreign keys, unique constraints, etc.), preserving database consistency. If any inconsistency would occur, the transaction is aborted.
  • Isolation: Transactions operate as if they are the only transaction. The intermediate state of a transaction is invisible to other concurrent transactions. Isolation levels (serializable, repeatable read, read committed, read uncommitted) define the degree of isolation, controlling phenomena like dirty reads, non-repeatable reads, and phantoms. In the highest isolation (serializable), the outcome is as if transactions were executed sequentially, one after the other.
  • Durability: Once a transaction is committed, its changes are permanent, even in the event of a system crash or power failure. Durability is ensured by the DBMS using techniques like write-ahead logging, flushing transaction logs to disk, and backups. So after a commit, the client can trust the data is saved.

To illustrate: Suppose a banking transaction transfers $100 from Account A to Account B.

  • Atomicity: If debiting A succeeds but crediting B fails, the transaction will roll back, so A is not debited. Both actions happen or neither does.
  • Consistency: If the system had a rule that total money must remain constant, the transaction ensures after its completion the total in A + B remains the same, preserving that invariant.
  • Isolation: If two transfers are happening at once, each transaction sees the database as if the other isn’t concurrently modifying data. Partial updates from one won’t be seen by the other.
  • Durability: Once the transfer transaction commits, the updated balances will survive any crashes. The changes have been recorded to stable storage.
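In SQL terms, the transfer might be sketched like this (assuming a hypothetical Accounts(AccountID, Balance) table; BEGIN TRANSACTION is SQL Server/PostgreSQL-style syntax):

BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B';
COMMIT;
-- On any error, issue ROLLBACK instead, leaving both balances untouched.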

ACID properties are crucial for correctness in database systems, especially in financial or critical applications where transactions need to be reliably processed.

66. What are the different transaction isolation levels?

Transaction isolation levels control the degree to which transactions are isolated from one another, i.e., how/when the changes made by one transaction become visible to others. The SQL standard defines four isolation levels, primarily to allow tuning trade-offs between strictness of isolation and performance/throughput:

  • Read Uncommitted: The lowest level, where transactions may see uncommitted changes made by other transactions. This can lead to dirty reads (reading data that might be rolled back later). Generally, Read Uncommitted allows all phenomena: dirty reads, non-repeatable reads, and phantom reads.
  • Read Committed: A transaction will only see data that has been committed by other transactions. It prevents dirty reads. However, non-repeatable reads (if a row is modified by another transaction between two reads in your transaction) and phantom reads (new rows added by others between queries) can occur. This is a common default level in many databases (e.g., Oracle, SQL Server).
  • Repeatable Read: This level guarantees that if a transaction reads a row twice, it will see the same data each time (no other transaction can modify that row in between). It prevents dirty reads and non-repeatable reads. However, phantom reads can still occur (another transaction can insert new rows that match a WHERE clause of a query in your transaction, so a repeat of the same query might return additional rows). Many systems (like MySQL InnoDB) use Repeatable Read as default, and they handle phantom reads via next-key locking or MVCC.
  • Serializable: The highest isolation level. Transactions are completely isolated as if they were executed serially (one after the other). This level prevents dirty reads, non-repeatable reads, and phantom reads. It ensures full consistency and isolation – essentially it is as if transactions execute one at a time. This can be achieved by locking or multi-versioning. Serializable is the safest but potentially the slowest due to reduced concurrency (it can cause more blocking or aborts to maintain serial equivalence).

To summarize the phenomena:

  • Dirty Read: Reading data written by a transaction that hasn’t committed yet (and might roll back).
  • Non-Repeatable Read: Getting different values for the same row in two reads within a transaction because another committed transaction modified that row in the meantime.
  • Phantom Read: Getting a different set of rows in two executions of the same query within a transaction because another transaction committed inserting or deleting some of those rows.

Isolation levels in practice might be implemented using locks or MVCC (Multi-Version Concurrency Control). Some databases also offer a level called Snapshot isolation (similar to repeatable read with MVCC, preventing phantoms in some implementations).

So, the four standard isolation levels (from weakest to strongest) are: Read Uncommitted, Read Committed, Repeatable Read, Serializable.
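Choosing a level is typically one statement before the transaction starts (a sketch in SQL Server syntax; MySQL and PostgreSQL differ slightly, e.g., START TRANSACTION, and the Accounts table is hypothetical):

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT Balance FROM Accounts WHERE AccountID = 1;
-- ... other work; rereading the same row now returns the same value ...
SELECT Balance FROM Accounts WHERE AccountID = 1;
COMMIT;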

67. What is a deadlock in SQL, and how can it be resolved?

A deadlock in SQL (or any database system with locking) is a situation where two or more transactions are each waiting for the other to release a lock, and thus none can proceed. It’s a cyclic waiting scenario.

For example, Transaction A locks row 1 and needs row 2, while Transaction B locks row 2 and needs row 1. A is waiting for B’s lock and B is waiting for A’s lock – they’re deadlocked.
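Sketched as two interleaved sessions (hypothetical Accounts table; comments mark the order of events):

-- Session 1:
BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 10 WHERE AccountID = 1;  -- locks row 1
-- Session 2:
BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 10 WHERE AccountID = 2;  -- locks row 2
-- Session 1 now blocks, waiting for row 2:
UPDATE Accounts SET Balance = Balance + 10 WHERE AccountID = 2;
-- Session 2 now blocks, waiting for row 1 -> deadlock detected:
UPDATE Accounts SET Balance = Balance + 10 WHERE AccountID = 1;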

When a deadlock occurs:

  • The database management system will typically detect it (most have a deadlock detector that runs periodically or upon certain wait conditions).
  • To resolve a deadlock, the DBMS will usually choose one of the transactions as a “victim” to roll back, freeing its locks so that the other transaction(s) can proceed. The chosen victim transaction is aborted (often the one that can be rolled back with the least cost or the one that has done the least work so far), and the other transaction gets to continue.

How to resolve or prevent deadlocks:

  • Application Design: Ensure a consistent ordering of accessing resources (like tables or rows) across transactions. If all transactions lock tables or rows in the same order (e.g., always lock Account table, then Order table in that order), deadlocks are less likely.
  • Lock Granularity: Sometimes deadlocks arise from locking too much data. Using more granular locks (row-level vs table-level) or using READ COMMITTED isolation (which may use less strict locking) can reduce deadlock chances.
  • Short Transactions: Keep transactions short in duration and touch as little data as necessary. This reduces the window where deadlocks can occur.
  • Deadlock Retry Logic: If an application receives a deadlock error (like SQL Server error 1205), it should be designed to retry the transaction after a random short delay, since deadlocks are usually transient.
  • Use appropriate isolation levels: If deadlocks are frequent, perhaps lowering isolation on some read transactions to not hold long locks (like using snapshot isolation or READ COMMITTED with row versioning) can help.

Detection and Logging: Most databases log deadlocks when they happen. For instance, SQL Server chooses a victim and raises error 1205 in that session, while Oracle detects the cycle and rolls back one statement (not the whole transaction by default), raising an ORA-00060 “deadlock detected” error.

In summary, a deadlock is a standoff between transactions over resources. The system typically breaks the deadlock by killing one transaction. As a developer or DBA, you resolve deadlocks by prevention (proper ordering, shorter transactions, indexing to avoid long lock holds during scans, etc.) and by handling exceptions (retrying aborted transactions). Once the deadlock is broken by rolling back one transaction, the other can proceed and the deadlock is resolved.

68. What is database partitioning, and what are its benefits?

Database partitioning is the process of dividing a large database table (or index) into smaller, more manageable pieces, called partitions. Each partition is an independent segment of the data, but together they represent the whole table’s data.

There are different partitioning strategies:

  • Horizontal Partitioning: Distribute rows among partitions. For example, you might partition a table by date range (January data in one partition, February in another) or by an ID modulus (IDs 1-1000 in one, 1001-2000 in another). Each partition has the same columns, but a different subset of rows.
  • Vertical Partitioning: Distribute columns among tables (not often referred to as “partitioning” in DBMS, more like splitting a table).
  • List Partitioning: Partition by a list of values (e.g., region = ‘US’ vs ‘EU’).
  • Hash Partitioning: Partition by a hash function (to evenly distribute data when ranges aren’t obvious).
  • Composite Partitioning: combination of the above (e.g., first by range, then by hash within each range).

Benefits of partitioning:

  • Performance: Queries that target a specific partition can be faster because the database only has to scan that partition instead of the entire table. This is called partition pruning. For example, a query for data in January only touches the January partition. Indexes can also be partitioned, making them smaller and more efficient to search within each partition.
  • Manageability: Partitioning makes maintenance tasks easier. You can load or remove data by partition (e.g., drop an old partition of data that’s past retention, rather than running a slow DELETE). You can rebuild indexes or back up data one partition at a time.
  • Concurrency: Different partitions can sometimes be processed in parallel by the DBMS (parallel query operations, or parallel loading).
  • Reduced contention: If heavy insert/update occurs mostly in one partition (e.g., current month) and reads occur in another (e.g., archival data), they contend less with each other (less locking interference, etc.).
  • Improved Disk I/O: You can store partitions on different disks (spreading I/O) or even different storage types (keep recent hot data on fast SSDs, old archive on slower disks).
  • Partition-wise joins: If two tables are partitioned on the same scheme, joins can be optimized to occur partition-by-partition, potentially improving join performance.

It’s worth noting partitioning is mostly beneficial for very large tables or specific access patterns. For smaller tables, the overhead might not be worth it.

Example Use Case: A massive log table partitioned by day. Benefits:

  • You can drop partitions older than 90 days instantly instead of running a huge delete.
  • Queries for recent logs (by date range) only touch relevant partitions.
  • Index on date is effectively partitioned, making inserts for current date only affect the current partition (less index bloating on entire dataset).
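In PostgreSQL’s declarative partitioning syntax, such a log table might be sketched like this (hypothetical names):

CREATE TABLE Logs (
    LogDate  date NOT NULL,
    Message  text
) PARTITION BY RANGE (LogDate);

CREATE TABLE Logs_2024_01 PARTITION OF Logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Retiring a month of data is a quick metadata operation, not a huge DELETE:
DROP TABLE Logs_2024_01;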

69. What is an execution plan in SQL and how is it useful?

An execution plan (or query plan) is a detailed roadmap of how the database query optimizer will execute a given SQL query. It shows the steps the database will take to retrieve or modify data, including which indexes will be used, join algorithms, the order of operations, how tables are scanned, etc.

Key elements you might see in an execution plan:

  • Operations: like table scans (sequential scan), index seeks, index scans, nested loop join, hash join, merge join, sort, aggregation, etc.
  • Order of execution: Typically a tree or hierarchy of operations showing what feeds into what.
  • Cost estimations: The optimizer’s estimated cost (perhaps relative) for each step or overall. This often includes estimated number of rows at each step.
  • Predicate details: which filters (from WHERE clause) are applied at which step.

Execution plans can be viewed via commands like EXPLAIN (PostgreSQL, MySQL) or graphical tools (SQL Server Management Studio’s “Display Estimated Execution Plan”). They can be actual (after running the query) or estimated (without running).

How it is useful:

  • It allows developers/DBAs to understand how a query is being executed internally, which is crucial for performance tuning.
  • By examining the plan, you can verify if the query is using indexes as expected or if it’s doing a full table scan.
  • You can spot if a join is done by a less efficient method (like nested loop over large sets) vs a more optimal one (like hash join for large, unsorted sets).
  • If a query is slow, the plan can reveal the bottleneck (e.g., maybe it’s sorting a huge dataset because an index is missing).
  • Helps in identifying if statistics misled the optimizer (for example, estimated vs actual rows differ greatly).
  • Plan can show if partition pruning is happening, if parallel execution is used, etc.

Example: You have a slow query SELECT * FROM Orders WHERE CustomerID = 123. The plan might show a full table scan on Orders if there’s no index on CustomerID. Realizing that, you add an index and check the plan again, now it shows an index seek on the new index, which should be much faster.
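In PostgreSQL, that investigation might look like this (a sketch; the index name is hypothetical):

EXPLAIN ANALYZE
SELECT * FROM Orders WHERE CustomerID = 123;
-- plan shows: Seq Scan on orders (a full table scan)

CREATE INDEX idx_orders_customerid ON Orders (CustomerID);

EXPLAIN ANALYZE
SELECT * FROM Orders WHERE CustomerID = 123;
-- plan now shows: Index Scan using idx_orders_customerid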

Another example: A query joining two big tables without an index on the join column might use a hash join or merge join with a sort. The plan might highlight a large sort operation or scan that indicates you should add an index or restructure the query.

70. What are window functions in SQL and why are they useful?

Window functions (also known as analytic functions) perform calculations across a set of rows that are related to the current row, as defined by an OVER clause. Unlike regular aggregate functions, window functions do not collapse rows; each row retains its identity and can have an aggregated value computed from a “window” of surrounding rows.

Key points:

  • A window function uses the syntax: function_name(expr) OVER (PARTITION BY ... ORDER BY ... frame_clause).
  • Common window functions include:
    • Aggregate window functions: SUM(), AVG(), COUNT(), MIN(), MAX() used as window functions.
    • Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE().
    • Value functions: LAG(), LEAD() (to access previous or next row’s values), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE().
  • PARTITION BY divides the result set into partitions (like grouping, but without collapsing the rows).
  • ORDER BY within OVER can define ordering to, for example, compute running totals or rankings.
  • The frame_clause (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) specifies the subset of the partition considered for the current row’s calculation (if omitted, defaults depend on function; for ranking, the whole partition is default, for aggregates default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in many databases).

Why they are useful:

  • They allow you to perform calculations like running totals, moving averages, rankings, percentiles, lead/lag comparisons, etc., in a single query without the need for self-joins or subqueries.
  • They are very powerful for reporting and analytics. For example, you can easily add a column with the cumulative sum of sales or the rank of each row within its category.
  • They often result in clearer, more efficient SQL than alternative approaches. For instance, using ROW_NUMBER() to deduplicate data or LAG() to find differences between consecutive rows.
  • Window functions can access data from other rows while still returning a result for each row, which standard aggregates cannot do directly.

Example use case:
Suppose you have a Sales table (Date, Region, Amount). If you want to list each sale and also show the total sales for that region, you can do:

SELECT Date, Region, Amount,
       SUM(Amount) OVER (PARTITION BY Region) AS TotalByRegion
FROM Sales;
This will output each sale row, alongside a TotalByRegion that repeats the total for that region on each row. A regular SUM with GROUP BY would have collapsed output to one row per region.

Another example:
Ranking – say you want to rank employees by salary within each department:

SELECT Name, Department, Salary,
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) as SalaryRank
FROM Employees;
This lists employees and their rank in salary in their department. Doing this without window functions would be more complex (maybe a self-join or subquery counting how many have higher salary).

Window functions do not remove rows, making them ideal for comparative analyses (like “compare each row to the average of its group” or “difference from previous row”).
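For example, the “difference from previous row” pattern uses LAG (a sketch reusing the Sales(Date, Region, Amount) table from the first example):

SELECT Date, Region, Amount,
       Amount - LAG(Amount) OVER (PARTITION BY Region ORDER BY Date) AS ChangeFromPrev
FROM Sales;

LAG(Amount) returns NULL on each region’s first row (there is no previous row), so ChangeFromPrev is NULL there.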

In summary, window functions allow you to perform advanced calculations across sets of rows (windows) without aggregating away details. They are extremely useful for analytical queries and reporting.

71. What is a cursor in SQL, and when would you use it?

A cursor is a database object used to retrieve, manipulate, and navigate through result set row-by-row. It acts as a pointer to the result set of a query. In procedural extensions of SQL (like T-SQL, PL/SQL, PL/pgSQL), cursors allow you to loop through rows one at a time and process them.

Typical steps in using a cursor (in, say, T-SQL or PL/SQL):

  1. Declare the cursor (with a SELECT statement to define the result set).
  2. Open the cursor (execute the query and establish result set).
  3. Fetch rows one-by-one (or in blocks) from the cursor into variables.
  4. Process the data for each row within a loop.
  5. Close the cursor (release the result set).
  6. Deallocate or free the cursor resources.

When to use it:

  • Cursors are useful when you need to perform row-by-row processing that can’t be done in a set-based operation easily. For example, complex sequential calculations where the next row’s logic depends on the previous row’s results, and it might be hard or impossible to do with window functions or set logic.
  • When interfacing with procedural code that inherently needs to handle one row at a time, such as performing an action for each row (though often, set-based or batch operations are preferred).
  • Some administrative tasks or algorithms are easier to implement with cursors.

However, cursors are generally slower and more resource-intensive than set-based operations, because they bypass the set-oriented nature of SQL. Operating row-by-row is usually much less efficient.

Today, many tasks that once might have used cursors can often be done with:

  • Window functions (for calculations that depend on other rows).
  • SET-based updates/inserts (using single SQL statements that affect multiple rows).
  • Stored procedures with vectorized operations.

Example scenario for cursor:
Imagine you need to send a custom email to each user in a table, and you want to use database code to fetch user info and call an email sending routine for each. You might use a cursor to loop through each user record and execute a send operation. (Though in many architectures you’d pull into application code to do that.)

In SQL Server, a common alternative is using a set-based approach or a WHILE loop with a table variable. But sometimes cursors (especially FAST_FORWARD or READ_ONLY cursors) are acceptable for things like generating sequential numbers in certain odd scenarios, or legacy code.

72. What is a materialized view, and how does it differ from a regular view?

A materialized view is a database object that contains the results of a query physically stored on disk. In other words, it’s like a view that has been computed and cached as a table.

Regular view:

  • A standard (regular) view is just a stored query definition. It doesn’t hold data itself. When you query a view, the database effectively runs the view’s query on the underlying tables each time. It’s essentially a macro or logical layer; no data is stored specifically for the view.
  • Views are always up-to-date because they reflect the current underlying data (since they compute results on the fly), but there’s no performance benefit beyond query simplification or security (except the DB query optimizer might merge the view’s query with the outer query for optimization).

Materialized view (MV):

  • It stores the view’s result set physically. It’s like a snapshot or caching of the data derived by the view’s query.
  • Because it’s stored, you can index a materialized view and query it like a table, leading to potentially huge performance improvements for complex queries (especially aggregates or joins) that don’t need real-time freshness.
  • However, because it stores data, it needs to be refreshed to reflect changes in the underlying tables. Refresh can be done on demand (manually), or on a schedule, or immediate (some systems support immediate refresh where changes to base tables propagate to MVs, but that can be complex or using logs).
  • MVs take up storage space and must be maintained (either automatically or via explicit refresh). They can become stale if not refreshed.

Differences recap:

  • Data storage: Regular view = no, MV = yes.
  • Performance: Regular view = query executed on the fly; MV = precomputed result for faster reads.
  • Freshness: Regular view = always fresh (no separate maintenance); MV = might be stale, needs refresh to update.
  • Use cases:
    • Regular view: simplify complex queries, encapsulate logic, provide security by exposing only certain columns/rows.
    • Materialized view: improve performance of expensive queries (like summary tables) in scenarios where data doesn’t change too rapidly or real-time accuracy isn’t required, or where you can afford to maintain the MV.

Example:
Suppose you have a huge sales table and often need to query the total sales per region. A view TotalSalesByRegion might be SELECT Region, SUM(Amount) FROM Sales GROUP BY Region. If you use a regular view, each query on it will scan and sum possibly millions of rows from Sales. If you use a materialized view for that, the sum per region is stored; a query on the materialized view is just like reading from a small precomputed table of regions (very fast). You’d then refresh this MV perhaps nightly or whenever new sales data comes in.
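In PostgreSQL syntax, that materialized view would be sketched as:

CREATE MATERIALIZED VIEW TotalSalesByRegion AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM Sales
GROUP BY Region;

-- After new sales data is loaded, bring the stored results up to date:
REFRESH MATERIALIZED VIEW TotalSalesByRegion;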

In Oracle and PostgreSQL (via “materialized view”), and SQL Server (via “indexed views”), these exist albeit with different names/implementation details. MySQL doesn’t have native materialized views (people often simulate with triggers or application logic).

In summary: A materialized view is a physical copy of data derived from a query (for speed), whereas a standard view is a virtual table defined by a query (for convenience and logical abstraction).

73. How do you prevent SQL injection in your queries?

SQL injection is a security vulnerability that occurs when an application incorporates untrusted user input into an SQL query in an unsafe manner, allowing attackers to alter the query’s structure. To prevent SQL injection:

  • Use Prepared Statements (Parameterized Queries): This is the most important protection. Rather than constructing SQL strings by concatenation, use placeholders (e.g., ? or named parameters like @param) and then bind user input to those parameters. The database treats bound values as data, not as part of the SQL command, thus injection is prevented. For example, in Java using JDBC:

PreparedStatement ps = conn.prepareStatement(
    "SELECT * FROM Users WHERE username = ? AND password = ?");
ps.setString(1, userInputName);
ps.setString(2, userInputPassword);
ResultSet rs = ps.executeQuery();

    Or in PHP PDO:

$stmt = $pdo->prepare("SELECT * FROM Users WHERE username = :user");
$stmt->execute(['user' => $userInput]);

    Parameterization ensures that even if userInput contains something like "' OR 1=1 --", it will be treated as a literal string, not as SQL.
  • Use Stored Procedures (with caution): If you call stored procedures with parameters (and not by string building), it similarly separates code from data. However, stored procedures themselves can be vulnerable if they dynamically build queries insecurely inside.
  • Input Validation and Sanitization: While prepared statements are usually sufficient, it’s also good to validate user input for expected format (like ensure an ID is numeric, or a name contains no weird characters) and length (to avoid overly long inputs). This is more about defense-in-depth and user feedback; it’s not foolproof against injection if done alone, but it’s a good practice.
  • Least Privilege: Ensure the database account used by the application has the minimum rights needed. That way, if an injection does happen, the damage is limited (e.g., the account might not have DROP TABLE privileges).
  • ORMs / High-level APIs: Using an Object-Relational Mapper or other database abstraction can help, as they typically handle parameterization for you. But still remain aware of raw query methods.
  • Avoid constructing SQL with direct user input insertion: e.g., avoid directly including user input in an ORDER BY or in a list of columns unless you validate it against an allowlist of safe values.
  • Use of Stored Procedures or Functions for set logic: can reduce dynamic SQL usage.
  • Escaping inputs properly (less preferred than parameterization, but in some contexts like dynamic LIKE patterns you might escape wildcard characters). If parameterization is not available (like older languages, or constructing SQL scripts), you must escape characters like quotes. But manual escaping can be error-prone and vary by DB syntax.

Illustration:
If you had concatenated:

$query = "SELECT * FROM Users WHERE username = '$userInput'";
And userInput was alice' OR '1'='1, the resulting SQL becomes:
SELECT * FROM Users WHERE username = 'alice' OR '1'='1'

Which is always true (returns all users). Using a parameterized query avoids that because the ' in input is handled properly.

In summary, parameterize every variable that goes into SQL. That is the number one rule. Proper ORMs and database libraries make this straightforward. Additional measures (validation, using least privileges, etc.) add layers of protection. By doing so, you ensure that user input cannot break out of the intended query structure.

74. How do SQL dialects like PL/SQL and T-SQL differ from standard SQL?

SQL dialects such as PL/SQL (Oracle) and T-SQL (Transact-SQL for Microsoft SQL Server/Sybase) extend standard SQL with programming constructs and DB-specific features. Some differences include:

  • Procedural Extensions:
    • PL/SQL and T-SQL both allow procedural code (variables, loops, conditionals, etc.) to be executed on the database side (in stored procedures, functions, triggers). Standard ANSI SQL is just query language without procedural logic.
    • For example, PL/SQL has BEGIN ... END blocks, IF-THEN-ELSE, LOOP, etc.; T-SQL offers the same capabilities with different syntax (IF ... ELSE without THEN, WHILE loops, and BEGIN...END blocks for batches).
  • Syntax and Functions:
    • T-SQL uses @ as a prefix for variables (@var), whereas PL/SQL variables are defined in a declaration section without a prefix.
    • PL/SQL uses := for assignment; T-SQL uses SET @var = value or SELECT @var = ....
    • Date/time functions and others differ: T-SQL has GETDATE() for current datetime, Oracle uses SYSDATE.
    • String concatenation: T-SQL uses +, Oracle uses || (standard SQL).
  • Procedures and Functions Differences:
    • Oracle PL/SQL distinguishes between procedures (no return) and functions (return a value) and allows both to be stored. T-SQL has stored procedures (which can return status via RETURN int and output parameters) and user-defined functions (which can return scalar or table).
    • Error handling: T-SQL historically used @@ERROR and now TRY...CATCH blocks, PL/SQL uses EXCEPTION ... WHEN ... blocks.
  • Batches and Blocks:
    • In T-SQL, a batch is a set of commands sent at once (delimited by GO in tools), and variables have batch scope. In PL/SQL, you have named blocks (like in stored procedures) and an anonymous block for one-off.
  • Built-in Packages/Procedures:
    • Oracle has extensive built-in packages (DBMS_OUTPUT, DBMS_SCHEDULER, etc.) and PL/SQL-specific libraries. SQL Server has system stored procedures (like sp_help) and functions and different system views.
  • Transactional behavior:
    • Oracle commits DDL automatically (each DDL statement implicitly commits the current transaction). Also, a PL/SQL function invoked from a SQL statement cannot commit (unless declared as an autonomous transaction), whereas T-SQL allows transaction control inside procedures more straightforwardly.
  • Differences in DML with output:
    • Oracle’s RETURNING clause vs T-SQL’s output clause. For example, INSERT ... RETURNING col INTO variable in PL/SQL vs OUTPUT Inserted.col in T-SQL.
  • Cursors and triggers syntax:
    • Both have cursors but syntax differs slightly. T-SQL triggers are per-statement and use inserted/deleted tables, Oracle triggers can be per-row or per-statement with :OLD/:NEW references.
  • Standard Compliance:
    • Both T-SQL and PL/SQL support core SQL (SELECT, etc.), but differ on advanced SQL standard features. Oracle tends to stick closer to standard in some syntax (like MERGE is similar across, CASE expressions), whereas T-SQL historically had some non-standard aspects (like the old outer join syntax *=, which is deprecated).

Summarizing: PL/SQL (Oracle’s Procedural Language SQL) and T-SQL (Microsoft/Sybase) allow you to embed SQL in programming logic on the server. They have different syntax and built-in functions, but both extend standard SQL (which is mainly declarative) with imperative capabilities. They also differ in some SQL specifics (like how to paginate results before SQL:2008 standard: T-SQL used TOP, Oracle used ROWNUM or later ANSI OFFSET/FETCH).

In practice, if you move from Oracle to SQL Server or vice versa, you’d need to adapt things like: variable declaration (DECLARE @x INT vs x NUMBER; in PL/SQL declare section), string concatenation, date functions, how to execute stored procedures, etc. The fundamental SQL (SELECT/INSERT/UPDATE) is similar, but these dialects differ in their procedural and system-specific features.
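A side-by-side sketch of the variable-declaration difference (hypothetical snippets):

-- T-SQL (SQL Server)
DECLARE @greeting VARCHAR(20) = 'Hello';
PRINT @greeting;

-- PL/SQL (Oracle anonymous block)
DECLARE
    greeting VARCHAR2(20) := 'Hello';
BEGIN
    DBMS_OUTPUT.PUT_LINE(greeting);
END;
/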

75. What is the difference between OLTP and OLAP databases?

OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two different types of database usage with distinct characteristics:

  • OLTP systems are optimized for handling a large number of short, atomic transactions that modify the database (e.g., insert/update/delete operations). They support day-to-day transactional tasks in an application.
    • Workload: Many concurrent users, frequent small transactions (like bank account updates, order entries).
    • Data Model: Highly normalized schema (3NF or beyond) to avoid anomalies and ensure fast, consistent writes. Typically relational databases.
    • Query Types: Simple, specific queries (read/write) focusing on one or few records (like fetch this customer’s order, update this inventory count). Queries are typically indexed for primary key lookups or small range scans.
    • Examples: Banking systems, e-commerce order systems, inventory management, etc. Real-time systems where consistency and speed of individual transactions are critical.
    • Performance Metrics: Measured by TPS (transactions per second), low latency for individual operations, concurrency handling.
  • OLAP systems are designed for analysis and reporting on large volumes of data. They often power business intelligence (BI), allowing complex queries to aggregate and analyze historical data.
    • Workload: Fewer users (analysts) running complex, long-running queries that process lots of data (scans, aggregations).
    • Data Model: Often denormalized schemas like star or snowflake schemas (fact and dimension tables). Redundancy is accepted to optimize read performance (like having pre-aggregated data). Data warehouses often fall here.
    • Query Types: Complex queries involving joins, aggregations across millions of rows (e.g., total sales by region by quarter for last 5 years). Lots of read-only queries; updates are batched (like nightly ETL).
    • Examples: Data warehouse for a retailer, where management runs queries to find trends, or a dashboard that shows KPIs aggregated from many records. OLAP cubes or analytical databases like SAP BW, etc.
    • Performance Metrics: Measured by query throughput and the ability to handle big scans and aggregations in reasonable time – for example, how quickly a multi-dimensional analysis returns. Emphasis is on high read performance, often using specialized indexing (like bitmap indexes) and parallel processing.

Key differences:

  • Purpose: OLTP = operational, record-keeping; OLAP = analytical, decision support.
  • Data updates: OLTP constant inserts/updates by users; OLAP mostly periodic bulk loads, then heavy reads.
  • Schema design: OLTP normalized to reduce duplicates, ensure integrity in transactions; OLAP denormalized to reduce joins and accelerate read queries.
  • Data volume: OLTP deals with current data pertinent to operations (often a smaller time window of data); OLAP accumulates large historical data sets (often summarizing across long periods).
  • Examples:
    • OLTP: Making a travel reservation (each seat booking is a transaction).
    • OLAP: Analyzing which flight routes have had the highest occupancy over the last year and how that trends – requiring a scan over a large log of bookings.

In practice, organizations often maintain a separate database for OLAP (a data warehouse) that is fed from OLTP databases through ETL processes. This separation ensures the heavy analytical queries don’t slow down the transactional system, and the transactional system’s frequent writes don’t interfere with analytical consistency requirements.
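To make the contrast concrete, here is a sketch of a typical query from each world (all table and column names are illustrative):

-- OLTP: a short, indexed, single-row transaction
UPDATE Accounts
SET Balance = Balance - 100
WHERE AccountID = 42;

-- OLAP: a long-running aggregation over a star schema
SELECT d.Region, t.Quarter, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimStore d ON f.StoreKey = d.StoreKey
JOIN DimDate  t ON f.DateKey  = t.DateKey
GROUP BY d.Region, t.Quarter;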

76. Write a SQL query to compute a running total (cumulative sum) of sales ordered by date.

To compute a running total (cumulative sum) of sales by date, we can use a window function. Assume a Sales table with columns SaleDate and Amount, and that we want the running total ordered by date (if there are multiple sales on the same date, they simply accumulate one row at a time in that order; you could also aggregate per day first).

A query using the window function SUM() OVER would be:

SELECT 
    SaleDate,
    Amount,
    SUM(Amount) 
      OVER (ORDER BY SaleDate 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS RunningTotal
FROM Sales
ORDER BY SaleDate;

Explanation:
  • SUM(Amount) OVER (ORDER BY SaleDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) computes the cumulative sum of Amount from the first row up to the current row, in SaleDate order.
  • The frame is spelled out here for clarity. If you omit it, most dialects default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with the same SaleDate as peers and gives them identical running totals; the ROWS frame accumulates strictly row by row.
  • If SaleDate is not unique, tied rows still accumulate sequentially under the ROWS frame, though their relative order is arbitrary unless you add a tie-breaker to the ORDER BY.
  • The final ORDER BY SaleDate outputs the rows in chronological order alongside their running totals.

For example, if the table has:

SaleDate     Amount
2023-01-01   100
2023-01-03   200
2023-01-05   50

The result would be:

SaleDate     Amount   RunningTotal
2023-01-01   100      100
2023-01-03   200      300
2023-01-05   50       350

In a database without window functions (older systems), one could use a self-join or a correlated subquery:
SELECT a.SaleDate, a.Amount,
       (SELECT SUM(b.Amount) 
        FROM Sales b
        WHERE b.SaleDate <= a.SaleDate) as RunningTotal
FROM Sales a
ORDER BY a.SaleDate;

This correlated subquery adds up all amounts up to the current row's date. It works, but it re-scans the table for every row (roughly O(n²)), so the window-function version is the preferred, more efficient solution wherever it is supported.
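A common variation: to restart the accumulation per group (say, per region), add PARTITION BY to the window – a sketch assuming the Sales table also had a Region column:

SELECT Region, SaleDate, Amount,
       SUM(Amount) OVER (PARTITION BY Region
                         ORDER BY SaleDate
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                        ) AS RegionRunningTotal
FROM Sales
ORDER BY Region, SaleDate;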

77. What is the difference between the RANK() and DENSE_RANK() functions?

Both RANK() and DENSE_RANK() are window functions used to provide a rank to rows in an ordered partition. The difference lies in how they handle ties (equal values):

  • RANK(): If there are ties in the ordering, RANK will assign the same rank to tied rows, and then skip subsequent ranks. In other words, it leaves gaps in the ranking sequence after ties. For example, if ordering by a score descending, and two participants have the highest score, they both get rank 1. The next highest score gets rank 3 (because ranks 1, 1, then 3 – rank 2 is skipped). Specifically, rank is “1 + number of rows preceding in the ordered partition.” So if two rows share rank 1, the next row’s rank = 1 + 2 = 3.
  • DENSE_RANK(): If there are ties, DENSE_RANK also gives the same rank to tied rows, but it does not skip any ranks after ties. It compacts the ranking. Using the same scenario, two participants tie for highest score => both get dense_rank 1. The next highest score gets dense_rank 2 (no gap). Dense rank is “1 + number of distinct values preceding in order.”

For example, suppose the Results table holds these scores, ordered by Score DESC: Alice 90, Bob 90, Charlie 80, Dave 75, Eve 75 (the exact numbers are illustrative).

  • Using RANK():
    • Alice: rank 1 (top score)
    • Bob: rank 1 (tie with Alice, so same rank)
    • Charlie: rank 3 (two rows precede him, so 2 + 1 = 3; rank 2 is skipped)
    • Dave: rank 4 (three rows precede him, so 3 + 1 = 4)
    • Eve: rank 4 (tie with Dave; a row after this tie would get rank 6)
    Final sequence: 1, 1, 3, 4, 4 – ranks 2 and 5 are skipped.
  • Using DENSE_RANK():
    • Alice: dense_rank 1
    • Bob: dense_rank 1 (tie)
    • Charlie: dense_rank 2 (the next distinct score gets the next rank, not 3)
    • Dave: dense_rank 3 (75 is the third distinct score)
    • Eve: dense_rank 3 (tie with Dave)
    Final sequence: 1, 1, 2, 3, 3 – no gaps.

Summary:

  • RANK: sequence might go 1,1,3,4,… (gaps in numbers for ties).
  • DENSE_RANK: sequence might go 1,1,2,3,… (no gaps; “dense” meaning no gaps in ranking numbers).

Which one to use depends on how you want to assign subsequent ranks after ties. For example, in competition scoring:

  • If two athletes tie for 1st, and the next is called “3rd place”, that’s RANK behavior.
  • If you prefer to call the next “2nd place” (despite two people ahead), that’s DENSE_RANK behavior.

SQL usage:

SELECT Name, Score,
       RANK() OVER (ORDER BY Score DESC) AS ScoreRank,
       DENSE_RANK() OVER (ORDER BY Score DESC) AS ScoreDenseRank
FROM Results;

78. What is a surrogate key in a database?

A surrogate key is an artificially generated key that uniquely identifies a record in a table, which is not derived from any application data. It's typically a single integer (or UUID, etc.) assigned to each row, often automatically by the database (like an auto-increment column or sequence).

Key points about surrogate keys:

  • They have no inherent business meaning – they’re not derived from the actual attributes of the data. For example, an EmployeeID that is just an auto-increment number is a surrogate key for an Employee.
  • They are used as primary keys in place of natural keys (keys derived from actual data – for example, a Social Security Number or an email address could serve as a natural key for a person).
  • Surrogate keys are often integer numbers or GUIDs. Integers are popular because they are compact and auto-increment ensures uniqueness easily.
  • They simplify relationships – using a single surrogate key column as a primary key and foreign keys referencing it is straightforward and avoids multi-column key issues.

Advantages:

  • Simplicity: It’s just one column (often numeric), easy to index and join on.
  • Stability: Surrogate keys typically never change, whereas natural data might change (e.g., a person might change their email or even name, but an internal PersonID would remain the same).
  • Uniqueness guaranteed by system (no need to worry about composite uniqueness or natural key collisions if using surrogate).
  • Decoupling from business logic: If business rules change (say two companies merge and now there might be duplicate customer numbers), surrogate keys can remain unaffected.

Disadvantages:

  • They don’t convey any information themselves (which can be an advantage or disadvantage). For debugging or manual lookups, a natural key can be more informative.
  • If you’re not careful, they can allow logical duplicates: uniqueness is enforced only on the surrogate, so two rows might accidentally represent the same real entity unless natural uniqueness is also enforced with a separate constraint.
  • Additional join needed sometimes to get human-meaningful data (you have an Order table with CustomerID surrogate, you must join to Customer to see which actual customer).

Surrogate vs Natural key example:
Natural key: A country’s ISO code (like “US” for United States) could be a natural primary key for a Country table. Surrogate key: assign an integer like CountryID = 1 for USA, 2 for Canada, etc.
If someday a new ISO standard changes “US” to something else, the surrogate ID 1 can remain constant and all foreign keys to Country remain valid.
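A minimal sketch of this pattern (IDENTITY is SQL Server syntax; Oracle, PostgreSQL, and MySQL use sequences, SERIAL, or AUTO_INCREMENT instead). Note the UNIQUE constraint on the natural key, which guards against the logical-duplicate risk mentioned above:

CREATE TABLE Country (
    CountryID   INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key, no business meaning
    IsoCode     CHAR(2) NOT NULL UNIQUE,       -- natural key, still enforced as unique
    CountryName VARCHAR(100) NOT NULL
);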

When to use surrogate keys:
Surrogate keys are common in most designs, especially in large systems where natural keys are composite, large, or prone to change. Almost all dimension tables in a data warehouse use surrogate keys (dimension IDs), both for performance and for managing slowly changing dimensions. In OLTP systems, surrogate keys are very common for main entities (CustomerID, OrderID, etc.).

In summary, a surrogate key is a substitute for a natural primary key. It’s an arbitrary unique identifier (with no business meaning) used to uniquely identify each row.

Final Thoughts

Preparing for an SQL interview combines mastering technical skills with developing effective communication strategies. By immersing yourself in regular practice, revisiting core concepts, and understanding the intricacies of database design, you build a strong foundation. Equally important is your ability to articulate your problem-solving process clearly and confidently.

Remember, interviews are not just assessments of your knowledge but also opportunities to showcase your analytical thinking and adaptability. Engage with the interviewer, ask clarifying questions, and walk them through your reasoning. This not only demonstrates your technical proficiency but also your collaborative mindset.

Ultimately, your dedication to continuous learning and your ability to apply knowledge effectively will set you apart. Approach each interview as a learning experience, and let your passion for data and problem-solving shine through.
