How Important Is Data Structures and Algorithms Knowledge for Data Scientists?

Emma Delaney
5 min read · May 25, 2023

Introduction

Algorithms and data structures are considered core skills for software developers. How useful are these skills for data scientists and data analysts?

A typical data scientist spends most of their time in high-level languages like Python, R, and SQL and rarely has to think about the underlying implementations. Indeed, most machine learning and data analysis algorithms are already built into highly optimized, ready-to-use libraries such as scikit-learn, NumPy, pandas and others (R fans have their own tools).

To answer the question of whether algorithms and data structures are worth studying, I’m going to list a few tools that will be available to you after a typical algorithms course.

Algorithmic complexity

The first useful concept you will come across is algorithmic complexity and Big-Oh notation. It helps you understand how your code scales as the data grows. This matters for data scientists because the volume of data to be processed increases every day.

By stripping out the less important details, you can judge the performance of an algorithm whether it’s written in Python or C and whether it runs on a laptop or a NASA supercomputer. In a sense, it defines a foundational vocabulary for designing and analyzing algorithms, while removing architecture- and language-dependent details. Those details are treated as constant factors and do not change the overall picture.

For example, imagine you are asked to choose a machine learning algorithm for a specific classification task, and you consider a Support Vector Machine (SVM) because it handles non-linear decision boundaries well. You open the documentation to check the details of SVM and find a note on its computational complexity.

Once you understand the complexity of the algorithm, you may reconsider using SVM for large datasets because of its high computational requirements, in the O(n²) to O(n³) range. This means that if we double the amount of data, the execution time will increase roughly four to eight times.
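As a rough back-of-the-envelope sketch (not a benchmark), the scaling claim above can be checked in a few lines of Python. The helper name `growth_factor` is made up for illustration; it assumes the running time is proportional to n raised to some exponent and ignores constant factors.

```python
def growth_factor(exponent: float, data_multiplier: float = 2.0) -> float:
    """Return how many times slower an O(n**exponent) algorithm becomes
    when the input grows by `data_multiplier` (constant factors ignored)."""
    return data_multiplier ** exponent


for exponent in (1, 2, 3):
    factor = growth_factor(exponent)
    print(f"O(n^{exponent}): doubling the data costs about {factor:.0f}x the time")
```

The same helper shows why a tenfold increase in data is far more painful: about 100x for a quadratic algorithm and 1000x for a cubic one.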

A small caveat about Big-Oh notation: it is an “asymptotic analysis”, which means it is only theoretically valid for large enough inputs, because the constant factors it ignores can be quite large in practice.

Graph algorithms

Graph algorithms are relevant in the world of data science and find applications in fraud detection, clustering, classification, recommender systems, etc.

Algorithm courses tend to focus heavily on graphs, probably because graphs make it easy to demonstrate common algorithmic paradigms.

Among the graph algorithms, you will find:

  • Generic graph search: systematically traverses a graph (breadth-first and depth-first search).
  • Shortest path: calculates the shortest path between two vertices. Applicable to tasks such as network analysis, navigation, social network connections and fraud detection.
  • Connected components and minimum cut: enable detection of communities and clusters in the graph.
  • Minimum spanning tree: useful for clustering and image segmentation.

Time spent studying graph algorithms pays off: you will be able to implement fast solutions yourself when you need them.

While you’re more likely to use existing implementations of graph algorithms than write one from scratch, knowing their inner workings and limitations will help you choose the right algorithm for the job.
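To make the shortest-path idea concrete, here is a minimal sketch of breadth-first search on an unweighted graph. The function name `shortest_path` and the toy `friends` network are assumptions made up for this example; a production project would more likely rely on an existing graph library.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph stored as an adjacency dict.

    Returns the list of vertices on a shortest path, or None if unreachable.
    """
    parents = {start: None}  # doubles as the set of visited vertices
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parents[vertex]
            return path[::-1]
        for neighbor in graph.get(vertex, ()):
            if neighbor not in parents:
                parents[neighbor] = vertex
                queue.append(neighbor)
    return None


# Hypothetical toy "who knows whom" network.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
    "zed": [],
}
print(shortest_path(friends, "ann", "eve"))
print(shortest_path(friends, "ann", "zed"))
```

Because every vertex and edge is handled at most once, the running time is linear in the size of the graph, which is the kind of limitation-and-cost knowledge that helps when picking a library routine.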

Data Structures

Python is an example of a programming language with convenient and versatile data structures. They are so convenient that many people use the Python list as their default data store for every need. This can cause severe performance bottlenecks in some situations.
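One situation where the default list hurts is repeated membership testing. The sketch below uses made-up data (`known_ids_list`, `queries`) and a hypothetical helper `count_hits`; exact timings will vary by machine, so treat the printed numbers as indicative only.

```python
import timeit

known_ids_list = list(range(5_000))
known_ids_set = set(known_ids_list)
queries = range(4_500, 5_500)  # half present, half absent


def count_hits(container):
    """Count how many queries are present in the container."""
    return sum(1 for q in queries if q in container)


# Same answer either way; only the cost of each `in` check differs
# (a linear scan for a list, average constant time for a set).
assert count_hits(known_ids_list) == count_hits(known_ids_set) == 500

print("list:", timeit.timeit(lambda: count_hits(known_ids_list), number=1))
print("set: ", timeit.timeit(lambda: count_hits(known_ids_set), number=1))
```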

By studying data structures, you will understand their strengths and weaknesses and be able to reason about which data structure fits which task.

You will discover useful data structures, such as:

  • Heaps: for use cases that repeatedly need the object with the minimum (or maximum) value. Interesting applications of heaps are event simulations and online (streaming) calculation of the median.
  • Binary search trees: a Swiss army knife data structure that, when kept balanced, is very efficient for a variety of operations such as insertion, selection, and search.
  • Hash tables: show how sets and dictionaries are implemented behind the scenes and why they are so well suited to lookup applications. You will learn about hash functions, collisions, and the security risks caused by incorrect implementations.
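The streaming median mentioned under heaps can be sketched with two heaps from Python's standard `heapq` module. The class name `RunningMedian` is an assumption for this example, and this is one common way to do it rather than the only one.

```python
import heapq


class RunningMedian:
    """Median of a stream, kept with two heaps.

    `low` is a max-heap (stored as negated values) holding the smaller half;
    `high` is a min-heap holding the larger half.
    """

    def __init__(self):
        self.low = []
        self.high = []

    def add(self, value):
        # Push onto the lower half, then move its largest element up,
        # so every element of `low` stays <= every element of `high`.
        heapq.heappush(self.low, -value)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so `low` has the same size as `high` or one more.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if not self.low:
            raise ValueError("no values seen yet")
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2


stream = RunningMedian()
for reading in [5, 15, 1, 3]:
    stream.add(reading)
    print(reading, "->", stream.median())
```

Each new value costs O(log n) instead of re-sorting everything seen so far, which is what makes the approach practical for streaming data.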

One of the most interesting data structures is the Bloom filter. It is a probabilistic data structure designed for space-efficient storage and fast membership lookups. A small warning: there is a non-zero probability of a false positive, where the filter reports that an object is present when in fact it is not. It never reports a stored object as absent, however, and for some applications the false positives are not critical. Bloom filters are used, for example, in web browsers to cache addresses of recently visited pages and in Internet routers.
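A toy version of a Bloom filter fits in a few lines. The sketch below is illustrative only: the class name `BloomFilter`, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions for this example, and real implementations use packed bit arrays and faster hashes.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive `num_hashes` positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


visited = BloomFilter()
visited.add("https://example.com/a")
print("https://example.com/a" in visited)  # added items are always found
print("https://example.com/b" in visited)  # usually False, but may be a false positive
```

Note that the filter stores only bits, never the items themselves, which is where the space savings come from.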

By identifying bottlenecks and optimizing data processing with appropriate data structures, you can speed up specific applications by an order of magnitude. This is particularly relevant for data scientists working on simulation and optimization problems.

Algorithmic Paradigms

You will understand common approaches to various algorithmic problems:

  • Divide and conquer algorithms: break larger problems into smaller subproblems and solve them recursively. This is best demonstrated using sorting algorithms.
  • Randomized algorithms: rely on random choices and probability theory to reach a solution efficiently. These are particularly interesting from a data science perspective. Examples: QuickSort, or selecting order statistics such as the minimum, maximum or median.
  • Greedy algorithms: intuitive algorithms useful for applications such as scheduling and caching.
  • Dynamic programming: applicable to a wide variety of problems. This is usually demonstrated using the “knapsack problem”, a basic constrained optimization problem that occurs in several real-world scenarios.
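As an illustration of the dynamic programming item above, here is a minimal sketch of the 0/1 knapsack problem. The function name `knapsack` and the sample items are made up for this example, and it assumes integer weights.

```python
def knapsack(items, capacity):
    """0/1 knapsack by dynamic programming.

    `items` is a list of (value, weight) pairs with integer weights.
    Returns the best total value that fits within `capacity`.
    """
    # best[c] = best value achievable with total weight at most c.
    best = [0] * (capacity + 1)
    for value, weight in items:
        # Go from high to low capacity so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# Hypothetical items as (value, weight) pairs.
items = [(60, 10), (100, 20), (120, 30)]
print(knapsack(items, 50))
```

The table `best` reuses answers to smaller subproblems instead of enumerating every subset of items, which is the core idea of the paradigm.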

The advantage of studying algorithmic paradigms is that you learn how algorithms are designed independently of any programming language, and you can apply this knowledge in your own work.

Intractable Problems

If you’ve ever been curious about what “NP-complete” means, welcome to the world of intractable problems. You will understand which problems are considered intractable and why there is no known way to guarantee an optimal solution that is fundamentally faster than brute-force search. These problems remain an active area of research to this day and are considered among the hardest problems in theoretical computer science.

You will be able to recognize NP-complete problems and learn different approaches to approximate solutions, using methods such as heuristics and local search. A typical example of an NP-complete problem is the traveling salesman problem.

You will rarely face NP-complete problems in practice, but it is useful to know that they exist and how to approach them.
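The trade-off between brute force and a heuristic can be sketched on a tiny traveling salesman instance. Everything here is illustrative: the function names and the `cities` coordinates are made up, and nearest neighbor is just one simple heuristic with no guarantee on how close it gets to the optimum.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the closed tour visiting `points` in the given order."""
    return sum(
        dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tour(points):
    """Optimal tour by trying every ordering: only feasible for tiny inputs."""
    rest = range(1, len(points))
    return min(
        ((0, *perm) for perm in permutations(rest)),
        key=lambda order: tour_length(points, order),
    )


def nearest_neighbor_tour(points):
    """Greedy heuristic: always travel to the closest unvisited city."""
    order = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: dist(last, points[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return tuple(order)


# Hypothetical city coordinates.
cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 1), (7, 0)]
exact = brute_force_tour(cities)
greedy = nearest_neighbor_tour(cities)
print("optimal:", round(tour_length(cities, exact), 2))
print("greedy: ", round(tour_length(cities, greedy), 2))
```

With seven cities the brute-force search checks 720 orderings; the count grows factorially, so a few dozen cities are already out of reach, while the heuristic stays fast at the price of a possibly longer tour.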
