
“Data science” has become a hot term throughout the tech industry. If a company has data to be scienced, you can be sure they’ll call on an elite team to work their computer voodoo magic, process the data, and turn it into value for the business.

But what exactly does that entail? If you wanted to join this elite team of data scientists, what skills should you train for?

Saying you want to become a data scientist is about as specific as saying you want to be a consultant. Different companies and teams have widely varying objectives, which means two roles advertised as “data scientist” could have very few tasks in common. This is especially true once you consider the different roles within a team and the stages of its workflow.

With all that said, data science is built on a common foundation of knowledge and skills, in much the same way that a consultant needs exceptional interpersonal skills regardless of their specialty. This is great news for beginners and aspiring data scientists because it means you can begin with general studies and wait until you have some experience before specializing in a sub-field.

So let’s break down which skills every general data scientist should have and what you should include in your studies.

 

KEY SKILLS

  • Programming
    • Python
    • SQL
    • HTML
  • Mathematics
    • Calculus
    • Linear Algebra
    • Statistics
  • Communication
    • Effective
    • Technical
    • Non-Technical
  • Business Expertise
    • Domain Knowledge
    • DS Application

The Melting Pot

To start let’s take a look at the data science practice as a whole.

Data science is an interdisciplinary field of study, which means it’s a blend of several pre-existing subjects and domains. The four main pillars are:

  • Programming,
  • Mathematics & statistics,
  • Communication, and
  • Business/domain expertise.

There are other subjects and skills that can factor in and be desirable as well depending on the projects at hand.

Anyone who considers themselves a master of the field should be an expert in all four pillars. In reality, few actually are. Since data science is a relatively young and growing field, the majority of professionals are still in the early stages of their careers and on their way to mastering the different subjects.

Diagram of the pillars of Data Science [Source: Stephen Kolassa]

The role that each of these pillars plays in the work can be made clear by reviewing the general workflow process that data science projects follow.

There are three main stages:

  1. Preparation – The data requirements for the project are defined and subsequently collected. Depending on the method through which the data was obtained, thorough cleaning and reformatting are performed as well. This stage is generally one of the most time-intensive because most data is unclean and unstructured, requiring extensive preparation to ensure it’s good quality and machines can learn from it.
  2. Experimentation – The scientific method is employed to define the project in a way that ensures the results are rigorous and can be verified. This includes starting with a hypothesis, exploring the statistical properties of the data, generating models, and verifying the models’ results.
  3. Distribution – Once the grunt work of the experiment is done it’s time to reflect on the results it produced. This can include compiling a comprehensive report of the project or putting together a presentation for the management team. Once management signs off on the results, they can be used to deploy the solution as a product or develop business actions.
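To make the preparation stage a little more concrete, here is a minimal sketch of what “cleaning” can look like in Python. The records, the field names, and the `clean_records` helper are all made up for illustration; real projects usually need many more rules than this.

```python
def clean_records(raw_rows):
    """Normalize hypothetical raw rows and drop the ones we cannot use."""
    cleaned = []
    for row in raw_rows:
        # Trim stray whitespace and standardize capitalization.
        name = row.get("name", "").strip().title()
        age_text = str(row.get("age", "")).strip()
        # Skip rows with a missing name or a non-numeric age.
        if not name or not age_text.isdigit():
            continue
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


# Hypothetical messy input, e.g. rows typed into a form by hand.
raw_rows = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": ""},        # missing age
    {"name": "", "age": "29"},         # missing name
    {"name": "carol", "age": " 41 "},
]
clean = clean_records(raw_rows)
```

Only the rows for Alice and Carol survive, with tidy names and integer ages. Whether to drop incomplete rows or fill them in is a judgment call that depends on the project.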

The pillars of programming and mathematics are utilized most during the preparation and experimentation phases. During that time the work revolves around using programming tools to perform calculations and automate tasks, as well as using statistical concepts and the scientific method to ensure the methods employed are valid.

The communication pillar is most prominent during the distribution phase, when scientists need to take their highly technical work and present it in layman’s terms that management and clients can understand.

Lastly, all three stages are sandwiched between the business problem and business value which is where knowledge of the domain/business comes into play. Fully understanding a business will help formulate the problem in a way that lends itself to data analysis, and will allow you to extract actionable products from your results.

Now let’s break down, in more detail, the skills of each pillar that you should strive to study.

Programming

The pillar of programming is the one that beginners tend to focus on most. This is especially true for any data science program and/or course that you take. Even our data science program at Lantern starts new students off with a course in Python.

The reason for this is that programming is considered a core necessity for modern data scientists. It’s typically one of the first things that hiring managers will filter and test candidates on. What’s more, of the four pillars introduced, programming is the easiest to teach. This makes it an ideal starting point and allows us at Lantern to incorporate topics of math and communication in a more natural manner.

The most common languages used throughout the data science field are Python, R, and sometimes JavaScript. These are known as high-level languages, which means they hide more of the machine’s details and are easier to learn. Their structure and syntax read somewhat like English, which makes the program logic easier for beginners to follow.

There are also lower-level programming languages that are sometimes used in data science. For example, many machine learning libraries (including the performance-critical parts of many Python libraries) are built using C++ or C. The benefit of these languages is that they can perform computations faster, which may be required for massive amounts of data.

Knowing low-level languages can be beneficial for pioneers in the data science space who are writing new libraries and algorithms. However, for most data scientists, especially at the start of their career, a high-level language like Python (which has many data science and machine learning libraries) is more useful to learn. Once you become intimately familiar with programming you can move on to more languages, and you’ll find that most core concepts carry over from one language to the next.

In addition to languages like Python, there are a few more languages that you will no doubt encounter in your data science journey and should pick up along the way. The first is SQL (Structured Query Language) and the second is HTML (HyperText Markup Language).

As a data scientist you will be working with data, and sometimes lots of it! While working with Excel and CSV files is no doubt the simplest option, there are better ways of storing data, namely in databases that can hold thousands of tables and millions of records. To access these records in a clear and systematic way, we can use SQL to send requests, or queries, for data and receive the corresponding tables and records.
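Here is a small sketch of what sending a query looks like from Python, using the built-in `sqlite3` module. The `customers` table and its rows are invented for the example, and an in-memory database stands in for the much larger databases you would meet on the job.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
# Hypothetical records, inserted with placeholders rather than string pasting.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Toronto"), ("Ben", "Ottawa"), ("Cy", "Toronto")],
)

# The query asks for the names of customers in one city, sorted by name.
query = "SELECT name FROM customers WHERE city = ? ORDER BY name"
toronto_names = [row[0] for row in conn.execute(query, ("Toronto",))]
conn.close()
```

The `SELECT ... WHERE ... ORDER BY` pattern is the bread and butter of SQL: say which columns you want, which rows qualify, and how to sort them, and the database does the rest.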

Given the importance of collecting, managing, and formatting data, SQL should be among the first few languages that you learn and continue to practice throughout your studies. If you’re looking for an accelerated course, Lantern offers one as part of our Data Science curriculum.

And let’s not forget HTML. HTML is a language that is used for frontend development. The frontend refers to what a typical user would see, i.e. the user interface. In this case, HTML is a language that is used to give web pages their layout and structure. It defines how elements on the page should be ordered and where the content (e.g. text, values, images) should be displayed.

While HTML is not a crucial part of studying data science, being familiar with its basic workings can unlock new areas of the workflow that make you a more functional scientist (and a more attractive candidate).

For example, there is a ton of data available on the internet. Sometimes this data is conveniently placed and easy to download (e.g. a database or CSV files). However, there is also a lot of data that is readily displayed but cannot be downloaded with a single click of a button. For this, you can write programs that use data scraping techniques to collect useful data and place it into a local database for later use. Knowing how data is stored on web pages (i.e. HTML) can make the whole process much easier to figure out.
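As a rough illustration of scraping, the sketch below pulls the cells out of an HTML table using Python’s built-in `html.parser` module. The page content is a made-up string so the example runs offline, and `TableCellParser` is a hypothetical helper; in practice you would first download the page and would likely reach for a dedicated scraping library.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:          # tolerate a cell without a row tag
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page content with made-up values.
page = """
<table>
  <tr><td>Toronto</td><td>2.9</td></tr>
  <tr><td>Ottawa</td><td>1.0</td></tr>
</table>
"""
parser = TableCellParser()
parser.feed(page)
rows = parser.rows
```

Notice that the code only works because we know the data sits inside `<tr>` and `<td>` tags. That is exactly the kind of HTML knowledge that makes scraping easier.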

Furthermore, on the other side of the workflow, we have the reporting process. Data scientists will often use reporting dashboards to display data and key information. Creating such dashboards involves frontend development and benefits from a functional understanding of HTML.

The takeaway: programming is an important aspect of the job which you should work on learning first. Start by focusing on understanding the basics of Python, SQL, and some HTML before moving on to more advanced topics.

Mathematics & Statistics

The second pillar of data science is mathematics. Like programming, this pillar is a technical one, which means it can be taught fairly easily in a classroom setting. Unlike programming, most people choosing to pursue a career in data science already have a fundamental understanding of math. Anyone who has completed a bachelor’s degree in a STEM subject will likely have been required to complete courses in calculus and statistics.

For those who take our program, or just want a quick practical refresher, we offer a course in statistics that reviews those fundamental concepts and shows how to compute models and statistics using Python.

There are three branches of mathematics that students should be familiar with, namely, calculus, statistics, and linear algebra.

Of those three, statistics is certainly the most important to have a firm grasp on. The overlap between statistics and data science is significant. It’s used throughout the analytical workflow to understand the data, create models, and validate the performance of those models. Without statistics, your results will be considered unreliable and effectively useless. Regardless of the type of data science career you wish to pursue, you should study up on your statistical knowledge.
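To give a feel for those three uses (understanding the data, creating a model, and validating it), here is a minimal sketch in plain Python. The hours and scores are invented numbers, and a simple least-squares line stands in for whatever model a real project would call for.

```python
import statistics

# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

# 1. Understand the data with summary statistics.
mean_hours = statistics.mean(hours)
mean_score = statistics.mean(scores)
score_spread = statistics.stdev(scores)  # sample standard deviation

# 2. Create a model: least-squares line, score = intercept + slope * hours.
sxy = sum((x - mean_hours) * (y - mean_score) for x, y in zip(hours, scores))
sxx = sum((x - mean_hours) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_score - slope * mean_hours

# 3. Validate the model: R^2 is the share of variance the line explains.
predicted = [intercept + slope * x for x in hours]
ss_res = sum((y - p) ** 2 for y, p in zip(scores, predicted))
ss_tot = sum((y - mean_score) ** 2 for y in scores)
r_squared = 1 - ss_res / ss_tot
```

An R² close to 1 on the same data you fitted is only a first sanity check. A proper validation would test the model on data it has not seen.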

The other two branches, calculus and linear algebra, are rooted in the modelling and analysis aspects of the field. Having an understanding of the basics will help you make sense of the underlying mathematics of various techniques and machine learning algorithms that are used.
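As one small illustration of where both branches show up, the sketch below fits a one-parameter model with gradient descent. The derivative of the error comes from calculus, and it is computed with a dot product from linear algebra. The data, the learning rate, and the helper names `dot` and `fit_slope` are assumptions chosen for the example, not a recipe from any particular library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit_slope(xs, ys, learning_rate=0.05, steps=50):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x - y for x, y in zip(xs, ys)]
        # d/dw of mean((w*x - y)^2) equals (2/n) * dot(xs, residuals).
        gradient = 2.0 / n * dot(xs, residuals)
        # Step against the gradient to reduce the error.
        w -= learning_rate * gradient
    return w


# Hypothetical data that follows y = 2x exactly.
w = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Machine learning libraries do this with far more parameters at once, but the idea is the same: compute a gradient, take a small step, and repeat.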

Communication

Our third pillar is communication. Effective communication is important in any workplace and data science is no different. Being able to communicate with your team as you work through a project is part of the norm. As mentioned, however, near the end of the project workflow it becomes crucial for technical data scientists to be able to summarize their work and findings in a non-technical way that any layperson can understand. The end goal is generally to create an actionable plan that will create value for the business. If you cannot communicate clearly, management will find it difficult to take action because they will not understand the benefits, costs, or risks of the proposal.

Communication is one of the major components that I find students overlooking most often. They might write effective code and produce good analytical results but not provide any indication of their thoughts, process, or insights they find from the data.

It’s not a skill that can be taught in the span of an hour, but rather something that needs to be practiced through trial and error. It’s something that we try to build in our programming courses as well so that we can provide feedback to students on what needs to be made clearer. The best way to practice is to summarize your analysis and then have a less technical person review it. If they can follow along with the logic and process that you used then you’re on the right track.

Example dashboard that a data science team might create and use.

Business Expertise

Our final pillar is the one about business and/or domain expertise. While academic researchers don’t need to concern themselves with the business applications of their findings, data scientists do. In fact, having domain and business knowledge can go a long way when applying to positions.

Data scientists are hired by companies to create products and provide actionable insights. Being familiar with the area that a company operates in (e.g. the medical field, advertising, or finance) will come in useful as you work through a project. It will help you take a business problem and define a useful data project for it. It will help you build the various analytics and models in a way that aligns with the business goals. And it will help you convert your results into actionable items that will add value to the business.

Building that domain/business expertise that makes an effective data scientist is something that is done outside of the classroom. This is because data scientists can work in such a wide range of industries that each student has different needs, making it hard to teach in a classroom setting. This type of knowledge is typically built from previous experience (e.g. studies or work) or gained as you progress through your data science career.


Are you serious about starting a career in data science, or do you just want to pick up a few new skills? Be sure to check out Lantern’s programs and courses to see if they’re right for you.

With the continuous growth and development of the field, it’s never a bad time to enter the industry!
