9 Phase 5: Exploration, execution, pivot
At this point, you have done your homework. You have explored the project’s feasibility, you have scoped the work and you have put together a project plan that resonates with the project owner, usually a client (for contract work) or a manager (for internal projects). (For the sake of simplicity, we will refer to this person as the “client” throughout this chapter.) You have determined that the proposed work makes sense in terms of feasibility and business value to your stakeholders, and you have decided that it passes muster in a larger, contextual sense. You have prepared, worked hard and thought about what lies ahead of you. It’s time to begin the implementation phase!
However, the fact that the project has begun does not mean that you will have no more questions or no further need to tap into the knowledge of subject matter experts. It is, therefore, essential that the stakeholders involved in the project remain available for question-and-answer sessions. Questions will range from data clarification to issues around industry knowledge and how the end user will use the product. The more the stakeholders remain involved, the more input they will have along the way and the happier they will be with the result. We recommend scheduling time slots that together add up to at least half a day per week for clarifying meetings.
At this point, we would like to address a common misconception: that developers and data scientists need to have domain expertise to solve domain-specific problems. While we do find that familiarity with a specific industry can certainly help a data science project get started, we have rarely seen that industry knowledge is essential. We would further argue that a lack of domain expertise can be a positive thing – it allows one to view a problem in a way that is unencumbered by prior assumptions and biases. Naturally, anyone working in a new field will have a lot to learn and may get off to a slow start. Digital marketing, for instance, has a nuanced vernacular, with terms such as “reach”, “engagement” and “conversion” meaning very specific things in specific contexts. But in our experience, as long as subject matter experts are available to help answer questions, prior industry knowledge is not essential.
In thinking about our projects and how we have worked, we find that project execution generally has four main stages: research, prototyping, building and evaluation. Naturally, every project is different and has different needs, but our projects have rarely failed to touch on each of these stages. The movement between them is not always linear or ordered, and your project may not flow smoothly through them, but we find it useful to think about how your work fits into these stages.
As a rule of thumb, we normally spend 25% of our time researching, 25% on prototyping and the remaining 50% of our time iterating between building and evaluating (which tend to go hand-in-hand). However, it should be noted that some projects are, by definition, experimental, with a “proof-of-concept” (POC) as the objective. In such projects, research and prototyping carry a bit more weight. However, in most cases, these projects do require substantial documentation, reporting and knowledge transfer (which we discuss in Chapter 10, Phase 6). Thus, the amount of time you spend on developing your POC may increase at the expense of building an optimised solution, but you should leave plenty of time to wrap up your work. Often this wrap-up work is more time-consuming than building the solution itself, and you should budget your time accordingly.
During the early stages of this phase, you will likely be focusing on two questions: “Is the goal of the project possible?” and “How should I get to it?” These loosely correspond to the research and prototyping stages of a project, respectively. The outcome of this work can be thought of as a POC. Naturally, these questions are related: you can’t judge whether a goal is possible without having some idea of how to get there. Often you have several possible approaches to choose from; your aim should be to identify the best of those possibilities, test it on your data and determine if it is, indeed, a sensible approach to build and develop further. Similarly, building and evaluation often go hand-in-hand, as we discuss in more detail below. At the end of this chapter we have included a section entitled “Go to client”, which is less of a stage and more of a call for frequent check-ins throughout the course of the engagement.
9.1 Research
Nearly every project includes periods of research, focused on taking an in-depth look into the problem at hand. This includes considering different ways to frame the question and weighing the potential approaches to answer it. Almost always this will include a visit to your favourite search engine. Many data scientists will also have a library of resources they can turn to, which could include books, blogs, articles, colleagues, mentors or even tweets. We recommend that you build your own such library of favourite resources, and organise them in a way that makes intuitive sense to you and that you can quickly access when needed. For example, we make heavy use of bookmarks in our favourite browser, being sure to organise them by topic and update them from time to time.
We all learn in different ways, but you should bear in mind that talking to an expert is often the most efficient way to get information; the time you save can be put towards other activities that could add value to the project. Naturally, you will need to have a network of people you can turn to, and this is not something that can be built quickly. Your network of colleagues will likely be one of your most valuable assets, and you should not underestimate how important it is to build, nurture and contribute to it. Please see Chapter 13 and the Build a network section of Chapter 12 for advice on how to build your professional network.
While gathering information from experts can go a long way towards understanding the task at hand, it is important to do your own research as well. Unless you are very knowledgeable, you may have to do some homework to find out how industry leaders solve problems similar to those you are facing. While you may not need to use state-of-the-art technology (and very often you won’t), it is good to know what benefits the most cutting-edge technologies offer, as well as what costs they incur. Indeed, every solution will have pros and cons, and it is your task to make an informed decision. The time you spend researching will translate into substantial improvements in efficiency. Consider whether you can find research papers describing or comparing similar approaches and whether software packages already exist with basic implementations of the algorithms you need. Utilising what is available lets you invest your time in creating something new and maybe even allows you to give back to the community. For instance, you may end up extending someone else’s open-source software. If so, this can be a good opportunity to contribute a pull/merge request – most developers welcome suggestions (though you should look for contributing guidelines before creating one). Did you end up comparing different techniques? Publish a paper or a blog post about it. Others will appreciate the information and you will get your name out into the wider community.
Getting up to speed in an unfamiliar domain can be slow and, at times, a little overwhelming. Often the work requires a fairly comprehensive understanding of the field or potential scientific approaches before you can build a solution. While this may seem daunting, you should remember that when you are new to a data science project, you have a window of time in which you’re allowed to be ignorant. In other words, it’s acceptable to not know the field because you’re new to it. That empowers you to ask questions, even ones that you worry may be “stupid”. This window of opportunity will not be open forever, so use it when you’re starting and don’t be self-conscious about it!
When immersing yourself in a new field, it is good to be aware of biases and industry knowledge that may be accepted as general knowledge but that you may lack. A good example of this is the financial sector and the stock market. When predicting stock prices, most people are aware that they need to do backtesting to test for statistical significance. Without any further financial knowledge, you might simply think that if your model has significant predictive power you will likely make money on the stock market. Many amateur algorithmic traders have found out from experience that reality is not that simple – they lacked the financial knowledge to see the pitfalls in their analysis, and so their algorithms failed.
9.2 Prototyping
Once you have researched possible approaches to your project’s goal, it’s time to choose the best candidate and see if it will work. This is the goal of prototyping. You may have found through your research that there is a single, clear best path. In that case, you may be able to quickly create a minimal prototype and move on to building directly. In other cases, the situation is more uncertain; you will have to test the possible approaches empirically on your data to make an informed decision about which seems best based on the evidence you gather. In either case, we recommend building at least a very simple prototype for your project so that you can be confident that what you are going to build out in the next stage is, in fact, a viable solution.
Once you have chosen an approach that you believe will work, you can move on to the building stage. (However, before doing so, it’s a good idea to check in with your client, as we discuss below.)
On the other hand, your prototyping efforts may give you unwelcome news. For example, you may find that your top choice candidate approach does not do well with your data. In this case, we’re afraid you will have to start the research again, re-evaluating if you still think that the result is possible and exploring another candidate approach for prototyping. This is frustrating but important – research can be slow and you may have to be patient. We all want to start building a solution, but building the wrong solution will be costly! Be sure to use your research time wisely so that you gather accurate, reliable information and make well-informed decisions.
By the end of this stage, you should have decided on a course of action, selected the algorithms you intend to use, demonstrated that the chosen solution is going to work and started iterating on putting it together into a coherent product. As a general guide, you should be about half-way through the hands-on phase of the project. If you don’t have a working prototype by this point, it’s time to think carefully about whether you will be able to deliver the expected outcome. The need to discuss the state of the project is especially important in this situation: while it may not be a conversation you want to have, waiting to communicate the state of a project that may be in trouble will usually make the situation worse, not better.
9.3 Build, assess, rinse and repeat
For many data scientists (your authors included) the building stage is where the fun happens. Here you focus on developing your prototype further to create a product that meets – and sometimes exceeds – your project’s objectives. The objective could be anything from a great model that runs locally, to a full-fledged solution that scales in production.
If you have entered this phase with a respectable prototype, you will most likely start the build by trying to improve upon your results. Often this comes down to expanding your dataset and using scientific creativity to devise new ways of using it. Correspondingly, feature engineering is often a big part of the building process. If you have identified shortcomings in the data you started with, you may want to look for ways to bring in other data. Sometimes this involves external datasets that are publicly available. For example, the Index of Multiple Deprivation can be a very useful dataset to include in your analysis when trying to find relationships between UK geography and patterns in human behaviour.
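To make this concrete, here is a minimal sketch of joining an external dataset onto your own data with pandas. The file names and column names are hypothetical placeholders; the real Index of Multiple Deprivation data is keyed by small geographic areas (LSOAs), so the join column in your data would need to match whatever geography you have.

```python
import pandas as pd

# Hypothetical file and column names - placeholders for your own data.
customers = pd.read_csv("customers.csv")   # assumed to contain an "lsoa_code" column
imd = pd.read_csv("imd_2019.csv")          # assumed IMD extract with one row per LSOA

# Join the deprivation score and decile onto each record by geographic area.
enriched = customers.merge(
    imd[["lsoa_code", "imd_score", "imd_decile"]],
    on="lsoa_code",
    how="left",
)

# The new columns can now be explored and used as model features.
print(enriched[["lsoa_code", "imd_score", "imd_decile"]].head())
```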
Building is iterative by nature. As you build, you will try to make improvements to your models and data pipelines. This process is inherently experimental – some changes you make will result in improvements, and some will not. Similarly, some improvements will come at a cost that is prohibitive to the final product. For example, a model that uses a deep neural network may give you a 5% improvement in model performance over a logistic regression model, but it may come at the expense of speed or interpretability. You will likely have to make strategic choices throughout, and we encourage you, where possible, to include your client in these decisions (see Go to client below).
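When weighing such trade-offs, it helps to measure both sides explicitly rather than rely on intuition. The sketch below, using synthetic data and scikit-learn, compares a simple and a more complex model on both predictive performance and prediction latency; the specific models and numbers are illustrative, not a recommendation.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice you would use your project's dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()
    predictions = model.predict(X_test)
    latency_ms = (time.perf_counter() - start) * 1000
    # Record both sides of the trade-off: quality and cost.
    print(f"{name}: F1 = {f1_score(y_test, predictions):.3f}, "
          f"prediction time = {latency_ms:.1f} ms")
```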
This iterative approach applies not only to the methodology underpinning the analysis, but also to the code that underlies it. A good example of this is in code optimisation: we all want fast code, but trying to get the code logic working while simultaneously optimising the code can be difficult. Our advice is to get it working first, then worry about making it fast and robust. Often framing your coding work in the context of building a software package or library can help: the process of building a package/library inherently forces you to write better code that is more robust and efficient. Similarly, testing your code rigorously will force it to be better and less brittle.
As you are building and testing the output, you will undoubtedly explore various possible methods and assess their strengths and weaknesses as you refine the work. But how should you make these comparisons – what should you take into account when deciding on how to proceed? Assessment is a key part of this stage that can help ensure that you are making wise choices.
When many non-specialists think of assessment, the word “accuracy” comes to mind. Experienced data scientists often avoid this term, as it refers to a very specific metric that is seldom meaningful for evaluating model performance. You should consider other metrics, such as precision, recall or F1 score. However, assessment extends far beyond model performance; many factors will dictate what the right choice is for your project. For example, you may need a highly interpretable model, or you may need something fast. Exactly what is important for your project can vary, and you should keep this in mind throughout this phase. How you assess your work should not be an afterthought – on the contrary, you should think about how you intend to measure performance from the outset.
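A small illustration of why accuracy alone can mislead: on an imbalanced problem, a model that never predicts the rare class can still score highly on accuracy while being useless. This sketch uses hypothetical labels and scikit-learn’s metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for a rare-event problem: 5 positives among 100 cases.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100   # a "model" that always predicts the majority class

print("accuracy: ", accuracy_score(y_true, y_pred))                    # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall:   ", recall_score(y_true, y_pred))                      # 0.0
print("F1:       ", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```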
Having a meaningful way to assess your work will also be important for benchmarking. Benchmarks are a handy way to set a standard against which you can compare your model’s performance. We encourage you to set benchmarks early in your project and to set aside time to develop them. Knowing what the previous standard was gives you a demonstrable way to show the added value of your work. Your final results should always be quantifiable, so set yourself up for success by defining your KPIs and measuring your final success against them.
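One simple way to establish an early benchmark is to score a naive baseline on the same data and metric you will use for your real model. The sketch below uses synthetic data and scikit-learn’s DummyClassifier purely for illustration; in practice the benchmark might instead be the client’s existing process or a previous model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data for illustration.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Benchmark: a naive model that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A stand-in for your actual model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline F1:", f1_score(y_test, baseline.predict(X_test), zero_division=0))
print("model F1:   ", f1_score(y_test, model.predict(X_test)))
```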
Understanding what’s going on inside your model
A common notion is that artificial intelligence is complicated and cannot be understood. It is often treated as a black box with a magical outcome. Developers who use machine learning libraries without fully understanding them often create or support this misconception. This viewpoint, however, originates from a lack of knowledge. Most machine learning libraries are open source, and a skilled data scientist will be able to dig into a model’s internals and understand or alter the results it produces.
At their core, machine learning models are mathematical models – data is manipulated through a series of mathematical operations. These operations are not random but should be well-chosen to produce favourable results. Understanding this process lets you understand the limitations and the overall behaviour of the model. Once this is properly understood, one can make deliberate changes that improve the model overall, remove bias or simply handle outliers that had been previously neglected.
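As a brief illustration of digging into a model rather than treating it as a black box, the sketch below fits a logistic regression with scikit-learn on a bundled example dataset and inspects its learned weights, which are simply the coefficients of a weighted sum passed through a sigmoid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a simple, interpretable model on a bundled example dataset.
data = load_breast_cancer()
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(data.data, data.target)

# Under the hood this model is just a weighted sum passed through a sigmoid:
# p = 1 / (1 + exp(-(w . x + b))). The learned weights tell you which
# features drive the prediction and in which direction.
model = pipeline.named_steps["logisticregression"]
weights = sorted(zip(data.feature_names, model.coef_[0]),
                 key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in weights[:5]:
    print(f"{name:25s} {weight:+.2f}")
```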
What is open source?
Open-source software is any computer software distributed with its source code available for modification. That usually means it includes a licence that allows programmers to change the software in any way they choose: they can fix bugs, improve functions or adapt the software to suit their own needs.
As a side note, it is important to be conscientious when choosing your machine learning toolkit. If you use closed and proprietary systems like those provided by IBM, Google and Amazon, be aware that you will lose visibility and the ability to understand the system fully.
Testing
As with software development, data science projects should be rigorously tested to ensure model performance. Create test cases and make sure your model adheres to them. Later, if you need to update the model with new data, these test cases should still hold, making any update more robust. Testing is somewhat of an art form, and it takes a lot of experience and practice to become good at devising a sensible, rigorous testing scheme. Furthermore, the notion of “unit testing” is not entirely sufficient in data science. Nonetheless, it’s good to have an understanding of testing principles.
If you or a colleague have made changes to some code and want to replace the old version, how do you know the changes have not introduced any problems (let alone that they are improvements)? For software, unit tests help ensure this, along with code reviews. For data science work, we have found it useful to recruit reviewers to inspect results from experiments.
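As a rough illustration of what testing a model’s behaviour might look like, here is a minimal pytest sketch. The fixture trains a small model on synthetic data purely so the example is self-contained; in a real project you would load your trained model from disk instead, and the specific checks would come from your own requirements.

```python
# test_model.py - run with `pytest`.
import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


@pytest.fixture(scope="module")
def model():
    # Stand-in for loading your real trained model.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X, y)


def test_predicted_probabilities_are_valid(model):
    X = np.random.default_rng(0).normal(size=(10, 20))
    probabilities = model.predict_proba(X)[:, 1]
    assert np.all((probabilities >= 0) & (probabilities <= 1))


def test_prediction_shape_matches_input(model):
    X = np.zeros((5, 20))
    assert model.predict(X).shape == (5,)
```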
Another common use case for unit testing in software is to ensure things work properly before automatic deployment to a production environment. For machine learning models in a production environment, you can run code that retrains your model on new data automatically and redeploys your updated model to production. In this situation, you may want the deployment to fail if your updated model fails to meet certain conditions. To handle this, abort the deployment if your retraining script throws an error. This allows you to build any gatekeeping checks you wish while ensuring consistent quality.
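A minimal sketch of such a gatekeeping check, assuming a hypothetical threshold and using synthetic data as a stand-in for the retraining dataset: the script raises an error when the retrained model falls below the agreed metric, which causes an automated deployment pipeline to stop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical quality gate; the threshold would come from your project's KPIs.
F1_THRESHOLD = 0.80


def retrain_and_validate():
    # Synthetic stand-in for the fresh data you would retrain on.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))

    # Raising an error here makes the surrounding deployment pipeline fail,
    # so a degraded model never reaches production.
    if score < F1_THRESHOLD:
        raise RuntimeError(
            f"Retrained model F1 {score:.3f} is below threshold {F1_THRESHOLD}"
        )
    return model


if __name__ == "__main__":
    retrain_and_validate()
```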
A note on notes
Any experienced researcher will know that a critical part of the work lies in keeping detailed records. This is no different for the data scientist – keeping track of results, as well as detailed records of assumptions and choices that you make, is essential. Aside from the fact that this is scientific best-practice, it also makes practical sense: it is not uncommon to examine your final results and realise that something doesn’t make intuitive sense. In this case, you will have to do some detective work to figure out why. Good notes that include information about the assumptions and potential errors you made along the way will make that task a lot easier.
In a recent episode of one of our favourite podcasts, Not So Standard Deviations, host Roger Peng and guest Jenna Krall discussed how assumptions and undocumented workflows can impact data science research. If you are interested in learning more about this topic, Project TIER is a great place to start.
While saying that you should take good notes may seem obvious in principle, it is very easy to fall short in practice. Data science projects can move fast, and we all get caught up in the excitement of discovery or writing code that works well. It’s good to be excited – the work is exciting – but don’t let yourself forget to be a good scientist in the process.
9.4 Evaluate
As you are building, it’s a good idea to keep in mind the four levels of project evaluation we outlined in Chapter 2. Recall that we described them as follows:
- The process level is focused on the actions taken towards producing deliverables.
- The product level is concerned with the deliverables themselves and whether they meet the technical requirements of the project.
- The business level describes how well the project brings value to your client.
- The contextual level is the most abstract and relates to the circumstances surrounding a project and the externalities that affect it.
We encourage you to revisit these levels often throughout your project. During the Build stage, you will mostly focus on the process and product levels. But it is also important to not lose sight of the higher business and contextual levels. When you have finished building and are ready to hand over your work to your client, consider how well your project, as a whole, has satisfied these higher, more abstract levels of evaluation. If you feel that you have done a good job on all four levels, then you can be reasonably confident that you have designed, executed and delivered something of value.
9.5 Go to client
Your client should be an active participant throughout your project’s lifetime, providing valuable input and feedback that can help keep the project on a successful trajectory. We recommend engaging with your client frequently, if only for brief updates: it will give them the reassurance that comes from having a better understanding of what you’re doing and why you’re doing it, and it will help you to ensure that your decisions and plans are aligned with their expectations.
It is interesting to note that some large organisations have recently started seating AI departments in close proximity to the CEO and other decision-makers. This is driven by anecdotal findings that casual interactions with AI practitioners can increase AI-favourable decision-making. While this observation may not stand up to the statistical rigour needed for causal inference, the takeaway is still valuable: increasing your interactions with decision-makers will likely be beneficial to both you and them.
9.5.1 The client’s role
Sometimes clients or managers mistakenly believe that their role is to give you their requirements, data and infrastructure so that you can go away and silently build a product that is exactly what they imagined. Naturally, this is misguided: data science projects are complex and involve many decisions. While you may know the dataset and problem well, you will not be in a position to understand the business case as well as the project owner or other stakeholders. Some decisions are yours to make, but the responsibility for others lies squarely with the client. You should not try to make these decisions on your own. If your client resists, you should make every attempt to help them understand that data science projects require an iterative process and that the more feedback you can get from them, the better the final product will meet their requirements.
9.5.2 Your relationship with your client
Some find client interactions to be difficult and stressful; even the easiest of these relationships can become strained when a project does not go according to plan. One of the most common barriers to building a good relationship with your client is face time: people are busy and you might feel uncomfortable asking for their time. As mentioned above, you may find that your client does not want to be involved, or thinks that they don’t have much to contribute. Quite the contrary is true in reality: their involvement is crucial for project success. We recommend you emphasise this point early in your working relationship and make sure that your client knows the expectations you have for their involvement.
However, too much client involvement can also be detrimental. Some clients can be overbearing and may want to be more engaged than is helpful. They may have short attention spans, changing the aims of the project so often that your project trajectory looks more like a tangle of yarn than a linear progression. They may even feel that they know better than you what is best for the project; while this may be true in some cases, and you should keep an open mind to the possibility, more often than not it stems from a lack of understanding and a great deal of nervousness.
If you have followed our advice until now, you should have established a working relationship with your client already. However, during this phase of the project, the work takes on a different form and the way you and your client interact may also change. For this reason, we encourage you to be explicit with them about when and how you will meet. In other words, set clear boundaries. For example, agree on when you will give updates so that you don’t spend too much time in meetings. Establishing such things early on will go a long way towards building a solid, trusting relationship.
9.5.3 Planning and running meetings
Once you have convinced stakeholders of the importance of their involvement, you should plan regular meetings with them. To keep these meetings going, you need to show the stakeholders that the meetings are valuable to them, so use the time provided effectively. You can do this by planning the meeting and taking ownership of the agenda. It is a good idea to prepare slides with any findings you have uncovered, visual representations of what you are working on or simply a list of questions to go through. People tend to be able to focus better on the problem at hand when they have something to look at.
9.5.4 Building a good client relationship
Your client or manager will have a vision that they need to realise. Your job is to transform that vision into something tangible. Their vision may not be exactly what you end up delivering; in some ways it might be better and in other ways worse. Data science projects are full of trade-offs, so you and your client will invariably have to make choices from time to time. A common hurdle is that your client will likely not understand the technical challenges and the trade-offs that you make. It is essential, therefore, that you get buy-in for the decisions that must be made. This is not always easy; we suggest you emphasise how your decisions affect the outcome of the project in terms of tangible results and business value – the things your client cares most about.
It’s important to bear in mind that you and your client have different vocabularies. This is a surprisingly common obstacle to building a trusting relationship because it impedes communication and, in the worst cases, can lead to different understandings of project objectives. Often the onus is on you, the data scientist, to bridge this gap by adapting to their terms and terminology. You should not be shy about asking stakeholders to define or clarify terminology so that you both understand one another. Don’t assume that you can necessarily look up terms you don’t know later or that you and your counterpart are using the same definition of a word. If you feel there may be room for ambiguity, it’s important to clarify this from the start. Make notes of these definitions and use them when you talk to the project stakeholders.
Of course, your relationship with your client is a two-way street, so you should also be prepared to share some of your knowledge with other project stakeholders. Most clients will welcome this (within limits). However, we urge you to bear in mind that ignorance can be disorienting and disempowering, especially for a client. It’s important to be empathetic to this and to avoid being arrogant or patronising. Instead, focus on the collaborative aspect and choose your words in a way that conveys that you can learn from each other. Everyone has different knowledge, and knowledge sharing is, in general, very powerful. The more people understand how machine learning works, the more they will understand the challenges you face and how their knowledge can improve the outcome.
An important tool to use when speaking to your client is repetition. Repeating what you have heard in your own words can help build trust because it shows that you have understood the problem and that this understanding is important to you. For instance, a good habit can be to describe the business problem you are solving with the exact words your client has used. This shows that you have understood, it gives your client an opportunity to clarify or correct and, by using their language, it reduces their cognitive load. Again, a lot of these tricks come down to empathy for your client.
When dealing with other people, it is important to remain kind, positive and in control of your emotions, especially when situations become difficult. Dealing with highly demanding clients can be extremely frustrating and stressful. It can be good to remember that the situation is probably more stressful for them than it is for you. Management also has targets to meet, and they are ultimately responsible for the success of the project even though the execution is largely out of their hands. Empathise with them and make them feel in control by involving them in the process as much as possible. It helps to always remain positive but realistic when speaking about your project, as this gives hope without overselling your deliverables.
Whenever problems do occur, don’t avoid them; they don’t go away just because you don’t discuss them. This is the main reason why projects fail: challenges aren’t communicated or addressed early on. Rather than seeing it as a problem, tackle it as a challenge: make a plan and present this new plan in a positive light. Never go to your client without a plan, expecting them to help you figure out how your project can be saved. Your client needs results; if the results they are expecting are impossible to obtain, make a Plan B and work out how you can still provide value. Your client wants to know that you are the best choice to work with and that you can overcome challenges and tackle them effectively. Tackling problems can allow you to show your true creativity, ensuring that you remain their first choice for future projects.
9.6 Summary
In this chapter we have discussed the meat of a data science project, where you execute the development of a solution and build a product. It includes processes such as researching various approaches, experimenting with prototypes, choosing a single best option and building it into a well-developed product that aligns with the business and project objectives. For most projects, this is an iterative process of one form or another. For example, during prototyping, you will likely iterate through various possible approaches. Similarly, as you build, you will test different ways to optimise your model and your final product, using carefully selected assessment criteria to guide your decisions. Not all of these attempts will yield improvements, but the overall process should move you towards a final solution that is the best you can create given your constraints.
This raises an important point: you will almost always be limited in the resources that are available to you – time, money, compute power, etc – so you will often have to be pragmatic about how you best use the resources you have. It’s often a strategic choice; you simply have to do the best with the resources you have. In the wise words of Gandalf:
“All we have to decide is what to do with the time that is given us.” (Tolkien 1954)
As a data scientist, you are your client’s guide and are responsible for seeing the project through to completion. While this is not always an easy job, if you define success correctly, plan, execute and communicate effectively then you will give yourself the greatest chance to succeed.