⚠️ Anomaly Detection 🚨 AMONG US

Ben Roshan
Published in Analytics Vidhya
11 min read · Feb 23, 2021

1. Introduction

On an important mission to Mars, Red, Blue, Green, Pink, Orange, Yellow, Black, White, Purple, Brown, Cyan and Lime boarded the spaceship. On their way to explore Mars, the spaceship suffers a series of internal component failures in the navigation and flight parts. It was time for the crew to repair the issues. But there was one Imposter, one Betrayer, one Anomaly among them, and to identify who that is, a set of tasks has to be performed on the data to reveal the imposter in the dataset.

What this project really focuses on is that we are witnessing many real-time cases of abnormal data that pose a serious threat to businesses in IT, health and various other sectors. Even though cyber security teams are working to figure out anomalous behaviour in transactions, the systems built using algorithms are not efficient enough to capture all anomalies. Millions are lost to cyber attacks. They affect not only business revenue but also the firm’s reputation and the trust of those doing business with it.

Where to find the Imposter?

To find the imposter in our spaceship, we use the Numenta Anomaly Benchmark (NAB), from which we take speed_6005, a dataset with 2500 rows of speed readings from a specific sensor in the spaceship.

CSV name: speed_6005.csv

The crew will analyse the dataset above with time-series visualizations and perform analysis to detect the anomalous records and thereby capture the imposter. These are crucial records which can help identify suspicious speeds recorded by the sensors.

You can also get this code on my GitHub page.

Tasks in Spaceship

  • Getting and tweaking the data
  • Visualization of data and Emergency meetings
  • Building model to trace anomalies
      • using Isolation Forest
      • using Facebook Prophet

2. Welcome the Libraries

3. Getting the data

import pandas as pd

#Reading the data
imposter = pd.read_csv('../input/nab/realTraffic/realTraffic/speed_6005.csv')

#Viewing the data shape and head
print(f'speed_6005.csv : {imposter.shape}')
imposter.head()

According to the dataset information, it has the following features:

  • timestamp: the date and time at which the speed reading was recorded
  • value: the speed recorded by the specific sensor

📌Blue: “Take note that the speed (value) doesn’t have any units, and the metadata doesn’t have any information on that either.”

imposter.info()

From the information we can identify that

  • We don’t have any null records in the dataset. BAM !
  • timestamp column is an object data type. small bam!

4. Tweaking of data

Changing the datatype of timestamp

📌Red is angry again to see the timestamp being an object. “We are dealing with time, God damn it, every second is precious to fix our ship,” he says. Since our timestamp variable is of the ‘object’ datatype, we need to convert it into datetime format.

#Converting timestamp object to datetime
imposter['timestamp'] = pd.to_datetime(imposter['timestamp'])

#Check for the change
imposter.info()

Date information

📌Lime: “Crew! We can extract a lot of data from the timestamp, like year, month, day, hour and weekday. This will help us reveal a lot of information from the data. We gotta look for all possible ways to find the imposter among us.”
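
What Lime describes can be sketched with the pandas `.dt` accessor. This is a minimal example on a toy frame, assuming the timestamp column is already a datetime; the new column names (`year`, `month`, `day`, `hour`, `weekday`) are illustrative and not taken from the original notebook.

```python
import pandas as pd

# Toy stand-in for the sensor data; the real frame comes from speed_6005.csv.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-08-31 18:22:00", "2015-09-05 02:10:00"]),
    "value": [90, 72],
})

# The .dt accessor exposes the calendar parts of a datetime column.
toy["year"] = toy["timestamp"].dt.year
toy["month"] = toy["timestamp"].dt.month
toy["day"] = toy["timestamp"].dt.day
toy["hour"] = toy["timestamp"].dt.hour
toy["weekday"] = toy["timestamp"].dt.day_name()
```

Each extracted part can then serve as a grouping key for the checks in the emergency meetings.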

Renaming column

📌Black: “Since I don’t have many tasks, let’s rename the columns to help the other crewmates or any non-technical crew member understand the features. Here I change the column timestamp -> Datetime, since timestamp feels more like jargon.”

#Renaming the columns to simple words
imposter.rename(columns = {'timestamp':'Datetime'}, inplace = True)

5. Emergency Meetings !!!!

General pre-assumptions

📌Blue: “Let’s put forth our work status and check each hypothesis by looking at the visualizations.”

  • Pink: “I was working on increasing speed during the weekends at medbay”
  • Orange: “I was working on increasing speed during holiday months such as December and January at admin”
  • Yellow: “I was working on increasing speed during the late night hours at storage”
  • Cyan: “I proposed a strategy to form a seasonality across 2015 for speed at shields”
  • Red: “I worked really hard during Sep 4- Sep 10 to fix our ship at reactor”

📌 Please refer to my Kaggle notebook for the code to all viz

Overview of time series data

Let’s take the timestamp as the x axis, plot the values, check whether the data has the characteristics of a time series, and test it against our assumptions.

Discussion:

  • Purple: “We don’t have data for the entire 2015; instead we have only one month (September, 17 days), and it doesn’t exhibit seasonality. We can reject our 4th assumption, and since Cyan said it’s true I feel Cyan is sus.”
  • Brown: “Let’s straight away reject the 2nd assumption as we don’t have enough data to prove it, ORANGE IS SUS MAX”

Actual Insights:

  • The speed across 17 days exhibits only stationarity and not seasonality
  • The drop seen later in the series is huge compared to the drop that happened in the initial days of September
  • Even though the series follows the same pattern overall, you can visually see the speed staying the same during Sep 4 to Sep 8. Could it be an ANOMALY?

Histogram and Scatter on datetime

Let’s plot a combined chart. If you wanna find some imposters in our data, scatter and box plots are the best.

Actual Insights

  • We have only one data point from Aug, so we can’t consider the left bar. In the September bar we can see the average speed was around 81.9
  • We can see 3 points at the end of Sep which look like outliers, but that doesn’t mean they are anomalies
  • We can also notice that no speed was recorded in mid Sep. Could it be a shutdown? That might even invite cyber attacks on our spaceship

Which hour and day of the month had high speed?

Let’s use the altair library to plot a beautiful heatmap which can help us identify at which hour and on which day of the month the speed was higher.
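
The table behind such a heatmap can be sketched with a pandas pivot. This is a minimal example on toy data, not the notebook’s altair code; the helper columns `day` and `hour` are assumed names.

```python
import pandas as pd

# Toy readings; the real values come from speed_6005.csv.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2015-09-01 08:00", "2015-09-01 08:30",
        "2015-09-01 23:00", "2015-09-02 08:00",
    ]),
    "value": [80, 90, 60, 70],
})
toy["day"] = toy["Datetime"].dt.day
toy["hour"] = toy["Datetime"].dt.hour

# Rows are hours, columns are days, each cell holds the mean speed.
heat = toy.pivot_table(index="hour", columns="day", values="value", aggfunc="mean")
```

Empty cells (NaN) mark hour and day combinations with no readings, which a heatmap would typically leave blank.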

Discussion:

  • Black: “We can clearly see there is not much speed during late night hours compared to morning hours — our 3rd assumption is false, and since Yellow said it’s true I feel yellow is sus”

Actual Insights

  • Isn’t it cool to find that all shutdowns of the sensors happened after 12?
  • We can also see the recordings run from Aug 31, 6 pm to Sep 17, 6 pm; on those days no speed is recorded during the rest of the hours
  • We can also notice 6 shutdowns of sensors happening between the hours. Are those anomalies? Let’s find out
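
One hedged way to count such shutdowns without squinting at the heatmap is to look at the time between consecutive readings. The sketch below uses a toy frame, and the one-hour cut-off (`max_gap`) is an assumption chosen for illustration, not a value from the original notebook.

```python
import pandas as pd

# Toy readings with one long silence between 10:10 and 13:30.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2015-09-01 10:00", "2015-09-01 10:10",
        "2015-09-01 13:30", "2015-09-01 13:40",
    ]),
    "value": [80, 82, 79, 81],
})

# Time elapsed since the previous reading (NaT for the first row).
toy = toy.sort_values("Datetime").reset_index(drop=True)
toy["gap"] = toy["Datetime"].diff()

# Assumption: a silence longer than one hour counts as a shutdown.
max_gap = pd.Timedelta(hours=1)
shutdowns = toy[toy["gap"] > max_gap]
```

Each row in `shutdowns` is the first reading after a silence, so its `Datetime` marks when the sensor came back.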

Behaviour during weekend

Let’s check our last remaining assumption: whether there is a rise in speed during weekends.
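
A minimal sketch of this check on toy data is shown below. It assumes the weekend means Saturday and Sunday only; the column names `weekday` and `day_type` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2015-09-04 09:00",  # Friday
        "2015-09-05 09:00",  # Saturday
        "2015-09-06 09:00",  # Sunday
        "2015-09-09 09:00",  # Wednesday
    ]),
    "value": [95, 88, 70, 72],
})

toy["weekday"] = toy["Datetime"].dt.day_name()
# dayofweek counts Monday as 0, so Saturday is 5 and Sunday is 6.
is_weekend = toy["Datetime"].dt.dayofweek >= 5
toy["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Mean speed per day name and per day type.
by_day = toy.groupby("weekday")["value"].mean()
by_type = toy.groupby("day_type")["value"].mean()
```

Whether Friday counts as part of the weekend changes the answer, so the per-day means in `by_day` are worth reading alongside the two-way split.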

Discussion:
White: “The highest speed was recorded at the start of the weekend, on Friday. But speeds were low on Wednesday and Thursday, and Sunday was low too. We can accept our 1st assumption of a hike in speed during the weekend; to be more specific, it holds only for the start of the weekend, while the end of the weekend doesn’t show much speed in the sensors. Did the imposter arrive on our ship on Sunday? I feel Red is sus.”

Sus pattern via visuals- RED SUS?

Blue: “Let’s see the anomaly patterns that are visible to the naked eye. The highlighted points may not be the anomalies thrown by the algorithm, since this is purely based on visualizations.”

Dead body found:
Blue: “Red has been found dead in medbay. We have two suspicious patterns in the time series chart, between Sep 05–08 and Sep 09–10. The first is highly visible to the naked eye, can literally be seen violating the pattern the graph follows, and is highly likely to be an anomaly.”

SABOTAGE by Imposter

Let’s take a closer look at the anomaly patterns

Suspicious activity 1:
No speed was recorded during this period; the sensor got stuck. The imposter must have SABOTAGED the sensors to get into the spaceship without alerting anyone. The imposter is still among us and could possibly have entered during this time.

Suspicious activity 2:
Once there was a sabotage, the imposter hid himself. Someone else arrived to fix the sensors, and that person was killed by the imposter. Later Blue reported it.

6. Building model to trace anomalies

Isolation Forest

“Pink: So we have arrived at the isolation forest to detect the anomalies?”

“White: Yes, you are certainly right. Can anyone tell me what it can do?”

“Cyan: Sure. Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of isolating anomalies, instead of the more common technique of profiling normal points.”

“Brown: I know what an iForest is, but how does it classify the anomalies?”

“Green: In the first stage, a training dataset is used to build the iTrees. In the second stage, each instance in the test set is passed through the iTrees built in the previous stage, and an “anomaly score” is assigned to the instance. Points that get isolated after only a few random splits receive a higher score. Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as “anomaly” any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to.”

“Brown: That’s cool. Thanks for the explanation.”

“Green: Alright, I’ll initialize the library with contamination set to 1%. We can also set the contamination rate as per the domain. Since we’ve got only one imposter, I have set the threshold very low.”

“Black: Team, let’s catch this imposter by detecting the anomaly!”
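
Here is a minimal sketch of Green’s setup, assuming scikit-learn’s `IsolationForest` (the article does not name the library) and using only the speed value as the feature. The toy data, `random_state` and the column names `anomaly` and `score` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy stand-in for the speed readings: 198 ordinary values plus two extremes.
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(80, 3, size=198), [20.0, 140.0]])
toy = pd.DataFrame({"value": values})

# contamination=0.01 asks the model to flag roughly 1% of the rows.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(toy[["value"]])

# predict returns -1 for anomalies and 1 for normal points.
toy["anomaly"] = model.predict(toy[["value"]])
# In scikit-learn, score_samples is lower for more abnormal points.
toy["score"] = model.score_samples(toy[["value"]])

suspects = toy[toy["anomaly"] == -1]
```

Note that scikit-learn flips the sign convention Green describes: the most suspicious rows carry the lowest `score`, and `contamination` decides where the cut-off falls.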

“Black: We have the anomalies spread over the upper and lower regions of the distribution”

“White: So the imposter was sabotaging our work from the beginning”

“Blue: I was watching Orange from the security room and he wasn’t doing any tasks. I feel Orange is sus.”

“Orange: No, I was at the medbay and I scanned at that time. Blue is wrongly accusing me.”

“White: Let’s vote out Orange and check whether Blue was right.”

“Black: Alright. We also have one more method for figuring out the anomaly, let’s try that and kick Orange out.”

Facebook Prophet

“Black: Let’s awaken the prophet which shall give us the answer for “Who is the imposter among us ?””

“Cyan: But how are we going to do it ?”

“Black: First let’s rename the columns according to Prophet’s standards, since it doesn’t work otherwise.”
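
Prophet expects its input frame to have a `ds` column (timestamps) and a `y` column (the value to forecast). A minimal sketch of the rename on a toy frame follows; the name `prophet_frame` is an assumption, and the model fitting itself is left out.

```python
import pandas as pd

# Toy frame with the column names used earlier in this article.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(["2015-09-01 10:00", "2015-09-01 10:10"]),
    "value": [80, 82],
})

# Prophet looks for exactly these two column names:
# ds = timestamps, y = the value to forecast.
prophet_frame = toy.rename(columns={"Datetime": "ds", "value": "y"})
```

A fitted Prophet model then returns predictions with `yhat`, `yhat_lower` and `yhat_upper` columns, which are the ones used in the error calculation further down.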

“Blue: I can confirm that the speed was high on weekends, and on the rest of the days it was dim except Wednesday. Hmm. The anomaly might fall on that day.”

“Lime: We shouldn’t jump to conclusions, Blue. Also, if you notice, according to the time the speed was high from early morning till evening and later dipped down at midnight.”

“Black: You both are right. Considering the days, there was a steep increase in the initial recording days, then a steep decline, and it never rose again. Can we connect all the dots?”

“Purple: Before passing any judgement, let’s figure out the error from the predictions and also calculate the uncertainty by taking the difference between the upper and lower interval bounds. That leaves us with the records which lie beyond the intervals, which is an unusual case; we can term those as anomalies.”

“Black: You are absolutely right, let’s do it”

#Calculating the error in prediction
results['error'] = results['y'] - results['yhat']

#Calculating the uncertainty: the width of the prediction interval (upper bound minus lower bound)
results['uncertainity'] = results['yhat_upper'] - results['yhat_lower']

#Displaying the data records which fall beyond the threshold
results[results['error'].abs()>1.5*results['uncertainity']]

“Purple: Now that we have got both error and uncertainty, Orange, can you classify a record as an anomaly if the error lies beyond 1.5 times the uncertainty?”

“Orange: I’m wondering why 1.5 ?”

“Purple: It depends on the application we are working on. Here we are not concerned about the values that landed within the uncertain range, but about what lies beyond it, which has to be classified as an anomaly.”
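
To see how the multiplier changes what gets flagged, here is a small sketch on a made-up forecast table. The numbers are invented for illustration, and the helper `flag` and the column name `uncertainty` are hypothetical; only the rule itself follows Purple’s description.

```python
import pandas as pd

# Toy forecast table using Prophet-style column names.
toy = pd.DataFrame({
    "y":          [80, 95, 40],
    "yhat":       [79, 82, 81],
    "yhat_lower": [74, 77, 76],
    "yhat_upper": [84, 87, 86],
})
toy["error"] = toy["y"] - toy["yhat"]
toy["uncertainty"] = toy["yhat_upper"] - toy["yhat_lower"]

def flag(frame, multiplier):
    """Return rows whose absolute error exceeds multiplier * interval width."""
    return frame[frame["error"].abs() > multiplier * frame["uncertainty"]]

strict = flag(toy, 1.5)  # only the largest miss survives
loose = flag(toy, 1.0)   # a lower multiplier flags more rows
```

With an interval width of 10, the multiplier 1.5 flags only the row that misses by 41, while 1.0 also flags the row that misses by 13.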

“Black: Yes, we got the anomaly points, which were recorded on Sep 1 and Sep 17.”

“Blue: I’m wondering who was working at those times. Orange is really sus, I couldn’t find him anywhere actually doing the task.”

“Orange: Blue really blames me all the time, he is sus.”

“White: I saw Blue vent on Sep 17.”

“Black: I thought so. Let’s vote Blue out, guys.”

All voted Blue and he was kicked out

Blue was not the imposter

Imposter wins — White

You expected a good ending eh ? Welcome to AMONG US where the people with wits survive !

Flashback
