Detecting malicious behaviour in participatory sensing settings

Security is crucial in modern computer systems hosting private and sensitive information. Our systems are vulnerable to a number of malicious threats such as ransomware, malware and viruses. Recently, a global ransomware cyberattack affected hundreds of organisations, most notably the UK’s NHS. This malicious software “locked” the content stored on organisations’ hard drives, demanding money (to be paid in bitcoins) to “unlock” it and make it available to its owners again. Crowdsourcing (the practice of obtaining information by allocating tasks to a large number of people, e.g. Wikipedia) is not immune to malicious behaviour. On the contrary, the very openness of such systems makes them ideal targets for malicious users to alter, corrupt or falsify information (data poisoning). In this post, we present an environmental monitoring example, where ordinary people take air quality readings (using mobile equipment) to monitor air pollution in their city or neighbourhood (see our previous post for more details on this example). Arguably, some people participating in such environmental campaigns can be malicious. Specifically, instead of taking readings to provide information about their environment, they might deviate by following their own secret agenda. For instance, a factory owner might alter readings showing that their factory pollutes the environment. The impact of such falsification is huge, as it essentially changes the overall picture of the environment, which in turn leads authorities to take the wrong actions regarding urban planning.

We argue that Artificial Intelligence (AI) techniques can be of great help in this domain. Given that measurements have a spatio-temporal correlation, a non-linear regression model can be overlaid on the environment (see previous post). The tricky part, however, is to differentiate between truthful and malicious readings. A plausible solution is to extend the non-linear regression model by assuming that each measurement has its own independent noise (variance), a property known as heteroskedasticity. For instance, a Gaussian Process (GP) model can be used initially and then extended to a Heteroskedastic GP (HGP). The benefit is that this individual noise can indicate how much each measurement deviates from the truthful ones, which can be attributed either to sensor noise (which is always present in reality) or to malicious behaviour. An extended version of HGP, namely Trust-HGP (THGP), adds a trust parameter to the model that captures the probability of each measurement being malicious, taking values in the interval (0, 1). The details of the THGP model, as well as how it is utilised in this domain, will be presented at the end of October at the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017). Stay tuned!
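To give a feel for how an individual noise term can down-weight a suspicious reading, here is a toy sketch. It is not the THGP model itself; it merely approximates a heteroskedastic GP using scikit-learn's per-point `alpha` noise, and the trust values are made up for illustration. Once a reading is assigned low trust, its noise variance is inflated and the model's prediction is pulled back towards its truthful neighbours:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 20).reshape(-1, 1)       # measurement locations
y = np.sin(X).ravel() + rng.normal(0, 0.1, 20)  # truthful readings + sensor noise
y[5] += 3.0                                     # one falsified reading

# Trust in (0, 1): low trust inflates that reading's individual noise variance.
trust = np.ones(20)
trust[5] = 0.05                                 # pretend the model flagged reading 5
per_point_noise = 0.1**2 / trust                # heteroskedastic noise levels

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=per_point_noise).fit(X, y)
mu = gp.predict(X)
# mu[5] ends up closer to the underlying signal than the falsified y[5]
```

The point of the sketch is just the mechanism: because the flagged reading carries a large variance, the smooth fit effectively ignores it instead of bending towards it.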

Research Internship – Data science/Machine Learning

This post aims to describe my experiences from my three-month research internship at Toshiba Research Labs, Bristol, UK and the project I have been working on (September – December 2015).

I remember the day I first went there for my interview. The building stood between a wonderful small square park and a river, just a five-minute walk from the city centre. But that was not the only thing I really liked. Working there, I realised the importance of culture in a firm. I appreciated the importance of collaboration, brainstorming and creativity. It was an academia-like environment: friendly, down-to-earth people with lots of ideas and knowledge on a variety of subjects. Everyone was approachable and you could discuss anything with them. I could communicate with colleagues effectively without having to worry about business formalities.

The project I worked on was intriguing. That was the main reason I applied for this research internship in the first place. It combined my academic interest in Machine Learning and my personal interest in human wellbeing. In short, the project was about Mood Recognition at Work using Wearable Devices. In other words: understand, learn and attempt to predict someone’s mood (happiness/sadness/anger/stress/boredom/tiredness/excitement/calmness) using just a wearable device (which could be a smart wristband, a chest sensor or anything able to capture vital signs). Sounds impossible, right? How can you predict something as complicated as human emotions? We, as humans, often struggle to pin down our own mood. For example, how would you say you feel right now? Happy? Sad? OK? This is indicative of the complexity of the problem we were facing. Moreover, we wanted to run unscripted experiments, meaning we did not want to induce any emotions in the participants of our study. We rather wanted them to wear a smart device and log their mood at 2-hour intervals while at work, as accurately as they could. Surprisingly, at least for me, there was genuine variation in their responses: some higher, some lower, but all of them varied. That was encouraging.

We had to study the literature and do some research to answer the following question: how could we extract meaningful features from vital signs and accelerometer signals that would have predictive capabilities in terms of emotions? After some digging around, we found the relevant literature. It was not a new concept. There were studies in both the medical and the computer science literature associating heart rate with stress and skin temperature with fatigue. We wanted to take this further and check whether a combination of all these signals could have more powerful predictive ability. Intuitively, think about the times you felt stressed. Your heart might pump faster, but sometimes your foot or hand might be shaking as well. The shaking can be captured by the accelerometer, and together the two signals can serve as an additional indicator of a stressful situation.

We ended up with hundreds of features and tested a number of basic machine learning techniques, such as Decision Trees and SVMs.
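For the curious, here is a rough, self-contained sketch of the kind of pipeline this involves. The data is synthetic and the features (heart-rate mean and variability, movement intensity) only illustrate the general idea, not the actual feature set we used:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

def window_features(hr, accel):
    """Summary statistics over one window of heart-rate and accelerometer signals."""
    mag = np.linalg.norm(accel, axis=1)   # overall movement intensity
    return [hr.mean(), hr.std(), mag.mean(), mag.std()]

# Synthetic windows: "stressed" (label 1) windows get a faster, more variable
# heart rate and more movement than "calm" (label 0) ones.
X, y = [], []
for label in (0, 1):
    for _ in range(50):
        hr = rng.normal(70 + 15 * label, 5 + 3 * label, size=200)
        accel = rng.normal(0, 1 + label, size=(200, 3))
        X.append(window_features(hr, accel))
        y.append(label)
X, y = np.array(X), np.array(y)

scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                            ("svm", SVC())]}
```

On real mood data the classes are of course far less cleanly separated than in this synthetic example; that gap is exactly what makes the problem hard.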

Our results were good, comparable to those in the literature. Thus, we decided to publish our findings in the PerCom 2016 conference proceedings (WristSense Workshop, http://ieeexplore.ieee.org/document/7457166/).

Further, a number of ideas for patents were discussed and exciting new avenues for potential work were drawn up.

Overall, I would recommend an internship during a PhD programme as it is a very rewarding experience.

I would like to take this opportunity to thank all of the employees, managers, directors there for the unique experience and their confidence in me.

Submodularity in Sensor Placement Problems

Many problems in Artificial Intelligence, and in computer science in general, are hard to solve. In practice, this means it could take a computer hundreds, thousands or even millions of years of computation to solve them exactly. Thus, many scientists design algorithms that solve difficult problems approximately, but within a sensible time period, i.e., seconds/minutes/hours.

One such problem is the sensor placement problem. The key question here is to find a number of locations at which to place some sensors in order to achieve the best coverage of the area of interest. To solve this problem exactly, a computer has to evaluate all the possible combinations of placing the available sensors in the different locations. To give some numbers: with 5 sensors and 100 possible locations, one has to try 75,287,520 combinations in order to find the best arrangement. Imagine what happens when the problem is about placing hundreds of sensors in a city where there are hundreds or thousands of candidate locations.
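You can verify that number yourself: it is the binomial coefficient C(100, 5), the number of ways to choose 5 locations out of 100.

```python
import math

locations, sensors = 100, 5
print(math.comb(locations, sensors))   # 75287520 candidate placements

# The count explodes quickly: a hundred sensors over a thousand sites
# already exceeds a googol (10^100) of combinations.
print(math.comb(1000, 100) > 10**100)  # True
```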

In such problems, submodularity comes in handy. It is an extremely important property used in many sensor placement problems. It is a property of set functions that captures diminishing returns: adding an element to a small set yields a higher return/utility/value than adding the same element to a larger set. This can be better understood with an example. Imagine having 10,000 sensors scattered in a big room, taking measurements of the temperature every 2 hours. Now imagine adding another sensor to that room. Have we really gained much by doing so? We had a large set and we added something. Now imagine the same room with only 1 sensor. Adding 1 more can give us a better understanding of some corner, or a better estimate of the true average temperature of the room. So this sensor was much more valuable than in the previous case. This is what I mean by saying that adding something to a smaller set has a higher utility.
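A tiny sketch makes the diminishing-returns property concrete. Here a simple coverage function counts grid cells within range of at least one sensor (the room layout, positions and radius are made up for illustration):

```python
def coverage(sensors, cells, radius=2):
    """Number of grid cells within Manhattan distance `radius` of any sensor."""
    return sum(
        any(abs(cx - sx) + abs(cy - sy) <= radius for sx, sy in sensors)
        for cx, cy in cells
    )

cells = [(x, y) for x in range(8) for y in range(8)]          # the "room"
small = [(0, 0)]                                              # nearly empty room
large = small + [(2, 2), (4, 4), (6, 6), (6, 0), (0, 6)]      # well-covered room
extra = (5, 5)                                                # the sensor we add

gain_small = coverage(small + [extra], cells) - coverage(small, cells)
gain_large = coverage(large + [extra], cells) - coverage(large, cells)
# Diminishing returns: the same extra sensor helps the sparse room more.
```

Coverage functions like this one are the textbook example of a monotone submodular function, which is why they show up so often in sensor placement work.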

It turns out that this property is very useful in maths, and in computer science and AI in particular, as it allows us to build algorithms with theoretical guarantees. It has been proved that a greedy algorithm achieves at least 63% (more precisely, 1 − 1/e) of the optimal value. This was initially proved by Nemhauser et al. in a mathematical context and later applied by Krause et al. in the field of computer science, especially to the sensor placement problem. The image below illustrates this property in terms of diagrams, to give a better feeling of what it is about.
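Here is a minimal sketch of that greedy algorithm on a toy coverage problem, together with a brute-force check that its value indeed stays above the 1 − 1/e bound (the grid, candidate locations and coverage function are all illustrative):

```python
import math
from itertools import combinations

def coverage(sensors, cells, radius=2):
    """Number of grid cells within Manhattan distance `radius` of any sensor."""
    return sum(
        any(abs(cx - sx) + abs(cy - sy) <= radius for sx, sy in sensors)
        for cx, cy in cells
    )

cells = [(x, y) for x in range(6) for y in range(6)]
candidates = [(x, y) for x in range(0, 6, 2) for y in range(0, 6, 2)]
k = 3  # number of sensors we can afford

def greedy_placement(candidates, k):
    """Repeatedly pick the location with the largest marginal coverage gain."""
    chosen = []
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: coverage(chosen + [c], cells))
        chosen.append(best)
    return chosen

greedy_value = coverage(greedy_placement(candidates, k), cells)
optimal_value = max(coverage(list(s), cells) for s in combinations(candidates, k))
# Nemhauser et al.'s bound: greedy_value >= (1 - 1/e) * optimal_value
```

On this toy instance the greedy solution is near-optimal; the theorem guarantees it can never fall below ~63% of the optimum for any monotone submodular objective.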

Submodularity (taken from Meliou et al. power point presentation)

Artificial Intelligence to save the environment or destroy the world?

In a recent post I briefly described my experiences from the AAMAS conference in Turkey. What I haven’t talked about is the topic and details of the paper of mine that was accepted there. This post aims to introduce my research and provide a summary of my recently published paper.

In 2014, the World Health Organization estimated that 7 million people had died from diseases associated with air pollution. These lives could potentially have been saved if measures had been taken in time. But can we really take measures when we do not know where and when pollution is high, and only know vaguely that air pollution is caused by traffic and industrial pollutants released into the atmosphere? What I mean is that a more collective effort is required to really understand air pollution in terms of its spatial as well as its temporal distribution. In fact, there are indeed static sensors scattered in cities all over the world. But are they enough? These are expensive sensors placed in areas away from pollution sources in order to estimate the average pollution in that area. Is that what we want? Sort of. How about the kid that walks to school every day, spending 10 minutes next to a congested road? How about the cyclists that chose to cycle to be environmentally friendly and, importantly, to be healthier? But are they really healthy, cycling behind buses and cars? Well, I am sure that the air quality index displayed by the static sensor is nowhere near the reality for those who spend their time on busy roads.

Here is the alternative: give people the power to measure their own pollution exposure! Well, this is already happening, and this is what participatory sensing is about. Citizens, carrying sensors around, take all sorts of measurements. Let’s think about that. Carrying a sensor around… we all do. Our smartphone, which at least 7 out of 10 people in the UK own (according to studies in 2013), is a sensor. In fact, it is multiple sensors embedded in a single handheld device. At the moment, phones are not able to measure air pollution, but we are getting there. I mean, phones can already monitor your heart rate, and with each generation of phones a new sensor is added. Even if monitoring air pollution from your phone might be a few years away, there are mobile sensors that can easily be paired with phones via USB or Bluetooth.

However, people live their lives and follow their own daily agendas. They are not going to run around the city all day and night taking measurements in order to spatio-temporally cover their city. Even if they wanted to, their mobile phone’s battery would betray them. How long can it last while utilizing battery-draining sensors?

Enough of the introduction. My paper focuses on making these environmental campaigns that expect citizens to take measurements succeed. How? First of all, we assume that people have a cost for taking a measurement. This could represent the inconvenience to the user of taking a measurement. It could also represent a small payment, if the environmental campaign has the resources for one. Or it could even represent the phone battery life consumed by activating multiple sensors such as GPS (and Bluetooth, if the phone is paired with an air quality sensor).

Another factor that we consider is the mobility patterns of people. It is known (at least in the research community) that people are typically predictable in their daily routine. I do not know about you but this is definitely true for me. Except some times. Sometimes I deviate. Or so I think. Anyway, there is a lot of literature on this topic and I am not getting into details.

So, the big question is: where, when and who should take measurements in order to best monitor the environment over a period of time, given that each user incurs a different cost for taking a single measurement? Well, this is the question that my algorithm addresses. It is about mapping each participant to a location at some point in time, in a way that makes the suggested measurement as significant as possible in the effort of monitoring the environment. The good thing is that no one has to deviate from their route. Given that I always wake up and go to “work”, the algorithm could tell me to take a measurement somewhere along the way. This is the point of using human mobility patterns in the first place: to exploit the available intel.

Well, what do you think? I think this is better than having people walk around like zombies taking measurements for your experiments, compensating them with 20 dollars each, in a project that will cease to exist the day the funding is over and that no one will actually use in practice after you have successfully published your paper. Don’t you think? Or your phone could even deal with everything, provided privacy concerns are met. For example, you could set it to take measurements where and when the algorithm decides, without you explicitly knowing. These are the kinds of ideas circling in my mind. It might not be that good yet, but this is the idea of my work: to make participatory sensing campaigns a thing. We are still a bit far from a real-world trial, but we are getting there. We need to get our facts right (in terms of assumptions about the problem) and make it as good as possible given the uncertain environment and uncertain human behavior that AI will encounter.

So, will AI destroy the world? I don’t know. What I know is that the same research done to save the world could easily be modified to destroy it. Imagine that 10 F-16s are deployed to bomb different terrorist bases, or that a number of unmanned aerial vehicles are sent out with pre-determined targets. Now, someone could ask: where and when should these planes or drones release bombs, on their way to their targets or on their way back, in order to maximize the damage caused to the enemy, given that each bomb has a specific cost? Well, unfortunately, the solution is already given by my AI algorithm. Thankfully, however, the answer cannot be computed right now, as one important component is missing.

What does it mean when we say we collect information by taking a measurement? We imply that there is some sort of model over the environment that will give us a number, a value, something beyond the raw air pollution index. Fortunately, for environmental monitoring there is a lot of work on how to do this. We chose to use Gaussian Processes because of their power and flexibility, but most importantly because they give you the uncertainty over the locations of interest, in both space and time. More about them in another post! So, to destroy the world, you would also need such a model.
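As a hint of why that uncertainty matters, here is a minimal GP sketch using scikit-learn (the locations and pollution readings are made up): the predictive standard deviation is small between existing readings and large far away from them, which is exactly the signal that tells an algorithm where the next measurement is most valuable.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0], [1.0], [2.0]])  # locations where readings were taken
y = np.array([40.0, 55.0, 50.0])     # pollution readings (illustrative)

# Fixed kernel (optimizer=None) keeps the sketch deterministic.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-2, optimizer=None,
                              normalize_y=True).fit(X, y)
mu, std = gp.predict(np.array([[1.5], [8.0]]), return_std=True)
# std at 8.0 (far from all readings) is much larger than std at 1.5
# (between two readings) -- so 8.0 is where a new measurement helps most.
```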

AAMAS (Autonomous Agents and Multi-Agent Systems) 2015 Conference

I recently had the opportunity to attend one of the most well-known and prestigious conferences in the area of Artificial Intelligence, and Agents more specifically. This year, the conference took place at the Congress Centre in Istanbul, Turkey.

For me, it was the first conference I ever attended, and I have to admit it was a wonderful experience overall. It was also the first time I gave a talk in front of so many people, experts in the field! I was a bit shaky and nervous, but everything went as planned.

My talk was allocated to the Applications session on Wednesday evening, the 6th of May 2015. I was at the conference centre early on to watch other talks. In particular, I attended the Bio-inspired Approaches session. In my opinion, the very first talk was the best one in this session, as the speaker took the trouble to explain the important bits and pieces in layman’s terms. Second on my list would be Firefly-Inspired Synchronization in Swarms, which I believe presented an important concept. Specifically, it was about the way fireflies synchronize their flashing without ever explicitly communicating with each other. As the speaker noted, it is kind of the same with women’s menstrual cycles.

Another talk that caught my attention was HAC-ER, presented in my session (Applications). It described a big project, a joint paper among three universities (Southampton, Oxford, Nottingham), about enabling authorities and first responders to take better action after a major disaster such as an earthquake.

Besides attending friends’ talks, I had the opportunity to talk for a while, during the poster session, with someone working in an area related to mine. Hopefully, a collaboration can come out of this.

All in all, I hope that I will get lots of opportunities like this in the future.


Participatory Sensing Applications

NoiseTube

NoiseTube is a project that tackles the noise pollution problem in many large cities in Europe. In particular, the deployment is focused on Brussels, Paris and London. It proposes a participative approach to monitoring noise pollution by involving the general public. Part of this project is the NoiseTube app, a smartphone application which turns smartphones into noise sensors, enabling citizens to measure the sound exposure in their everyday environment. Each participant is able to share their geolocalized measurement data in an attempt to create a collective map of noise pollution, which is available to NoiseTube community members. The main motivation for participation is social interest. In other words, people contribute in order to understand their noise exposure, to build a collective map, to help local governments tackle noise pollution by understanding noise statistics, and to assist researchers by providing real data to analyse.

On the other hand, this project enables system designers to assess the potential of the participatory sensing approach in the context of environmental monitoring. In particular, a smartphone application, being a widely adopted technology, can potentially reach thousands of people and thus cover large cities. In this way, it can provide interested parties with a noise pollution map that is complete and accurate in terms of the noise exposure of individuals, so that they can take action.

The authors argue that although noise pollution is a major problem in cities around the world, current monitoring approaches fail to assess the actual exposure experienced by citizens. In particular, static sensors are located away from streets and emission sources in order to reflect the average pollution over an area. Consequently, they might underestimate people’s true exposure to noise pollution. Participatory sensing thus provides a low-cost way for citizens to measure their personal exposure and contribute to the community by taking measurements at the pollution sources. This approach seems to work well, achieving the same accuracy as standard noise mapping techniques but at a significantly lower cost, as neither expertise nor expensive sound level meter equipment is required.

GasMobile

GasMobile is a low-power and low-cost mobile sensing system for participatory air quality monitoring. Instead of relying on the expensive static measurement stations operated by official authorities for highly reliable and accurate measurements, GasMobile relies on the participatory sensing paradigm. In particular, GasMobile combines a small, low-cost ozone sensor with an off-the-shelf smartphone. Besides taking ozone measurements to calculate air quality, the system can also exploit nearby static measurement stations to improve calibration and, consequently, its accuracy. The system was used in a two-month campaign in an urban area. Specifically, it was attached to a single bicycle and took measurements during several rides around the city. The sampling interval was pre-set to five seconds, collecting a total of 2,815 spatially distributed data points. The collected data were aggregated based on the area excerpt selected by a user interested in the results. To produce the map, the authors divided this area into rectangular regions of 35×35 pixels and took the average ozone concentration of the observations in each region. Each region was then classified into one of three categories (green, yellow, red) depending on the average concentration value.
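That aggregation step can be sketched in a few lines. This is not the GasMobile code; the cell size matches the 35-pixel regions described above, but the thresholds and readings are invented for illustration:

```python
import numpy as np

def pollution_map(readings, cell=35, thresholds=(40.0, 80.0)):
    """Aggregate geo-tagged ozone readings into per-cell colour classes.

    readings: iterable of (x_pixel, y_pixel, ppb) tuples.
    thresholds: illustrative green/yellow and yellow/red cut-offs in ppb.
    """
    cells = {}
    for x, y, ppb in readings:
        cells.setdefault((int(x // cell), int(y // cell)), []).append(ppb)
    classify = lambda avg: ("green" if avg < thresholds[0]
                            else "yellow" if avg < thresholds[1] else "red")
    return {c: classify(np.mean(v)) for c, v in cells.items()}

demo = pollution_map([(10, 10, 25.0), (20, 30, 35.0),  # a quiet corner
                      (100, 100, 95.0)])               # near a busy road
# demo == {(0, 0): "green", (2, 2): "red"}
```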

The system is currently at the prototyping stage but has great potential, as it shows that air pollution monitoring can be achieved in a cost-effective manner. The results also show that participatory sensing can produce results of high accuracy: the mean error over the 2,815 measurements was 2.74 ppb, which is only slightly higher than in a static setting.

Citisense

Another important participatory sensing application is Citisense, whose purpose is to monitor air pollution in large regions such as San Diego, California, US. Citisense consists of three components: a wearable pollution sensor, a mobile phone application and a web interface. Users carry the pollution sensor and the mobile phone with them throughout the day in order to learn their air pollution exposure. The web interface provides a more detailed view of that exposure, as well as air pollution maps built with users’ historic air pollution data. The sensor is connected to the mobile phone via Bluetooth and is able to take measurements for five days on a single charge. The mobile phone app is responsible for collecting readings from the sensor and presenting them to the user. Each reading is time-stamped and geo-tagged using the mobile phone’s GPS and network-based localization services. Citisense was trialled in the field for one month, involving 16 participants. The results show that users’ exposure differs from the average measurements displayed by the static sensors scattered in cities. In particular, the participatory sensing approach is able to identify pollution hot spots in the micro-environment that have developed due to busy roads, buildings and natural topography. Citisense also made an impact on people’s awareness. Specifically, participants came to understand the properties of air pollution better; in particular, they realized that air pollution spikes near busy streets or buses. However, as the authors admit, power management remains an important challenge.

ExposureSense

ExposureSense is a participatory sensing project that attempts to monitor air pollution in cities. It exploits the increasing number of sensors that smartphones tend to have, turning them into powerful mobile sensing devices. ExposureSense takes a different approach from other participatory sensing applications for air pollution: it attempts to correlate humans’ daily activities with air quality monitoring in order to estimate each user’s daily pollution exposure. To do so, the smartphone’s accelerometer is used to infer the activities of users, and an external mobile sensor is used for air quality monitoring. In particular, machine learning techniques are applied to accelerometer data to infer users’ daily activities. To gather data from mobile devices, smartphones are connected to air quality sensors via a USB cable. Data are also collected from external sensor networks, combined with the data collected from users, and interpolation is performed. The data are spatio-temporally correlated in order to estimate people’s daily pollutant exposure. Exposure intensity is scaled based on activity type, burned calories and movement speed.

HazeWatch

HazeWatch is another low-cost participatory sensing system for urban air pollution monitoring in Sydney. HazeWatch uses several low-cost sensor units attached to vehicles to measure air pollution concentrations, and users’ mobile phones to tag and upload data in real time. This project identifies the disadvantages of current approaches, i.e., using static sensors to monitor air pollution in cities. Typically, there are only a few static sensors scattered in a city, and air pollution is inferred using mathematical models which require complex input, such as land topography, meteorological variables and chemical compositions. This leads to potentially inaccurate inferences, as well as underestimation of the public’s true exposure to air pollution. HazeWatch aims to crowdsource fine-grained spatial measurements of air pollution in Sydney and to engage users in managing their pollution exposure via personalized tools. Specifically, HazeWatch, among other things, suggests low-pollution routes to users.
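The low-pollution routing idea can be sketched as an ordinary shortest-path search where edge weights are pollution exposures rather than distances (the road graph and exposure values below are invented for illustration, not HazeWatch's actual data or algorithm):

```python
import heapq

def least_exposure_route(graph, start, goal):
    """Dijkstra over edges weighted by pollution exposure instead of distance."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, exposure in graph[u]:
            nd = d + exposure
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:                      # walk predecessors back to start
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

# Two ways to the office: a short but congested main road (high exposure)
# versus a longer but cleaner detour through the park.
roads = {
    "home":    [("main_rd", 9.0), ("park", 2.0)],
    "main_rd": [("office", 1.0)],
    "park":    [("back_st", 2.0)],
    "back_st": [("office", 2.0)],
    "office":  [],
}
path, exposure = least_exposure_route(roads, "home", "office")
# path == ["home", "park", "back_st", "office"], total exposure 6.0 (vs 10.0)
```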

P-sense

P-sense is a work in progress that utilizes the concept of participatory sensing to monitor air pollution. The ultimate goal of this project is to allow government officials, international organizations, communities and individuals to access the pollution data to address their particular problems and needs. P-sense enables air pollution measurements at a finer granularity than what is currently achieved by static sensors in cities. It also enables users to assess their exposure to pollution according to the places they visit during their daily activities. P-sense is easily extensible, allowing the integration of existing data acquisition systems that could enrich the air quality dataset. P-sense consists of four main components: the sensing devices, the first-level integrator, the data transport network, and the servers. The environmental data are collected by a number of sensors (such as gas, temperature, humidity, carbon monoxide, carbon dioxide and air quality sensors) connected to mobile phones via Bluetooth. All environmental data acquired from those sensors are transmitted to a first-level integrator device, i.e., the mobile phone. The phone is capable of real-time analysis of the data, providing visual feedback to users. The first-level integrators transmit environmental data over the Internet (the data transport network) to a dedicated server, where they are stored and processed. Users can connect to the server and get visual feedback on the data. However, there are several important research challenges to address before this system is deployed in the real world. These are related to data validity, incentives, visualization, privacy and security. Moreover, as in other applications, the Bluetooth connection drains the mobile phone’s battery.

CommonSense

CommonSense is a participatory sensing project that aims to design a mobile air quality monitoring system. The authors conducted interviews with citizens, scientists and regulators in order to derive the principles and framework for data collection and citizen participation. Unlike other applications, they break analysis into discrete mini-applications designed to scaffold and facilitate novice contributions. This approach allows community members impacted by poor air quality to engage in the process of locating pollution sources and exploring local variations in air quality. Based on the fieldwork, a set of personas was developed to characterize the relevant stakeholders. Specifically, `Activists' are responsible for orchestrating actions and publicizing environmental issues; `Browsers' are interested in environmental quality but not directly involved in sensing; and `Data collectors' are novice community members who are likely to be affected by air pollution. The main principles extracted from the interviews are: goal-oriented, i.e., what is the personal exposure of individuals and what are the hot spots in the city; local and relevant, i.e., participants are mostly interested in their neighborhood and the areas they frequently visit; elicit latent explanations and expectations, and prompt realizations, which are about taking into consideration people's local knowledge and expertise, such as beliefs about the sources of air pollution in their area; and language barriers, i.e., users could benefit from being introduced to scientific language where possible. This analysis led to the development of a framework divided into six phases: Collect, Annotate, Question/Observe, Predict/Infer, Validate, Synthesize. Collect is the phase where the actual sensing takes place. Annotate is the step after collecting data, where data collectors provide additional insights that contextualize and supplement it. Question/Observe is the step where data collectors begin to ask basic questions, such as what their personal exposure is or whether air quality is bad at their home, based on their own and other collectors' data. Infer/Predict builds on these questions, and predictions are made for unobserved locations. Validate is the stage where data collectors' data are compared against data from organizers and activists, and checked for sufficient coverage of the area of interest. Finally, Synthesize is the highest level, where data are integrated and documentation, reports and other deliverables are produced.

Besides relying on citizens to take measurements, CommonSense attempts to monitor air pollution by other means. In particular, in one study the team ran trials with air quality monitoring devices attached to the rooftops of street-cleaning vehicles in the city of San Francisco. These devices are paired with mobile phones that send data to the CommonSense servers. This way, systematic coverage of a large city can be achieved, and the system can be tested, refined and calibrated for future deployments.

OpenSense

OpenSense is a project that aims to monitor air pollution in large cities such as Zurich, Switzerland. More than 25 million measurements were collected over more than a year from sensors attached to the tops of public transport vehicles. Based on these data, land-use regression models were built to create pollution maps with a spatial resolution of 100 m × 100 m. One of the challenges this approach aims to tackle is the lack of fine-grained spatio-temporal air quality data: static sensors are expensive to acquire and maintain, and thus only a few are placed in each city. The proposed system consists of 10 nodes installed on top of public transport vehicles that cover a large urban area on a regular schedule. The collected data are processed, and predictions about unobserved locations are made using the regression models. Although this is a good approach for providing fine-grained spatio-temporal information about air pollution, nothing is said about the battery consumption of the sensors, which send measurements in real time over GSM and use GPS satellites to obtain their location. Also, measurements are only taken on roads with bus routes, and since the sensors are placed on top of the vehicles, they endure vibration, heat, humidity and long operating times, which might lead to inaccurate measurements.

The Next Big One

The Next Big One is a participatory sensing application for the early detection of earthquakes. These events are difficult to model and characterize a priori. The application therefore utilizes the accelerometer sensors available on smartphones to detect such rare events. The focus of this project is to harness the power of the crowd, i.e., the wide availability of accelerometer sensors, for early earthquake detection. In shake-table experiments, it was found that it is possible to distinguish seismic motion from accelerations due to normal daily use. However, for the application to be robust, thousands of phones must be utilized. It is estimated that a million phones would produce 30 terabytes of accelerometer data per day.

TrafficSense

TrafficSense is a participatory sensing application for monitoring road and traffic conditions. In particular, it relies on people carrying their smartphones with them while traveling, utilizing sensors such as the accelerometer, microphone, GSM radio and/or GPS to detect potholes, bumps, braking and honking. The effectiveness of the sensing functions was tested on the roads of Bangalore, showing that it is possible to monitor the roads using the variety of sensors built into the smartphones that users carry with them. In particular, the accelerometer was used to detect braking, to distinguish pedestrians from users stuck in traffic, and to detect spikes that would suggest bumps in the road. Audio was recorded using the phone’s microphone in order to detect noisy and chaotic traffic conditions. Finally, GPS and GSM cell triangulation were used to localize users’ positions.