Events play a very prominent role in our lives. Therefore, many social media documents describe or relate to some event. However, without any structure in these documents, it is difficult for a human to gather the relevant information. Organizing social media documents with respect to events thus seems a promising way to manage the ever-increasing amount of content that is shared through social media applications. The challenge is to automate this process so that incoming documents can be assigned to their corresponding event without any user intervention.
In this dissertation, we present an event-based stream classification framework that classifies a never-ending stream of social media data into a growing and evolving set of events. The framework assigns each item newly uploaded to a social media site to its corresponding event if that event already exists; otherwise, it creates a new event to which future data items can be assigned. We refer to this problem as the event detection problem and propose to use machine learning techniques to tackle it.
We address several key challenges that arise in this context: i) handling the data in a stream-based setting, i.e., addressing the challenges arising from the need to process a never-ending stream of data, ii) scaling to the data sizes and rates usually encountered in social media applications, and iii) tackling the new event detection problem, i.e., the problem of determining whether an incoming data item belongs to a new or to an already known event.
We address these challenges through a classification approach that allows us to process the data in a single pass. The approach includes a candidate event retrieval step, which retrieves a set of candidate events that the incoming data point is likely to belong to. It also includes a function, trained using machine learning techniques, that determines whether the incoming data point belongs to the top-scored candidate or rather to a new event. We maximize the performance of our system using different optimization strategies so that it outperforms many other state-of-the-art approaches.
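The single-pass procedure can be illustrated with a minimal sketch. It is not the framework's actual implementation: the names `Event`, `retrieve_candidates`, `assign`, and `cluster_stream` are hypothetical, documents are assumed to be reduced to token sets, Jaccard overlap stands in for the similarity measure, and a fixed `threshold` stands in for the function trained with machine learning techniques.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A growing cluster of documents; `tokens` is the union of member tokens."""
    event_id: int
    tokens: set = field(default_factory=set)
    members: list = field(default_factory=list)


def jaccard(a, b):
    """Set-overlap similarity in [0, 1] (assumed stand-in for a learned similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_candidates(doc_tokens, events, k=3):
    """Candidate retrieval: return the k best-scoring events as (score, event) pairs."""
    scored = [(jaccard(doc_tokens, e.tokens), e) for e in events]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def assign(doc_id, doc_tokens, events, threshold=0.3):
    """Assign one document to the top-scored candidate or open a new event."""
    candidates = retrieve_candidates(doc_tokens, events)
    # Assumption: a fixed threshold replaces the trained new-event decision function.
    if candidates and candidates[0][0] >= threshold:
        event = candidates[0][1]
    else:
        event = Event(event_id=len(events))
        events.append(event)
    event.tokens |= doc_tokens
    event.members.append(doc_id)
    return event.event_id


def cluster_stream(stream, threshold=0.3):
    """Process (doc_id, tokens) pairs in one pass; return the event id per document."""
    events = []
    return [assign(doc_id, set(tokens), events, threshold) for doc_id, tokens in stream]
```

Each document is seen exactly once, and the set of events grows only when no retrieved candidate scores high enough. A short usage example with made-up documents:

```python
stream = [
    ("a", ["concert", "berlin", "rock"]),
    ("b", ["concert", "berlin", "stage"]),
    ("c", ["marathon", "london"]),
    ("d", ["london", "marathon", "run"]),
]
print(cluster_stream(stream))  # [0, 0, 1, 1]
```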
Further, we extend our framework so that it can be used in a multi-pass setting. Using this approach we show that we can improve the quality of the clustering significantly in comparison to the single-pass approach, while also lowering the computational time by one order of magnitude. We show that this extension can be used in a stream-based setting while reaching the quality of a computationally very expensive offline clustering algorithm.
We demonstrate that our highly efficient approach is capable of successfully clustering a real-world, non-toy dataset by introducing a new dataset consisting of user-contributed images together with associated metadata describing the events they depict. The dataset has been published previously and is well known in the community. Our single-pass and multi-pass strategies reach F-measure scores of 88.6% and 93.9%, respectively. In conclusion, we show that our framework not only addresses the challenges mentioned above but also outperforms other state-of-the-art approaches in terms of quality and scalability.