{"id":37,"date":"2016-07-01T18:19:48","date_gmt":"2016-07-01T16:19:48","guid":{"rendered":"https:\/\/www.wurzer.com\/dominikweb\/?page_id=37"},"modified":"2020-02-10T11:38:47","modified_gmt":"2020-02-10T10:38:47","slug":"37-2","status":"publish","type":"page","link":"https:\/\/www.wurzer.com\/dominikweb\/","title":{"rendered":"Home &#8211; neu"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; background_color=&#8221;#ffffff&#8221; background_video_mp4=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/razer.mp4&#8243; parallax=&#8221;on&#8221; parallax_method=&#8221;off&#8221; padding_mobile=&#8221;off&#8221; fullwidth=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_fullwidth_header title=&#8221;Dominik Wurzer&#8221; subhead=&#8221;PhD student at the University of Edinburgh&#8221; background_layout=&#8221;dark&#8221; header_fullscreen=&#8221;on&#8221; background_overlay_color=&#8221;rgba(12,113,195,0)&#8221; parallax=&#8221;on&#8221; admin_label=&#8221;Fullwidth Header&#8221;] I am a PhD student at the University of Edinburgh, focusing on information retrieval on high-volume data streams and applied machine learning. 
[\/et_pb_fullwidth_header][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; background_color=&#8221;#fcfcfc&#8221; padding_mobile=&#8221;off&#8221; admin_label=&#8221;Section&#8221;][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; fullwidth=&#8221;on&#8221;][et_pb_fullwidth_header title=&#8221;\u00a0&#8211; Publications -&#8221; text_orientation=&#8221;center&#8221;][\/et_pb_fullwidth_header][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p class=\"\"><strong>Counteracting Novelty Decay in First Story Detection<\/strong><\/p>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner padding_mobile=&#8221;off&#8221; column_padding_mobile=&#8221;off&#8221; parallax_method_1=&#8221;off&#8221; parallax_method_2=&#8221;off&#8221; admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p>In this paper we explore the impact of processing unbounded data streams on First Story Detection (FSD) accuracy. In particular, we study three different types of FSD algorithms: comparison-based, LSH-based and k-term-based FSD. Our experiments reveal for the first time that the novelty scores of all three algorithms decay over time. We explain why the decay is linked to increased space saturation and why it negatively affects detection accuracy. We provide a mathematical decay model, which allows compensating observed novelty scores by their expected decay. 
Our experiments show significantly increased performance when counteracting the novelty score decay.<\/p>\n<p>[\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2017\/03\/ecirNovelty.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; button_icon=&#8221;%%76%%&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2017\/03\/posterECIR.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download poster&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p class=\"\"><strong>Spotting Information biases in Chinese and Western Media<\/strong><\/p>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner padding_mobile=&#8221;off&#8221; column_padding_mobile=&#8221;off&#8221; parallax_method_1=&#8221;off&#8221; parallax_method_2=&#8221;off&#8221; admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p>Newswire and Social Media are the major sources of information in our time. 
While the topical demographic of Western Media has been the subject of past studies, less is known about Chinese Media. In this paper, we apply event detection and tracking technology to examine the information overlap and differences between Chinese and Western &#8211; Traditional Media and Social Media. Our experiments reveal a biased interest of China towards the West, which becomes particularly apparent when comparing the interest in celebrities.<\/p>\n<p>[\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2017\/03\/spottingInformationBias.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; button_icon=&#8221;%%76%%&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p class=\"\"><strong>Spotting Rumors via Novelty Detection<\/strong><\/p>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner padding_mobile=&#8221;off&#8221; column_padding_mobile=&#8221;off&#8221; parallax_method_1=&#8221;off&#8221; parallax_method_2=&#8221;off&#8221; admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_text admin_label=&#8221;Text&#8221;]<\/p>\n<p>Rumour detection is hard because the most accurate systems 
operate retrospectively, only recognizing rumours once they have collected repeated signals. By then the rumours might have already spread and caused harm. We introduce a new category of features based on novelty, tailored to detect rumours early on. To compensate for the absence of repeated signals, we make use of newswire as an additional data source. Unconfirmed (novel) information with respect to the news articles is considered an indication of rumours. Additionally, we introduce pseudo feedback, which assumes that documents similar to previous rumours are more likely to be rumours themselves. Comparison with other real-time approaches shows that novelty-based features in conjunction with pseudo feedback perform significantly better when detecting rumours immediately after their publication.<\/p>\n<p>[\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2017\/03\/SpottingRumours.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; button_icon=&#8221;%%76%%&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;] <strong>Twitter-scale New Event Detection via K-term Hashing<\/strong> [\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner padding_mobile=&#8221;off&#8221; column_padding_mobile=&#8221;off&#8221; parallax_method_1=&#8221;off&#8221; parallax_method_2=&#8221;off&#8221; 
admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_text admin_label=&#8221;Text&#8221;] First Story Detection is hard because the most accurate systems become progressively slower with each document processed. We present a novel approach to FSD, which operates in constant time\/space and scales to very high volume streams. We show that when computing novelty over a large dataset of tweets, our method performs 192 times faster than a state-of-the-art baseline without sacrificing accuracy. Our method is capable of performing FSD on the full Twitter stream on a single core of modest hardware. [\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; column_padding_mobile=&#8221;off&#8221; parallax=&#8221;off&#8221; parallax_method=&#8221;off&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/twitterscaleNewEventDetectionViaKtermHashing.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; button_icon=&#8221;%%76%%&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/kTermHashingEmnlpPoster.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download poster&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;] <strong>Tracking 
unbounded Topic Streams<\/strong> [\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; parallax=&#8221;off&#8221; parallax_method=&#8221;on&#8221;][et_pb_text admin_label=&#8221;Text&#8221;] Tracking topics on social media streams is non-trivial as the number of topics mentioned grows without bound. This complexity is compounded when we want to track such topics against other fast moving streams. We go beyond traditional small scale topic tracking and consider a stream of topics against another document stream. We introduce two tracking approaches which are fully applicable to true streaming environments.\u00a0When tracking 4.4 million topics against 52 million documents in constant time and space, we demonstrate that counter to expectations, simple single-pass clustering can outperform locality sensitive hashing for nearest neighbour search on streams. 
[\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; parallax=&#8221;off&#8221; parallax_method=&#8221;on&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/trackingUnboundedTopicStreams.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; specialty=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_column type=&#8221;1_3&#8243;][et_pb_text admin_label=&#8221;Text&#8221;] <strong>Randomized Relevance Model<\/strong> [\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;2_3&#8243; specialty_columns=&#8221;2&#8243;][et_pb_row_inner admin_label=&#8221;Row&#8221;][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; parallax=&#8221;off&#8221; parallax_method=&#8221;on&#8221;][et_pb_text admin_label=&#8221;Text&#8221;] Relevance Models are well known retrieval models and capable of producing competitive results. However, because they use query expansion they can be very slow. We address this slowness by incorporating two variants of locality sensitive hashing (LSH) into the query expansion process. Results on two document collections suggest that we can obtain large reductions in the amount of work, with a small reduction in effectiveness. Our approach is shown to be additive when pruning query terms. 
[\/et_pb_text][\/et_pb_column_inner][et_pb_column_inner type=&#8221;1_2&#8243; saved_specialty_column_type=&#8221;2_3&#8243; parallax=&#8221;off&#8221; parallax_method=&#8221;on&#8221;][et_pb_button button_url=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/randomizedRelevanceModel.pdf&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;download paper&#8221; button_alignment=&#8221;right&#8221; admin_label=&#8221;Button&#8221; background_color=&#8221;#7EBEC5&#8243;] [\/et_pb_button][\/et_pb_column_inner][\/et_pb_row_inner][\/et_pb_column][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; background_color=&#8221;#ffffff&#8221; background_video_mp4=&#8221;https:\/\/www.wurzer.com\/dominikweb\/wp-content\/uploads\/2016\/07\/razer.mp4&#8243; parallax=&#8221;on&#8221; parallax_method=&#8221;off&#8221; padding_mobile=&#8221;off&#8221; fullwidth=&#8221;on&#8221; admin_label=&#8221;Section&#8221;][et_pb_fullwidth_header title=&#8221;Dominik Wurzer&#8221; subhead=&#8221;PhD student at the University of Edinburgh&#8221; background_layout=&#8221;dark&#8221; header_fullscreen=&#8221;on&#8221; background_overlay_color=&#8221;rgba(12,113,195,0)&#8221; parallax=&#8221;on&#8221; admin_label=&#8221;Fullwidth Header&#8221;] I am a PhD student at the University of Edinburgh, focusing on information retrieval on high-volume data streams and applied machine learning. 
[\/et_pb_fullwidth_header][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; background_color=&#8221;#fcfcfc&#8221; padding_mobile=&#8221;off&#8221; admin_label=&#8221;Section&#8221;][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; fullwidth=&#8221;on&#8221;][et_pb_fullwidth_header title=&#8221;\u00a0&#8211; Publications [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"page-template-blank.php","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"class_list":["post-37","page","type-page","status-publish","hentry"],"amp_enabled":false,"_links":{"self":[{"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/pages\/37","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=37"}],"version-history":[{"count":39,"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/pages\/37\/revisions"}],"predecessor-version":[{"id":100,"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=\/wp\/v2\/pages\/37\/revisions\/100"}],"wp:attachment":[{"href":"https:\/\/www.wurzer.com\/dominikweb\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=37"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}