© https://data-science-group.github.io/

COMP8210/COMP7210 Big Data Technologies
Assignment 2
Semester 2, 2021
Macquarie University, Department of Computing

Due: 26 September 2021 (Sunday) at 5pm
Total Mark: 100
Weighting: 25%

This Assessment Task relates to the following Learning Outcomes:
• Obtain a high level of technical competency in standard and advanced methods for big data technologies
• Understand the current status of and recognize future trends in big data technologies
• Develop competency with emerging big data technologies, applications, and tools

Background. Social data analytics have become a vital asset for organizations and governments. For example, over the last few years, governments started to extract knowledge and derive insights from vastly growing open/social data to personalize the advertisements in elections, improve government services, predict intelligence activities, as well as to improve national security and public health. A key challenge in analyzing social data is to transform the raw data generated by social actors into curated data, i.e., contextualized data and knowledge that is maintained and made available for use by end-users and applications. In this assignment you will explore Big Data Technologies for analysing the data generated on social networks.

Reference. Beheshti et al., "DataSynapse: A Social Data Curation Foundry". Distributed Parallel Databases 37(3): 351-384 (2019). Download: https://doi.org/10.1007/s10619-018-7245-1

Dataset. The Twitter dataset, including 10k tweets, is available on iLearn. Twitter [1] serves many objects as JSON [2], including Tweets and Users. These objects all encapsulate core attributes that describe the object. Each Tweet has an author, a message, a unique ID, a timestamp of when it was posted, and sometimes geo metadata shared by the user. Each User has a Twitter name, an ID, a number of followers, and most often an account bio.
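
As an illustration of the Tweet and User attributes just listed, the following is a minimal Python sketch that reads them from a single Tweet object. It assumes the Twitter API v1.1 field names (id_str, created_at, text, coordinates, and the nested user object); the sample record and the function name core_attributes are hypothetical, and the dataset on iLearn may be laid out differently.

```python
import json
from datetime import datetime

# Hypothetical sample record. The field names follow the Twitter API v1.1
# data dictionary, which is an assumption about how the dataset is stored.
SAMPLE = """{
  "id_str": "1050118621198921728",
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "text": "Hello from Sydney",
  "coordinates": null,
  "user": {
    "id_str": "6253282",
    "screen_name": "example_user",
    "followers_count": 42,
    "description": "An example bio"
  }
}"""

# Timestamp layout used by v1.1 Tweet objects, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def core_attributes(tweet):
    """Return the core Tweet and User attributes as one flat dictionary."""
    user = tweet.get("user") or {}
    created = tweet.get("created_at")
    return {
        "tweet_id": tweet.get("id_str"),
        "text": tweet.get("text"),
        # Missing timestamps are kept as None rather than guessed.
        "timestamp": datetime.strptime(created, TWITTER_TIME_FORMAT) if created else None,
        # Geo metadata is optional, so this is often None.
        "geo": tweet.get("coordinates"),
        "author": user.get("screen_name"),
        "user_id": user.get("id_str"),
        "followers": user.get("followers_count"),
        "bio": user.get("description"),
    }


features = core_attributes(json.loads(SAMPLE))
print(features["author"], features["followers"], features["timestamp"].isoformat())
```

Using dict.get throughout keeps the sketch tolerant of optional attributes such as geo metadata and the account bio, which are not present on every object.
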
With each Tweet, Twitter generates 'entity' objects, which are arrays of common Tweet contents such as hashtags, mentions, media, and links. If there are links, the JSON payload can also provide metadata such as the fully unwound URL and the webpage’s title and description.

[1] https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
[2] JSON is based on key-value pairs, with named attributes and associated values. These attributes, and their state, are used to describe objects.

So, in addition to the text content itself, a Tweet can have over 140 attributes associated with it. Let’s start with an example Tweet:

The following JSON illustrates the structure for these objects and some of their attributes:

Source: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview

Part I. Extraction (30%)

You will use a technology to provide a customizable feature extraction to harness desired features from each tweet. These features should include:

Schema-based features (10%). This category is related to the properties of a social item. For example, according to the Twitter schema, a tweet may have attributes such as text, source and language; and a user may have attributes such as username, description and timezone.

Lexical-based features (10%). This category is related to:
o the words or vocabulary of a language, such as keywords, topics, phrases, abbreviations, special characters (e.g., ‘#’ in a tweet), slang, informal language and spelling errors.
o entities that can be extracted by the analysis and synthesis of natural language (NL) and speech, such as part-of-speech (e.g., verb, noun, etc.), named entity type (e.g., person, organization, product, etc.), and named entity (i.e., an instance of an entity type, such as ‘Malcolm Turnbull’ as an instance of entity type Person).

Time-based features (5%).
This category is related to the mentions of time in the schema of the item (e.g., ‘tweet.Timestamp’ and ‘user.TimeZone’ in Twitter) or in the content of the social media posts (e.g., in Twitter the text of a tweet may contain ‘3 May 2017’).

Location-based features (5%). This category is related to the mentions of locations in the schema of the item (e.g., in Twitter ‘tweet.GEO’ and ‘user.Location’) or in the content of the social media posts (e.g., in Twitter the text of a tweet may contain ‘Sydney’, a city in Australia).

Notice: Your solution can be implemented using the technologies from Microsoft that you have learned during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to develop your solution. You can reuse any existing libraries.

Part II. Enrichment (20%)

You will use a technology to provide a customizable feature enrichment to provide higher separation and discrimination among patterns in different classes in machine learning problems. The enrichment should include:

Lexical-based Semantics (10%). You will leverage knowledge sources such as WordNet [3] to enrich Lexical-based features with their Synonyms, Stems, Hypernyms, Hyponyms, and more.

NL-based Semantics (10%). You will leverage knowledge sources such as WikiData [4], Google KG [5], and DBPedia [6] to enrich Natural-Language-based features with similar and related entities. For example, ‘Malcolm Turnbull’ [7] is similar to ‘Tony Abbott’ [8] (both served as Prime Minister of Australia), but ‘Malcolm Turnbull’ is related to ‘University of Sydney’ [9] (the university he attended and graduated from).

Notice: Your solution can be implemented using the technologies from Microsoft that you have learned during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to develop your solution. You can reuse any existing libraries.

Part III.
Analytics and Visualization (50%)

You will use a technology to provide a customizable visualization to facilitate understanding trends, outliers, and patterns in data. The visualisation should include:

A) Classification: Sentiment Analysis and Visualization (25%). Sentiment analysis is the automated process of identifying emotions in text. It can be used to quickly make sense of opinions, like those in social media posts, surveys, product reviews, and support conversations, and to understand how customers feel about a business.

Your Task: Leverage an existing sentiment analysis algorithm, classify Tweets into 3 classes (Positive, Negative, and Neutral), and visualize your sentiment analysis. You will get 15% for discussing the quality of the classification results, and 10% for the visualization.

[3] https://wordnet.princeton.edu/
[4] https://www.wikidata.org/
[5] https://developers.google.com/knowledge-graph/
[6] http://wiki.dbpedia.org/
[7] https://en.wikipedia.org/wiki/Malcolm_Turnbull
[8] https://en.wikipedia.org/wiki/Tony_Abbott
[9] https://en.wikipedia.org/wiki/University_of_Sydney

B) Clustering: Similarity Computation and Visualization (25%). Twitter users are likely to generate similar tweets, e.g., about some popular topics. By clustering similar tweets together, we can generate a more concise and organized representation of the raw tweets, which will be very useful for many Twitter-based applications (e.g., truth discovery, trend analysis, search ranking, etc.).

Your Task: Leverage an existing tweet clustering function (e.g., using the Jaccard Distance metric and K-means clustering algorithm) to group similar tweets into the same cluster. You should use the features that you have extracted and enriched in Part I and Part II. Then visualise your similarity analysis. You will get 15% for discussing the quality of the clustering results, and 10% for the visualization.
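
One way to read the clustering task above is as a K-means-style loop driven by the Jaccard distance. Because the mean of a collection of word sets is not defined, the sketch below swaps the centroid for a medoid (the tweet with the smallest total distance to the other tweets in its cluster), which makes it a k-medoids variant rather than textbook K-means. The function names, the whitespace tokenizer, the sample tweets, and the fixed seed indices are all assumptions for illustration; in the assignment, the token sets would come from the features extracted and enriched in Part I and Part II.

```python
def tokens(text):
    """Lower-case a tweet and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster(token_sets, seeds, max_iter=20):
    """Group token sets around medoids, starting from the given seed indices.

    Returns (assignment, centres): assignment[i] is the cluster number of
    item i, and centres[c] is the index of the tweet acting as centre c.
    """
    centres = list(seeds)
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach every tweet to its nearest centre.
        assignment = [
            min(
                range(len(centres)),
                key=lambda c: jaccard_distance(token_sets[i], token_sets[centres[c]]),
            )
            for i in range(len(token_sets))
        ]
        # Update step: the new centre is the member with the smallest
        # total distance to all members of the same cluster.
        new_centres = []
        for c in range(len(centres)):
            members = [i for i, label in enumerate(assignment) if label == c]
            if not members:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
                continue
            new_centres.append(
                min(
                    members,
                    key=lambda m: sum(
                        jaccard_distance(token_sets[m], token_sets[o]) for o in members
                    ),
                )
            )
        if new_centres == centres:
            break  # converged: centres no longer move
        centres = new_centres
    return assignment, centres


# Hypothetical sample tweets; seeds are fixed so the run is reproducible.
sample_tweets = [
    "big data is fun",
    "big data is hard",
    "i love sydney weather",
    "sydney weather is great",
]
labels, centres = cluster([tokens(t) for t in sample_tweets], seeds=[0, 2])
print(labels, centres)
```

The result depends on the initial seeds, so a discussion of clustering quality would normally compare several seed choices and several values of k.
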
Notice: Your solution can be implemented using the technologies from Microsoft that you have learned during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to develop your solution. You can reuse any existing libraries.

Evaluation and Marking
• This is an individual assignment worth 25%.
• Late penalties: 10% of the total mark (out of 100%) per day late.
• Your assignment will be evaluated by the tutor independently.
• You will need to create a video (max 10 minutes) and upload it to YouTube. Then, share the YouTube link in your assignment submission.