


© https://data-science-group.github.io/

COMP8210/COMP7210
Big Data Technologies

Assignment 2

Semester 2, 2021
Macquarie University, Department of Computing

Due: 26 September 2021 (Sunday) at 5pm
Total Mark: 100
Weighting: 25%

This Assessment Task relates to the following Learning Outcomes:
• Obtain a high level of technical competency in standard and advanced methods for big data
technologies
• Understand the current status of and recognize future trends in big data technologies
• Develop competency with emerging big data technologies, applications, and tools

Background. Social data analytics has become a vital asset for organizations and governments. For example,
over the last few years, governments have started to extract knowledge and derive insights from vastly growing
open/social data to personalize election advertisements, improve government services, predict
intelligence activities, and improve national security and public health. A key challenge in analyzing
social data is to transform the raw data generated by social actors into curated data, i.e., contextualized data
and knowledge that is maintained and made available for use by end-users and applications.
In this assignment you will explore Big Data Technologies for analysing the data generated on social networks.

Reference. Beheshti et al., "DataSynapse: A Social Data Curation Foundry". Distributed and Parallel Databases
37(3): 351-384 (2019). Download: https://doi.org/10.1007/s10619-018-7245-1

Dataset. The Twitter dataset, including 10k tweets, is available on iLearn.
Twitter¹ serves many objects as JSON², including Tweets and Users. These objects all encapsulate core
attributes that describe the object. Each Tweet has an author, a message, a unique ID, a timestamp of when it
was posted, and sometimes geo metadata shared by the user. Each User has a Twitter name, an ID, a number
of followers, and most often an account bio.
With each Tweet, Twitter generates 'entity' objects, which are arrays of common Tweet contents such as
hashtags, mentions, media, and links. If there are links, the JSON payload can also provide metadata such as
the fully unwound URL and the webpage’s title and description.

1 https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
2 JSON is based on key-value pairs, with named attributes and associated values. These attributes, and their state, are used to describe objects.
So, in addition to the text content itself, a Tweet can have over 140 attributes associated with it. Let’s start
with an example Tweet:

The following JSON illustrates the structure for these objects and some of their attributes:

Source: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview
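
As a rough illustration of the structure described above, the sketch below parses a small, made-up Tweet payload and pulls out a few core attributes. The sample values are invented for illustration, only a small subset of attributes is shown, and the field names are assumed to follow the v1.1 data dictionary referenced above; check them against the dataset on iLearn before relying on them.

```python
import json

# A made-up payload for illustration only; real Tweets carry many more attributes.
SAMPLE_TWEET = """
{
  "created_at": "Wed May 03 10:15:00 +0000 2017",
  "id_str": "1234567890",
  "text": "Great talk on #BigData by @example_user in Sydney https://t.co/abc",
  "lang": "en",
  "user": {
    "id_str": "42",
    "screen_name": "demo_account",
    "followers_count": 128,
    "description": "Hypothetical account bio",
    "location": "Sydney, Australia"
  },
  "entities": {
    "hashtags": [{"text": "BigData"}],
    "user_mentions": [{"screen_name": "example_user"}],
    "urls": [{"expanded_url": "https://example.com/talk"}]
  }
}
"""


def summarize(tweet: dict) -> dict:
    """Pull a few core attributes out of a parsed Tweet object."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "tweet_id": tweet.get("id_str"),
        "author": user.get("screen_name"),
        "followers": user.get("followers_count"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "links": [u["expanded_url"] for u in entities.get("urls", [])],
    }


tweet = json.loads(SAMPLE_TWEET)
summary = summarize(tweet)
print(summary)
```

Using `.get` with defaults keeps the helper tolerant of Tweets that lack optional attributes such as `entities`.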

Part I. Extraction (30%)
You will use a technology to provide customizable feature extraction that harnesses the desired features from each
tweet. These features should include:
• Schema-based features (10%). This category is related to the properties of a social item. For example,
according to the Twitter schema, a tweet may have attributes such as text, source and language; and a user
may have attributes such as username, description and timezone.
• Lexical-based features (10%). This category is related to:
  o the words or vocabulary of a language, such as keywords, topics, phrases, abbreviations, special
  characters (e.g., ‘#’ in a tweet), slang, informal language and spelling errors.
  o entities that can be extracted by the analysis and synthesis of natural language (NL) and speech,
  such as part-of-speech (e.g., verb, noun, etc.), named entity type (e.g., person, organization, product,
  etc.), and named entity (i.e., an instance of an entity type, such as ‘Malcolm Turnbull’ as an instance
  of entity type Person).
• Time-based features (5%). This category is related to the mentions of time in the schema of the item (e.g.,
‘tweet.Timestamp’ and ‘user.TimeZone’ in Twitter) or in the content of the social media posts (e.g., in
Twitter the text of a tweet may contain ‘3 May 2017’).
• Location-based features (5%). This category is related to the mentions of locations in the schema of the
item (e.g., in Twitter ‘tweet.GEO’ and ‘user.Location’) or in the content of the social media posts (e.g.,
in Twitter the text of a tweet may contain ‘Sydney’, a city in Australia).
Notice: Your solution can be implemented using the technologies from Microsoft that you have learned about
during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to
develop your solution. You can reuse any existing libraries.
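
The four feature categories in Part I could be sketched in Python as below. This is a minimal standard-library illustration, not a reference solution: the regular expressions only approximate lexical and time mentions, `KNOWN_PLACES` is a hypothetical hand-written gazetteer standing in for a proper named-entity recognizer, and the `created_at` format is assumed to follow the v1.1 Tweet object.

```python
import re
from datetime import datetime

# Hypothetical gazetteer; a real solution would rely on a NER tool or service.
KNOWN_PLACES = {"sydney", "melbourne", "canberra"}

# Matches dates written like "3 May 2017" (day, month name, year).
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}\b"
)


def extract_features(tweet: dict) -> dict:
    """Group a tweet's features into the four categories of Part I."""
    text = tweet.get("text", "")
    user = tweet.get("user", {})
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "schema": {
            "text": text,
            "lang": tweet.get("lang"),
            "username": user.get("screen_name"),
        },
        "lexical": {
            "hashtags": re.findall(r"#(\w+)", text),
            "mentions": re.findall(r"@(\w+)", text),
        },
        "time": {
            # Assumed timestamp layout, e.g. "Wed May 03 10:15:00 +0000 2017".
            "posted_at": datetime.strptime(
                tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
            ),
            "dates_in_text": DATE_PATTERN.findall(text),
        },
        "location": {
            "user_location": user.get("location"),
            "places_in_text": [w for w in words if w.lower() in KNOWN_PLACES],
        },
    }


# Made-up tweet used only to exercise the function.
sample = {
    "created_at": "Wed May 03 10:15:00 +0000 2017",
    "text": "Meeting @example_user in Sydney on 3 May 2017 #BigData",
    "lang": "en",
    "user": {"screen_name": "demo_account", "location": "Sydney, Australia"},
}
features = extract_features(sample)
print(features["lexical"], features["time"]["dates_in_text"])
```

Keeping each category under its own key makes the extraction customizable: a caller can switch categories on or off, or swap the regex-based parts for an NLP library, without changing the output shape.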

Part II. Enrichment (20%)
You will use a technology to provide customizable feature enrichment that gives higher separation and
discrimination among patterns in different classes in machine learning problems. The enrichment should
include:
• Lexical-based Semantics (10%). You will leverage knowledge sources such as WordNet³ to
enrich Lexical-based features with their Synonyms, Stems, Hypernyms, Hyponyms, and more.
• NL-based Semantics (10%). You will leverage knowledge sources such as WikiData⁴, Google-KG⁵,
and DBPedia⁶ to enrich Natural-Language-based features with similar and related entities.
For example, ‘Malcolm Turnbull’⁷ is similar to ‘Tony Abbott’⁸ (they both served as the prime
minister of Australia) but ‘Malcolm Turnbull’ is related to ‘University of Sydney’⁹ (the university
he attended and graduated from).
Notice: Your solution can be implemented using the technologies from Microsoft that you have learned about
during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to
develop your solution. You can reuse any existing libraries.
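
The shape of the two enrichment steps in Part II can be sketched as follows. Everything here is a toy stand-in: `TOY_LEXICON` is a hand-written dictionary playing the role of WordNet, `TOY_TRIPLES` is a three-fact knowledge graph built from the Turnbull/Abbott example above with hypothetical predicate names, and `naive_stem` is a crude suffix stripper rather than a real stemmer. An actual solution would query the knowledge sources themselves.

```python
# Toy stand-in for a lexical knowledge source such as WordNet; the entries are
# hypothetical and hand-written purely to show the shape of the enrichment.
TOY_LEXICON = {
    "car": {
        "synonyms": ["automobile", "motorcar"],
        "hypernyms": ["vehicle"],
        "hyponyms": ["taxi", "sedan"],
    },
    "happy": {"synonyms": ["glad", "joyful"], "hypernyms": [], "hyponyms": []},
}

# Toy stand-in for a knowledge graph; predicate names are hypothetical.
TOY_TRIPLES = [
    ("Malcolm Turnbull", "position_held", "Prime Minister of Australia"),
    ("Tony Abbott", "position_held", "Prime Minister of Australia"),
    ("Malcolm Turnbull", "educated_at", "University of Sydney"),
]


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a real stemmer is far more careful."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def enrich_keyword(keyword: str, lexicon: dict = TOY_LEXICON) -> dict:
    """Attach stem, synonyms, hypernyms and hyponyms to a lexical feature."""
    key = keyword.lower()
    stem = naive_stem(key)
    entry = lexicon.get(key) or lexicon.get(stem) or {}
    return {
        "keyword": keyword,
        "stem": stem,
        "synonyms": entry.get("synonyms", []),
        "hypernyms": entry.get("hypernyms", []),
        "hyponyms": entry.get("hyponyms", []),
    }


def similar_entities(entity: str, triples=TOY_TRIPLES) -> list:
    """Entities sharing at least one (predicate, object) pair with `entity`."""
    facts = {(p, o) for s, p, o in triples if s == entity}
    return sorted({s for s, p, o in triples if s != entity and (p, o) in facts})


def related_entities(entity: str, triples=TOY_TRIPLES) -> list:
    """Entities directly linked to `entity` by some triple."""
    linked = {o for s, p, o in triples if s == entity}
    linked |= {s for s, p, o in triples if o == entity}
    return sorted(linked)


enriched = enrich_keyword("cars")
similar = similar_entities("Malcolm Turnbull")
related = related_entities("Malcolm Turnbull")
print(enriched, similar, related)
```

In this sketch "similar" means two entities share a fact (both held the same position), while "related" means one entity appears in a fact about the other, which mirrors the distinction drawn in the example above.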

Part III. Analytics and Visualization (50%)
You will use a technology to provide customizable visualization that facilitates understanding of trends, outliers,
and patterns in data. The visualisation should include:
• A) Classification: Sentiment Analysis and Visualization (25%). Sentiment analysis is the
automated process of identifying emotions in text. It helps you quickly make sense of opinions ‒ like those in social
media posts, surveys, product reviews, and support conversations ‒ and understand how customers
feel about your business.
Your Task: Leverage an existing sentiment analysis algorithm to classify Tweets into 3 classes (Positive,
Negative, and Neutral), and visualize your sentiment analysis. You will get 15% for discussing the quality
of the classification results, and 10% for the visualization.
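
The three-class labelling in Task A can be illustrated with a deliberately simple lexicon-based scorer. This is only a sketch of the classification and counting steps: the `POSITIVE` and `NEGATIVE` word lists are tiny and hand-picked, the sample tweets are invented, and the task itself asks for an existing sentiment analysis algorithm, which would replace `classify` here.

```python
import re
from collections import Counter

# Tiny hand-picked word lists, purely illustrative; an existing sentiment
# library or cloud service would replace this scoring step in practice.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}


def classify(text: str) -> str:
    """Label text as Positive, Negative, or Neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


# Invented example tweets.
tweets = [
    "I love this great phone",
    "Terrible service, I hate waiting",
    "The meeting starts at 5pm",
]
labels = [classify(t) for t in tweets]
distribution = Counter(labels)

# Text bar chart as a placeholder for a real plot of the class distribution.
for label in ("Positive", "Negative", "Neutral"):
    print(f"{label:<8} {'#' * distribution[label]}")
```

The per-class counts in `distribution` are the kind of summary a bar or pie chart would show; comparing the labels against a manually annotated sample is one way to support the discussion of classification quality.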

3 https://wordnet.princeton.edu/
4 https://www.wikidata.org/
5 https://developers.google.com/knowledge-graph/
6 http://wiki.dbpedia.org/
7 https://en.wikipedia.org/wiki/Malcolm_Turnbull
8 https://en.wikipedia.org/wiki/Tony_Abbott
9 https://en.wikipedia.org/wiki/University_of_Sydney
• B) Clustering: Similarity Computation and Visualization (25%). Twitter users are likely to
generate similar tweets, e.g., about some popular topics. By clustering similar tweets together, we can
generate a more concise and organized representation of the raw tweets, which will be very useful for
many Twitter-based applications (e.g., truth discovery, trend analysis, search ranking, etc.).
Your Task: Leverage an existing tweet clustering function (e.g., using the Jaccard Distance metric and
K-means clustering algorithm) to group similar tweets into the same cluster. You should use the features
that you have extracted and enriched in Part I and Part II. Then visualise your similarity analysis. You
will get 15% for discussing the quality of the clustering results, and 10% for the visualization.
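
A minimal sketch of clustering tweets with the Jaccard distance and a K-means-style loop is shown below. It rests on several assumptions: tweets are reduced to sets of whitespace-separated tokens (in the assignment, the extracted and enriched features would take their place), and because a mean is not defined for sets, the update step picks the member with the smallest total distance to the rest of its cluster (a medoid-style variant). The seed indices are fixed so the run is deterministic.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def tokenize(text: str) -> set:
    """Stand-in feature extraction: lower-cased whitespace tokens."""
    return set(text.lower().split())


def cluster(token_sets, seed_indices, max_iter=20):
    """K-means-style clustering where each centroid is one of the tweets."""
    centroids = list(seed_indices)
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach every tweet to its nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda c: jaccard_distance(ts, token_sets[centroids[c]]),
            )
            for ts in token_sets
        ]
        # Update step: the new centroid is the member closest to all others.
        new_centroids = []
        for c in range(len(centroids)):
            members = [i for i, a in enumerate(assignment) if a == c]
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster's seed
                continue
            new_centroids.append(
                min(
                    members,
                    key=lambda i: sum(
                        jaccard_distance(token_sets[i], token_sets[j])
                        for j in members
                    ),
                )
            )
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Invented example tweets.
tweets = [
    "big data tools for social analytics",
    "social analytics with big data tools",
    "football match tonight in sydney",
    "sydney football match was great",
]
token_sets = [tokenize(t) for t in tweets]
assignment, centroids = cluster(token_sets, seed_indices=[0, 2])
print(assignment, centroids)
```

The resulting `assignment` list gives a cluster label per tweet, which can feed a scatter plot or grouped listing; the average within-cluster Jaccard distance is one simple quantity to report when discussing clustering quality.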

Notice: Your solution can be implemented using the technologies from Microsoft that you have learned about
during the lectures, or other technologies offered by IBM, AWS, etc. You also have the option to use Python to
develop your solution. You can reuse any existing libraries.

Evaluation and Marking

• This is an individual assignment worth 25%.
• Late penalty: 10% (out of 100%) of the mark per day late.
• Your assignment will be evaluated by the tutor independently.
• You will need to create a video (max 10 minutes) and upload it to YouTube. Then, share the YouTube
link in your assignment submission.
