Assignment 2.0 (Due 23:59 CDT on Sep 27, 2021) - CS 242 - Illinois Wiki
https://wiki.illinois.edu/wiki/pages/viewpage.action?pageId=616243854
Pages / Home / Assignments
Created by Wang, Ren-Jay; last modified by Liu, Miranda on Aug 23, 2021

Assignment 2.0 - Building a Digital Library

Overview

Over the next two weeks, you will build a Digital Library. This week the goal is to gather data and store it properly, so the assignment has three major parts: (1) scraping data, (2) cleaning and processing the data, and (3) storing the data. We will primarily use Python 3 (3.5+); if you want to use an alternative language, please email the TAs for approval. You do not need a peer review for this assignment, so merge the branch to master while keeping the source branch.

Motivation and Goals

There are many methods of data collection in the rapidly evolving world of information technology, but web scraping is among the most popular and accurate. In layman's terms, web scraping is the act of using bots to extract specific content and data from a website. Web scraping is especially useful because it can convert non-tabular, poorly structured data into something consistent in both format and content, and it can acquire data that was previously inaccessible. However, web scraping is not about mere acquisition: it can also help you track changes, analyze trends, and keep tabs on patterns in specific fields. The purpose of this assignment is to introduce you to a real-world application of web scraping and to get you thinking about the creative process that accompanies the tasks you are assigned.
You will encounter problems like these both in this assignment and when you break into industry workplaces, so keep that in mind as you work. Web scraping may be the focus of this particular assignment, but it may well be an approach you use in real life in the future. For this practice assignment, we will use Goodreads as our web source. Goodreads is a website that collects information on books as well as reviews from the community; think of it as an IMDb for books. In this assignment, we will not scrape reviews or users, as that is disallowed by the site's robots.txt. Also, please avoid requesting a large volume of pages in a short period of time when writing the scraper.

Programming Language and IDE

We will be using Python for this assignment. For other languages (e.g., Ruby, Go, JavaScript), please propose equivalent tools that fulfill the criteria in the rubric to the TAs for approval. Suggested IDEs:

- PyDev for Eclipse
- PyCharm
- Visual Studio Code (with plugins)

Part 0: Reading

Before you begin web scraping, make sure to read the following links:

https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
https://www.promptcloud.com/blog/dont-get-blacklisted-legitimate-web-scraping-process
https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/

To avoid harming the website, please avoid extensive scraping. Be a responsible scraper!

Part I: Web Scraping

Your program should be able to:

Gather information on Authors and Books from Goodreads. You should gather information from a large number of book pages (>200) and author pages (>50); that is, you need to store information on at least 200 books and 50 authors.
You should not go beyond 2000 books or authors. The starting page must be a book page, and it should be a variable rather than hard-coded (for example, starting from Clean Code: https://www.goodreads.com/book/show/3735293-clean-code). The order of traversal does not matter: for example, you can find the next books to scrape by visiting all books written by the same authors, or you can follow the similar books listed on the Goodreads website.

Report progress and exceptions (e.g., books without an ISBN). There is no restriction on how this is implemented.

Represent Books and Authors efficiently. There is no single required structure for this assignment, and there are no type constraints on fields. However, you are required to scrape at least the following information.

For Authors, you need to store:

- name: name of the author
- author_url: the page URL
- author_id: a unique identifier of the author
- rating: the rating of the author
- rating_count: the number of ratings the author received
- review_count: the number of comments the author received
- image_url: a URL of the author's image
- related_authors: a list of authors related to the author
- author_books: a list of books by the author

For Books, you need to store:

- book_url: URL of the book
- title: name of the book
- book_id: a unique identifier of the book
- ISBN: the ISBN of the book
- author_url: URL of the author of the book
- author: author of the book
- rating: the rating of the book
- rating_count: the number of ratings the book received
- review_count: the number of comments the book received
- image_url: a URL of the book's image
- similar_books: a list of books similar or related to the book

You cannot use the Goodreads API, nor existing scrapers for Goodreads on GitHub.
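The required fields above map naturally onto small record types. Below is a minimal sketch in Python using a dataclass (available in Python 3.7+); the field types, the to_dict helper, and the placeholder values are illustrative design choices, not part of the assignment spec.

```python
# Sketch of one possible in-memory representation of a scraped book.
# Field names mirror the assignment spec; types are a design choice.
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class Book:
    book_url: str
    title: str
    book_id: str
    isbn: Optional[str]   # some books have no ISBN; report these as exceptions
    author_url: str
    author: str
    rating: float
    rating_count: int
    review_count: int
    image_url: str
    similar_books: List[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Flatten to a plain dict for JSON export or a database insert."""
        return asdict(self)

# Placeholder values for illustration only (not real scraped data).
book = Book(
    book_url="https://www.goodreads.com/book/show/3735293-clean-code",
    title="Clean Code", book_id="3735293", isbn=None,
    author_url="https://www.goodreads.com/author/show/0000000",
    author="Robert C. Martin", rating=4.4, rating_count=1000,
    review_count=100, image_url="https://example.com/cover.jpg",
)
```

A matching Author dataclass would follow the same pattern with the author fields listed above.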
You are allowed to use generic scraping libraries such as BeautifulSoup (Documentation) or Scrapy. Your web scraper should not run for an extended period of time (e.g., over half an hour).

Important Notice

Be aware of the number of requests you make when designing the scraper: servers have protective mechanisms that prevent users from abusing them. Some general ways to avoid getting blocked include, but are not limited to: (1) use fewer than 5 books and authors while testing and designing your scraper; (2) download the HTML of the relevant pages first and finish the parser against those files; (3) make use of try/except and record failed instances (for instance, with logging). You will be in a bad position to finish this assignment if you get blocked.

Part II: Data Storage in an External Database

Very often, we do not want to manually manage huge files or keep large data structures in memory. Instead, we can use a database that resides in the cloud (or on a server), much like a repository. In this part of the assignment, you will take the scraper you built in Part I and store the data in a database. There is no constraint on the type of database you use; here are some suggestions:

- MongoDB (including MongoDB Atlas) + PyMongo
- Firebase
- SQLite
- MySQL

Your program should be able to store your data into the database while scraping; storing everything after scraping is done is not allowed. Again, you can use any database for this task. Some helpful resources: On MongoDB, On Firebase.

Part III: Command Line Interface

To make the scraper more versatile, we can write a simple command-line interface that allows other users to configure the behavior of the program.
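Part II's "store while scraping" requirement can be sketched with SQLite, one of the suggested databases. In this hypothetical example, each record is written the moment it is scraped rather than batched at the end; the table schema and the save_book helper are assumptions for illustration, not a mandated design.

```python
# Sketch: write each scraped record to the database immediately.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real file path or server in practice
conn.execute("""CREATE TABLE IF NOT EXISTS books (
    book_id TEXT PRIMARY KEY, title TEXT, rating REAL)""")

def save_book(record: dict) -> None:
    """Upsert a single book as soon as the scraper produces it."""
    conn.execute(
        "INSERT OR REPLACE INTO books (book_id, title, rating) VALUES (?, ?, ?)",
        (record["book_id"], record["title"], record["rating"]),
    )
    conn.commit()

# Stand-in for the Part I scraper loop: call save_book inside the loop,
# never after all scraping has finished.
for record in [{"book_id": "1", "title": "Clean Code", "rating": 4.4}]:
    save_book(record)

rows = conn.execute("SELECT title FROM books").fetchall()
```

The same shape works with PyMongo or Firebase: the important point is that the write happens inside the scraping loop.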
In this assignment, you are only required to take user input through command-line arguments (by the next assignment, however, you will need to make the program interactive). Python, for instance, has several built-in and external packages that assist with this, such as the argparse library, which handles simple preprocessing of the arguments a user provides to the program.

Your program should be able to:

- Accept any valid starting URL: check that the URL is valid, that it points to Goodreads, and that it plausibly represents a book.
- Accept an arbitrary number of books and authors to scrape: print a warning for numbers greater than 200 books or 50 authors.
- Read from JSON files to create new books/authors or update existing ones: print an error for an invalid JSON file (e.g., a syntax error) or a malformed data structure (e.g., the JSON is an array, or the object has no id). Discuss your design choices during discussion sections. Print which entries are updated or created.
- Export existing books/authors into JSON files: the output must be valid JSON.

What is JSON?

"JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate." - json.org

JSON is most commonly used in web applications as a lightweight format to transfer data asynchronously between a client and a server. For instance, Facebook transmits status updates to your newsfeed using JSON so that new posts appear without reloading the page. But JSON is not limited to web applications: parsers for JSON are available for just about every language out there. See http://www.json.org for parsers for some of the most common programming languages.
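The JSON-import validation described above (reject syntax errors, top-level arrays, and objects missing an id) can be sketched with Python's built-in json module. The error messages and the "book_id" key below are illustrative choices; your own schema may differ.

```python
# Sketch: validate a JSON payload before creating/updating a book.
import json

def load_book_json(text: str):
    """Return (record, None) on success or (None, error_message) on failure."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:      # invalid JSON (syntax error)
        return None, f"invalid JSON: {exc}"
    if not isinstance(data, dict):           # e.g., the JSON is an array
        return None, "malformed data: top-level value must be an object"
    if "book_id" not in data:                # object has no id
        return None, "malformed data: object has no book_id"
    return data, None

record, err = load_book_json('{"book_id": "3735293", "title": "Clean Code"}')
```

Exporting is the inverse: json.dumps on the stored records always yields valid JSON, satisfying the export requirement.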
It would be wise to pick a language for this assignment that already has a well-developed parser. For instance, JSON parsing is built into Python's standard library. You can read more about JSON on Wikipedia.

Part IV: Miscellaneous Requirements

Testing

As usual, we require that you write extensive unit tests for each part of this assignment. We understand that web scraping can be difficult to test; however, make sure to exhaustively test all other parts of your code. If your language does not have a testing framework (unlikely), you will need to implement your own test runner and utilities. To test your web scraper, your moderator will ask you to scrape a book page of their choice in section, so be prepared. You should also demo your data storage. Please use Python's standard unittest library for testing; you are welcome to use other Python testing tools to implement the tests.

Linting

A linter is a tool that checks for programming errors, bugs, stylistic errors, and so on. Python has a comprehensive style standard, and many linters have been developed to help us understand how we write our code. In this task, we ask you to use one or more linters of your choice to help you follow the PEP 8 standard. Most IDEs have plugins that support these linters, and it is best to provide a linter report for your project. Moderators will use pylint to grade this task. A tutorial is here: http://pylint.pycqa.org/en/latest/tutorial.html

(Screenshot of part of the pylint report)

Environment Variables

Since we are using a database in this assignment, these systems will likely require some sort of username and password for read/write permissions.
It is always a bad idea to leave this information inside your code, let alone commit it to git. Environment variables are values stored outside of your code, like a key or secret, and most languages support environment variables or a .env file. In this task, you are required to handle environment variables using external packages; again, there is no limitation on which library you use to read them. Some resources to complete this task: an easy-to-read tutorial on using dotenv, and a demonstration of multiple methods.

Part V: Bonus (1-2 pt)

This is a 1-2 point bonus toward the overall assignment. Use a network library to build a graph demonstrating the relationships between Authors and Books. For instance, https://networkx.github.io/ provides some ready-to-use elements.

Summary

Reading

Code Complete Chapter 7: High-Quality Routines
Optional: The Pragmatic Programmer Chapter 5-26: Decoupling and the Law of Demeter

Submission

Moderators are asked to grade style according to the PEP 8 standard, so please follow it. If you follow an alternative style guide instead, please provide it to the moderator and TAs and demonstrate that you are following that convention. This requirement is meant to keep grading consistent across sections. (Alternatives: Airbnb Ruby Style Guide, Airbnb JavaScript Style Guide.)

Please use Python 3.5+. This assignment is due at 23:59 CDT on Sep 27, 2021. Be sure to submit on GitLab, grant the correct access, and ask your moderator or TA before the deadline if you have any questions. Make sure you follow the repo naming conventions listed on Piazza; you have to create a new repository for this assignment. Create a branch named assignment-2.0 and merge it back to master (while keeping the branch).

Objectives

The readings are due before lectures every Friday.
- Learn about responsible web scraping
- Learn to work with data in JSON format
- Learn to work with databases
- Learn how to effectively decompose your code into reusable modules
- Learn to use different tools to assist in writing better code

Resources

Python

Grading (see Grading Policy for full rubric)

Basic Requirements (Total: 22)

Code Submission (3 pts): Same as listed in the Grading Policy. DO NOT commit env files.

Decomposition/Overall Design (4 pts), -1 pt per infraction:
- Any piece of the project must execute its core functionality well.
- The project should be well decomposed into classes and methods, and adequately decomposed into abstract classes and classes.
- Abstract classes and methods with functionality shared between different parts should be generalized and reusable.
- Functions are not duplicated, and you make an effort to ensure that duplicate pieces of code are refactored into methods.

Style Guide (3 pts): Follow the PEP 8 guideline up to the Comments section (anything up to where the link is).

Functional Requirements (Total: 15)

Web Scraping (5 pts):
- 0 pt: Lacks any form of a web scraper, or uses a non-scraping library to fetch information
- -1 pt: Web scraper only semi-functions (may crash on certain web pages), or does not handle exceptions and errors gracefully
- +1 pt: Scraped some Authors
- +1 pt: Scraped some Books
- +1 pt: Complete Books requirement (>200)
- +1 pt: Complete Authors requirement (>50)
- +1 pt: Report scraping progress and errors effectively
- -1 pt: Scraping >2000 in one go; your program should stop early
Database Setup (2.5 pts):
- +0.5 pt: System set up with a database
- +1 pt: System can write to the database without errors
- +1 pt: System is connected to the database and can read from it without errors

Command Line Interface (5.5 pts):
- +0.5 pt: Configurable starting URL
- +0.5 pt: Starting URL error checking
- +0.5 pt: Configurable book and author counts
- +0.5 pt: Counts error checking
- +2 pt: JSON to database for books and authors, supporting create and update
- +0.5 pt: JSON to DB error checking
- +1 pt: Database to JSON
- -0.5 pt: Invalid JSON output

Linter (1 pt): Shows a score of 8.5/10 or above in the pylint report.

Environment Variables (1 pt): Use environment variables for the database and other sensitive/dynamic values.

Testing Requirements (Total: 10)

Unit Tests (5 pts):
- 0 pts: No unit tests, or not using Python testing libraries/tools
- For every 2 unit tests, gain 1 point (at most 5 points)
- -1 pt per infraction: a single unit test tests multiple cases or unrelated things; obvious test cases or edge cases are missing; tests are hard to understand (the test name should indicate the purpose of the test, correctness should be easy to verify, and significant code smells make a test hard to understand)

Manual Test Plan (5 pts):
- 0 pts: No manual test plan
- 1 pt: Test plan includes only environment setup OR scenario descriptions
- 2 pts: Test plan contains only some content and can be further improved (~8 pages)
- 4 pts: Test plan contains most of the content (~10 pages)
- 5 pts: Well-composed test plan covering all aspects of the system (~12 pages)

Bonus (Total: 3)

Network (2 pts):
- +1 pt: Construct the network and display it in the terminal
- +1 pt: Visualize a static image of the network

CI (1 pt): Use continuous integration on GitLab for testing.
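As a closing illustration of the Command Line Interface rubric items (configurable starting URL with error checking, plus count warnings), here is a sketch using the standard argparse library. The flag names, the looks_like_book_url helper, and the warning wording are hypothetical choices, not part of the assignment spec.

```python
# Sketch: parse and validate CLI arguments for the scraper.
import argparse
from urllib.parse import urlparse

def looks_like_book_url(url: str) -> bool:
    """Valid scheme, a Goodreads host, and a /book/show/ path."""
    parts = urlparse(url)
    return (parts.scheme in ("http", "https")
            and parts.netloc in ("goodreads.com", "www.goodreads.com")
            and parts.path.startswith("/book/show/"))

parser = argparse.ArgumentParser(description="Goodreads scraper (sketch)")
parser.add_argument("--url", required=True, help="starting book page URL")
parser.add_argument("--books", type=int, default=200)
parser.add_argument("--authors", type=int, default=50)

# Parsing an explicit list here for demonstration; a real run uses sys.argv.
args = parser.parse_args(
    ["--url", "https://www.goodreads.com/book/show/3735293-clean-code",
     "--books", "250"])

if not looks_like_book_url(args.url):
    raise SystemExit("error: starting URL is not a Goodreads book page")

warnings = []
if args.books > 200:
    warnings.append("more than 200 books requested")
if args.authors > 50:
    warnings.append("more than 50 authors requested")
```

In the real program the warnings would be printed and scraping would proceed; an invalid URL exits with an error, satisfying the URL error-checking item.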