COMP30023 Computer Systems
Assignment 1 tips and suggestions

The following may be helpful when planning and writing your web crawler.
This is not part of the project specification. If any suggestion here contradicts the specification, then
follow the specification.
1. Scalability + Modularisation
It’s a good idea to make sure you’re building your project in a scalable way. Some suggestions to
help you achieve this:
1. Modularize code into several .c and corresponding .h files (e.g., one file for fetching files,
one for regular expressions, etc.). Keep in mind that having too many files may increase the
build complexity unnecessarily. One option is to start with one file and “refactor” into
multiple files as your project grows. This is easier if your first file groups functions together
in the way you are likely to split later. A suggestion might be:
• Every .c file (aside from perhaps the file with ‘main’) has a corresponding .h file
• .h files declare all the public functions and constants to be accessed by other
modules, and the corresponding .c file provides the implementations of these
functions and also private helper functions
• .h files should always have “include guards” (briefly read up on how to use these
“#ifndef” guards before you get started)
2. Use a makefile (this is actually a requirement! Your project needs to build when ‘make &&
./crawler’ is run). A good example to learn from / use is here:
http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/
3. Ensure that you test functions before assuming they work, especially considering the edge
cases that C can present (NULL pointers, unterminated strings etc.). You might consider a
separate file (with a different ‘main’ function) that executes tests, and compile that in a
separate rule in your makefile.
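As a minimal sketch of point 3, the code below tests a small helper against the edge cases mentioned (a NULL pointer and an unterminated string). The names `bounded_length` and `run_tests` are hypothetical and not part of the specification; in a real project the tests would live in their own file, whose ‘main’ simply calls `run_tests()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: length of s, inspecting at most max bytes.
 * Returns 0 for a NULL pointer and max if no terminator is found. */
size_t bounded_length(const char *s, size_t max)
{
    size_t n = 0;

    if (s == NULL) {
        return 0;
    }
    while (n < max && s[n] != '\0') {
        n++;
    }
    return n;
}

/* Edge-case tests: the program aborts at the first assertion that fails. */
void run_tests(void)
{
    const char unterminated[3] = { 'a', 'b', 'c' };   /* no '\0' at the end */

    assert(bounded_length(NULL, 10) == 0);            /* NULL pointer */
    assert(bounded_length("", 10) == 0);              /* empty string */
    assert(bounded_length("abc", 10) == 3);           /* ordinary string */
    assert(bounded_length(unterminated, sizeof unterminated) == 3);
    puts("all tests passed");
}
```

Using assert keeps the test file short; a larger project might prefer to count failures and report them all at once.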
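An example project structure might be:
MyProject/
    Makefile
    include/    (header .h files)
    src/        (source .c files)
    src/obj/    (object files produced by the compiler)
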
This is a common project structure, similar to the one used in the final makefile provided in the
reference above. The difference is that the makefile has been moved to the root of the project rather
than being located in /src, because the spec requires ‘make’ to be run from the root of the project.
This will require you to modify the Makefile accordingly! In this structure, your source .c files are
in a directory MyProject/src, headers are in a directory MyProject/include, and object files should
be placed by the compiler in the MyProject/src/obj folder.
If you were going to use a library (e.g., PCRE below), you might add a MyProject/lib folder to
this structure.
Good programming practice is always to encapsulate your code and only provide access to the
methods / constants that you wish to be ‘public’. Make sure you’re using “.h” (header) files for this
purpose!
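The sketch below illustrates the header / implementation split and the “#ifndef” guard. It is shown as one listing for brevity; in a real project the two parts would be separate files. The file names fetch.h and fetch.c, the constant MAX_URL_LENGTH and both function names are hypothetical.

```c
/* ---- fetch.h (hypothetical header): the public interface ---- */
#ifndef FETCH_H
#define FETCH_H

#define MAX_URL_LENGTH 1000   /* hypothetical public constant */

/* Public: returns 1 if url is a non-NULL http:// URL of acceptable
 * length, and 0 otherwise. */
int url_is_acceptable(const char *url);

#endif /* FETCH_H */

/* ---- fetch.c (hypothetical implementation) ---- */
/* In a separate file this would start with: #include "fetch.h" */
#include <assert.h>
#include <string.h>

/* Private helper: 'static' makes it invisible to other .c files. */
static int has_http_scheme(const char *url)
{
    assert(url != NULL);   /* internal invariant: callers check for NULL */
    return strncmp(url, "http://", 7) == 0;
}

int url_is_acceptable(const char *url)
{
    if (url == NULL || strlen(url) > MAX_URL_LENGTH) {
        return 0;
    }
    return has_http_scheme(url);
}
```

The guard means the header can be included from several places without its contents being processed twice in the same .c file, and the static helper cannot be called from other modules.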
2. Debugging + Testing with C
Memory Sanitizers + Memory Checkers
C requires manual memory management (there is no garbage collection!) and so programs written in
C are prone to memory leaks and segfaults.
Memory sanitisers can be useful tools in several ways:
1) If your program is crashing with a segfault or exhibiting strange behavior which you believe
might be memory related, you can use a memory sanitiser to hunt down the bug
2) Even if your program appears to be working correctly, more likely than not your program is
leaking memory (it is pretty likely that you forgot a free() at least one time somewhere).
There are several options you can choose from, including the Clang MemorySanitizer (which detects
reads of uninitialised memory, and requires compiling with Clang instead of gcc), or Valgrind. Another
option is to use gdb (see the section below) with the ‘backtrace full’ command, which can help you
trace memory related issues.
For MemorySanitizer, you’ll need to install Clang from your package manager (apt-get, or possibly
homebrew on mac) and then use the instructions on their site:
https://clang.llvm.org/docs/MemorySanitizer.html
Valgrind can also be installed from your package manager. You can use it in a basic form to check
for memory errors by running “valgrind ./myprogram”, where your program should have been compiled
with the ‘-g’ flag in gcc so that Valgrind can report source line numbers. Some extra functionality
you should certainly investigate is provided by the “--leak-check” and “--track-origins” options
(see ‘man valgrind’).
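To make the kind of leak these tools report concrete, here is a minimal sketch of an allocate / free pair. The helper names `copy_string` and `copied_length` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
 * the allocation fails. The caller owns the result and must free() it. */
char *copy_string(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;          /* +1 for the terminating '\0' */
    char *copy = malloc(n);

    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Returns the length of a temporary copy of s (0 if allocation fails). */
size_t copied_length(const char *s)
{
    char *copy = copy_string(s);

    if (copy == NULL) {
        return 0;
    }
    size_t n = strlen(copy);
    free(copy);                        /* without this line the copy leaks */
    return n;
}
```

If the free() call were deleted, the function would still return the right answer, which is why such leaks go unnoticed; running the program under “valgrind --leak-check=full” should then report the block as lost.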
Debuggers for C
Debugging in C is difficult, and using an fprintf statement is often very useful for fixing simple issues
while developing. If a serious bug is presenting itself, it might be time to graduate to a full debugger.
If you’ve never used a debugger in any language before, you might start by researching what a
‘breakpoint’ is, and what the following mean: ‘step over’, ‘step into’, ‘step out of’, ‘continue’ (these
are all options you have when the execution of your code is suspended).
After you understand what debugging is, you’re ready to use the GNU debugging tool (gdb).
I (Colton) recommend this video, which not only has information on what a debugger is, but gives in
depth information on how to use gdb. https://www.youtube.com/watch?v=bWH-nL7v5F4 (yes, it is
a 20-minute video, but could save you hours of debugging with fprintf!)
If you’re interested in getting a bit fancier, you can set up visual debugging with vscode using this
resource: https://code.visualstudio.com/docs/cpp/cpp-debug
3. Testing: A python3 webserver in one line
If you’re interested in doing some testing on your own machine, you can set up a webserver on your
own computer (after installing python3) in just one line:
“sudo python3 -m http.server 80”
This will start a web server on your localhost hosted on port 80 (type 127.0.0.1 in your browser to
see it). Start the server in the location where the files you want to serve are located.
4. Types of URLs
There are several forms that a URL enclosed in an <a> tag can take. In a reference from another
resource located at http://www.mysite.com/examples/test2.html, the following are all equivalent
links:
Type                                       Example
Absolute (fully specified)                 “http://www.mysite.com/examples/test.html”
Absolute (implied protocol)                “//www.mysite.com/examples/test.html”
Absolute (implied protocol and hostname)   “/examples/test.html”
Relative                                   “./test.html”
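
As a rough sketch, the four forms above can be told apart by their leading characters. The enum and the function `classify_href` are hypothetical names, and the sketch assumes only the lowercase “http://” scheme is of interest, so other schemes would fall through to the relative case.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    URL_ABSOLUTE_FULL,       /* http://www.mysite.com/examples/test.html */
    URL_IMPLIED_PROTOCOL,    /* //www.mysite.com/examples/test.html */
    URL_IMPLIED_HOST,        /* /examples/test.html */
    URL_RELATIVE             /* ./test.html or test.html */
} url_kind;

/* Classifies an href value by its prefix. Assumption: only "http://"
 * counts as a fully specified URL here. */
url_kind classify_href(const char *href)
{
    assert(href != NULL);

    if (strncmp(href, "http://", 7) == 0) {
        return URL_ABSOLUTE_FULL;
    }
    /* "//" must be tested before "/", since it also starts with '/'. */
    if (strncmp(href, "//", 2) == 0) {
        return URL_IMPLIED_PROTOCOL;
    }
    if (href[0] == '/') {
        return URL_IMPLIED_HOST;
    }
    return URL_RELATIVE;
}
```

Once the kind is known, the missing parts (protocol, hostname, directory) can be filled in from the URL of the page that contained the link.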

An important note is that a URL without a ‘resource’ is still valid: e.g. http://www.google.com is a
valid URL despite not specifying a path to a resource (compare with http://www.google.com/, note
the trailing ‘/’). The HTTP specification, however, notes that the absolute path in a request cannot
be empty: if none is present in the original URL, it MUST be given as “/” (the server root) in the
request line (https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html). As such, the two URLs
http://www.google.com and http://www.google.com/ will result in identical HTTP requests.
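A minimal sketch of this rule when building a request: an empty path is replaced by “/” before the request line is written. The function name `build_get_request` is hypothetical, and only the request line and the Host header are shown; check the specification for any other headers you are required to send.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes a minimal HTTP/1.1 GET request for host and path into buf.
 * A NULL or empty path is sent as "/". Returns the length of the
 * request, or -1 if buf is too small. */
int build_get_request(char *buf, size_t size, const char *host,
                      const char *path)
{
    assert(buf != NULL && host != NULL);

    if (path == NULL || path[0] == '\0') {
        path = "/";                    /* the path cannot be empty */
    }
    int n = snprintf(buf, size, "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n",
                     path, host);
    if (n < 0 || (size_t)n >= size) {
        return -1;                     /* output error or truncation */
    }
    return n;
}
```

With this helper, a host of “www.google.com” with path “” and with path “/” produce byte-for-byte identical requests.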
5. Regular Expressions
Regular expressions are a powerful tool to recognize and capture components of unstructured (or
structured) data. For this assignment, you can either use the standard C regex functions (which use
POSIX syntax), or the PCRE (Perl-Compatible Regular Expressions) library.
Note: You’ll learn a lot more about the theoretical side of Regular Expressions + Regular Languages
next semester in COMP30026 Models of Computation.
The regular expression syntax used by PCRE is very widely used today, including in languages like
python and JavaScript, so learning it will potentially be more useful. The downside is that you will
have to bundle the PCRE files into your application package, and build them from the source when
your program is compiled (since PCRE is not installed by default, and you don’t have root access to
install it on your VM). See below for some hints on including + compiling from source for PCRE.
a) Never Used Regular Expressions Before?
If you’ve never used regular expressions before, you’ll need to start by doing some research on what
they are. (See the regex validation tool in the section below to test some regular expressions while
you learn.) Regular expressions are widely used, and having at least a basic understanding of what
they are and how they are used will be useful in your future.
If you’re planning to use PCRE, you may start off by learning in a simple language (e.g., Python or
JavaScript, which have PCRE-like syntax in their regular expressions), and then you’ll understand the
gist of PCRE.
b) Regex Validation
When you’re designing regular expressions and you want to test that they are correct, there are
several online regex validation tools that can help.
One such example is https://regex101.com/. This tool has support for PCRE-like syntax. If you’re
using the built-in C regex library, you’ll need to find a tool that supports that syntax, but this tool
might still be useful for testing the logic of your regular expressions, or for learning what regular
expressions are.
Firstly, click to change delimiter (set it to something we won’t be searching for, like ‘;’)

Here’s an example of what the interface looks like with a regex “.*a(.*)a.*” executing on a string
“this is a test string which is a bit short”. Notice that it shows the ‘capture group’ in green, and the
full match in blue:


c) Escaping the Escape Characters
Escape characters are used before characters that have special meaning to interpret them
literally, or before letters to give them a special meaning (e.g., “\.” matches ‘.’ literally, compared
with the usual special meaning of ‘.’ being ‘any char’; “\d” matches a digit instead of literally
matching ‘d’).
This gets tricky when you take it up one level, because, like the PCRE compiler, the C compiler (gcc)
also uses the escape character ‘\’ to escape things (e.g., ‘\n’). As such, we need to double escape
the escape characters when typing them into our code. For example, say we want to match a string
which has a single digit in it, like the string “1”. The regular expression string we’d write in C is
“\\d”. Here’s why:

Parser          Conversion               Interpretation
C compiler      “\\d” => “\d”            The first \ tells the compiler to interpret the
                                         second \ literally (the first \ is consumed in the
                                         parsing of the string)
PCRE compiler   “\d” => “[0123456789]”   The \ tells PCRE to interpret the d as a special
                                         character, i.e. to match any digit
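
The same double escaping applies to the built-in C regex functions. The sketch below uses them (regcomp and regexec from regex.h) rather than PCRE so that it needs no extra library; the wrapper name `regex_matches` is hypothetical. POSIX syntax does not define “\d”, so the example escapes ‘.’ instead.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Returns 1 if text contains a match for the POSIX extended regular
 * expression pattern, and 0 otherwise (including when the pattern
 * fails to compile). */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return 0;
    }
    int found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}

/* The C literal "a\\.b" holds the four characters  a \ . b
 * so the regex library sees an escaped '.', matching only a literal dot.
 * The C literal "a.b" holds three characters; its '.' matches any char. */
```

For example, `regex_matches("a\\.b", "a.b")` succeeds while `regex_matches("a\\.b", "axb")` does not, whereas the unescaped pattern “a.b” matches both.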

d) Installing PCRE
The VMs that will be used to run your assignment do NOT have PCRE installed on them, and you
don’t have the ability to do so (no sudo access).
As such, to be able to use PCRE in your assignment, you’ll need to include the source files and
compile these as part of your build process. It’s not as scary a process as it sounds, and it’ll teach you
a bit about the usual way in which packages are compiled for Linux distribution.
The broad steps are to:
a) Download the source files for PCRE as a package
b) Place the unzipped source files somewhere in your project structure, perhaps in a ‘/lib’ folder, so
that when copying over to the server, PCRE’s source code is included as part of your project
c) Do some research to learn about the so-called “./configure, make, make install” default process
for building unix applications from source. Many applications use this process to compile and
install, however always check the README (available from https://www.pcre.org/readme.txt)
d) Create a step in your Makefile that compiles the PCRE source code and builds a static library file
(.a file).
• A good way to perform the ‘configure’, ‘make’, ‘make install’ process might be to write a
small shell script (.sh file) which just runs the commands necessary (remember the
./configure, make, make install process)
• A hint: look back at the README for PCRE and figure out how to use the ‘--prefix’ option of
./configure to control where make install places the library files: by default it tries to install
them for the whole system, but this requires sudo privileges which you don’t have on the
VMs.
• Another hint: the ‘--prefix’ option requires an absolute path, not a relative one. Try using cd to
manoeuvre to the location using relative paths, then use $(pwd) to get the absolute path the
shell is currently at. You could also use another tool to resolve relative paths, but be sure to
test compatibility on the VMs
e) Connect the static library file to your program when compiling in your Makefile
e) PCRE Usage
The Linux manual for PCRE (“man pcre”) is also available online or as a download. I (Colton) would
specifically recommend the pcre-demo/pcresample program that comes with it as a good starting
point to see how the process of compiling and executing a regular expression works.
f) Global Regular Expressions with PCRE (or Built-in Library)
Global regular expressions in PCRE (or the Built-in Library) aren’t as easy in C as in other languages
(which typically return a list of matches when a “g” flag is included in the regular expression).
To get an idea of how this might be done, consider looking at the PCRE documentation, specifically
the pcre-demo/pcresample program mentioned previously. This might also help you if you’re using
the built-in library, to get a broad idea of how to perform such a search.
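With the built-in library, one common way to emulate a global search is to call regexec repeatedly, each time starting just after the previous match. The sketch below shows that loop under stated assumptions: `count_matches` is a hypothetical name, matches do not overlap, and each match is simply printed rather than stored.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Prints and counts the non-overlapping matches of pattern in text.
 * Returns the number of matches, or -1 if pattern fails to compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *cursor = text;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    /* After the first search, cursor is no longer the start of the
     * line, so REG_NOTBOL stops '^' from matching there. */
    while (*cursor != '\0' &&
           regexec(&re, cursor, 1, &m, cursor == text ? 0 : REG_NOTBOL) == 0) {
        /* rm_so and rm_eo are offsets relative to cursor, not text. */
        printf("match: %.*s\n", (int)(m.rm_eo - m.rm_so), cursor + m.rm_so);
        count++;
        if (m.rm_eo > m.rm_so) {
            cursor += m.rm_eo;         /* resume right after this match */
        } else if (cursor[m.rm_eo] != '\0') {
            cursor += m.rm_eo + 1;     /* empty match: step one character */
        } else {
            break;                     /* empty match at the end of text */
        }
    }
    regfree(&re);
    return count;
}
```

The special handling of empty matches is what keeps the loop from running forever on patterns such as “x*”.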
6. Getting Help
The Manual: Detailed Information and Behaviours
As you’ve seen in labs, the manual (‘man’) is an important Unix tool which you should use to get
detailed information on the use of Unix programs, but it also has information on C functions:
Try the following on your Unix terminal:
- “man sleep”
- “man 3 sleep”
The integer argument can be used to specify which part of the manual to look in. System calls are in
the (2) section, library functions (C functions, since Unix is written in C) are in the (3) section.
The sections you are likely to use are:
1 Executable programs or shell commands
2 System calls (functions provided by the kernel)
3 Library calls (functions within program libraries)
4 Special files (usually found in /dev)
5 File formats and conventions e.g., /etc/passwd
… (there are a few other sections: for more information, there’s a manual on the manual! Type
‘man man’)
The Search Engine + Lecture Slides: Understanding the Problem and the Tools Available
The manual is a useful reference for the detailed implementation of functions and programs, but it’s
not as much help for figuring out which functions to use.
The first step to solving a problem is to understand the problem. Your favourite search engine can
help you with this, as can the lecture slides. Once you’re confident that you understand what needs
to be done to solve the problem, then it’s time to start thinking about how to implement that
solution. In this case, that would involve reading about how to complete the process using C and
Unix standard library functions.
As a simple example: Imagine yourself having zero computing/coding experience, and you want to
learn how to print out the numbers 1 to 10 on the command line.
a) The first step would involve researching the problem, and eventually figuring out what
“loops” and “print statements” are.
b) From there, you might learn about how to implement these constructs in python, and read
some example programs that do a similar thing.
c) You now have a solid base on which to begin attempting to implement that program.