3.3. File and Directory Discovery in Web Applications



In this lesson, we will discover the structure of a website and find entry points.

Why do we need this?

Understanding a website's structure provides us with the following information:

  • Framework: By analyzing the specific directory structure, file names, and directory names, we can determine which CMS/framework was used to create the application. This gives us information about known vulnerabilities and possible exploits.
  • Hidden Directories/Files: Websites can contain hidden directories and files that may assist in compromising the site or contain confidential data.
  • Parameters (including hidden ones): Discovered parameters serve as entry points/access to the website, through which we can interact with the application and its stored data. Practically all injection attacks work through GET/POST parameters.

Below are some examples of file extensions and directory names that can help identify server technologies:

File Extensions:

  • .asp – Microsoft Active Server Pages
  • .aspx – Microsoft ASP.NET
  • .jsp – Java Server Pages
  • .cfm – ColdFusion
  • .php – PHP
  • .d2w – WebSphere
  • .pl – Perl
  • .py – Python
  • .dll – Compiled native code (C or C++)
  • .nsf or .ntf – Lotus Domino

Directory Names:

  • servlet – Java servlets
  • pls – Oracle Application Server PL/SQL gateway
  • cfdocs or cfide – ColdFusion
  • SilverStream – SilverStream web server
  • WebObjects or {function}.woa – Apple WebObjects
  • rails – Ruby on Rails
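
The two lists above can be turned into a simple lookup. The following is a minimal sketch, not a complete fingerprinting tool: the function name guess_technologies and the example URLs are illustrative assumptions, and a match is only a hint, since servers can be configured to serve any extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hints taken from the file-extension list above.
EXTENSION_HINTS = {
    ".asp": "Microsoft Active Server Pages",
    ".aspx": "Microsoft ASP.NET",
    ".jsp": "Java Server Pages",
    ".cfm": "ColdFusion",
    ".php": "PHP",
    ".d2w": "WebSphere",
    ".pl": "Perl",
    ".py": "Python",
    ".dll": "Compiled native code (C or C++)",
    ".nsf": "Lotus Domino",
    ".ntf": "Lotus Domino",
}

# Hints taken from the directory-name list above (keys are lowercase).
DIRECTORY_HINTS = {
    "servlet": "Java servlets",
    "pls": "Oracle Application Server PL/SQL gateway",
    "cfdocs": "ColdFusion",
    "cfide": "ColdFusion",
    "silverstream": "SilverStream web server",
    "webobjects": "Apple WebObjects",
    "rails": "Ruby on Rails",
}


def guess_technologies(url):
    """Return a sorted list of technology hints found in the URL path."""
    path = PurePosixPath(urlparse(url).path.lower())
    hints = set()
    if path.suffix in EXTENSION_HINTS:
        hints.add(EXTENSION_HINTS[path.suffix])
    for part in path.parts:
        if part in DIRECTORY_HINTS:
            hints.add(DIRECTORY_HINTS[part])
        elif part.endswith(".woa"):
            # Covers the {function}.woa pattern from the list above.
            hints.add("Apple WebObjects")
    return sorted(hints)
```

For example, guess_technologies("http://example.com/servlet/login.jsp") reports both the servlet directory and the .jsp extension, while a URL ending in index.html yields an empty list.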

What research methods are available to us?

To obtain the above-mentioned data, we can use the following methods:

  • Analyzing the robots.txt file
  • Analyzing comments and HTML/JavaScript code
  • Analyzing the sitemap.xml file
  • Spidering/Crawling – manually/automatically exploring available links and navigating them to build a site map.
  • Directory Busting/Dictionary Attack – hidden directories/files and parameters are identified using a large list of common file and directory names.
  • Fuzzing – sending a large amount of data in parameters to observe the server's response and determine which parameter values the server accepts.
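
To make the dictionary-attack idea from the list above concrete, here is a minimal sketch. It assumes a caller-supplied fetch_status function (hypothetical, not part of any library) that requests a URL and returns the HTTP status code; the names candidate_urls and bust are also illustrative. Treating every non-404 status as a hit is a simplification, because some servers answer unknown paths with 200 or a redirect.

```python
def candidate_urls(base_url, words, extensions=("",)):
    """Yield one candidate URL for every word/extension combination."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    for word in words:
        for ext in extensions:
            yield base + word.strip("/") + ext


def bust(base_url, words, fetch_status, extensions=("",)):
    """Probe each candidate and keep those that do not return 404.

    fetch_status is an assumed callable: it takes a URL and returns
    the HTTP status code the server answered with.
    """
    found = []
    for url in candidate_urls(base_url, words, extensions):
        status = fetch_status(url)
        if status != 404:
            found.append((url, status))
    return found
```

In practice the word list would come from a large file of common names, and fetch_status would wrap a real HTTP client. Only run such a scan against applications you are authorized to test.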

Now, let's go through each of these methods step by step.


Robots.txt file

The "robots.txt" file is a plain text file that contains directives for search engine robots, such as those of Google and Yandex. The core directives are simple: "Allow" and "Disallow" tell robots which directories and files they may or may not crawl. The file sits in the root directory of the website and can be accessed through a simple URL like https://example.com/robots.txt. You can open it in a regular web browser or fetch it with command-line tools such as Netcat or curl.

Here's what a typical "robots.txt" file might look like:

Example view of robots file

We are interested in the "Disallow" directives and their values because they typically contain the names of directories or paths that the website owner wants to prevent search engine robots from indexing. However, it's important to note that search engine robots may choose to ignore these instructions. Additionally, not all websites have a "robots.txt" file, and its presence or absence does not affect the functioning of the website.
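
Collecting the "Disallow" values is easy to automate. The sketch below parses robots.txt content that has already been downloaded; the function name disallowed_paths and the sample content are illustrative assumptions, not taken from a real site.

```python
def disallowed_paths(robots_txt):
    """Return the unique Disallow values from robots.txt content, in order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            # An empty Disallow value blocks nothing, so skip it.
            if value and value not in paths:
                paths.append(value)
    return paths


# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old archives
Allow: /public/
Disallow:
"""

print(disallowed_paths(SAMPLE_ROBOTS))
```

Each returned path is a candidate to visit manually, just like any other link found during discovery.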

As an example, you can start the Juice Shop application and then open its "robots.txt" file in your web browser:

Found robots file in Juice Shop

The file contains a single disallow directive for the "ftp" directory. Let's attempt to access this directory:

Found FTP link in robots file

As you can see, there are indeed some files listed in the "ftp" directory. Analyzing these files may provide us with more information about the website.


Sitemap.xml file

The "sitemap.xml" file is also used for search engine optimization (SEO) and is referred to as a sitemap. It contains links to the website's pages and assists search engines in indexing those pages more efficiently. The file is typically located in the root directory of the website, alongside the "robots.txt" file, and can be accessed through a URL like http://example.com/sitemap.xml.

The primary information in the "sitemap.xml" file consists of full URLs and, optionally, the date each page was last modified. Sitemaps are usually generated automatically based on specific filters that dictate which links should be included in the sitemap and which should be excluded. When these filters are misconfigured, unwanted links, such as directories containing configuration or debugging files, can end up in the sitemap.

Having a sitemap is not mandatory, but it is recommended for faster indexing of pages by search engines. It helps search engine crawlers discover and index the content on your website more efficiently, improving its visibility in search results.

If you have access to a website's "sitemap.xml" file, you can review its contents to gain insights into the website's structure and potentially discover additional information about its pages and directories.
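
Reviewing a large sitemap by hand is tedious, so here is a minimal sketch that lists its URLs with Python's standard XML parser. It assumes the file follows the standard sitemaps.org schema with a urlset root element; sitemap index files, which point to other sitemaps, are not handled. The function name and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when it is absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-01-15</lastmod></url>
  <url><loc>https://example.com/debug/config.bak</loc></url>
</urlset>"""

for loc, lastmod in sitemap_entries(SAMPLE_SITEMAP):
    print(loc, lastmod)
```

Entries that point to configuration, backup, or debugging paths, like the second URL in the sample, are the ones worth a closer look.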



Spidering/Crawling

Spidering or crawling refers to the automated process of identifying available links, pages, and other resources on a website. A search engine robot or program scans a page, and if it finds links within that page, it adds them to a list of discovered links. The robot then follows these links, and if the newly opened page contains additional links, they are also marked as discovered, and the process continues until all links have been identified. In essence, the search process resembles a tree, with the root being the main or home page of the website, and branches and child elements representing pages, images, files, and other resources.

As mentioned earlier, this process is automated. However, it's worth noting that manual exploration can also be a valuable approach. Some applications require filling out various forms, clicking buttons, and interacting with other interactive elements to navigate to another page or display different content. Unfortunately, automated scanners may struggle with this task, which is why I recommend combining both methods.

By using a combination of automated scanning tools and manual testing, you can ensure comprehensive coverage when exploring a website's structure, links, and potential vulnerabilities. This approach allows you to leverage the strengths of automated tools while addressing the limitations they may have in handling complex interactions on web pages.
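
The automated part of this process can be sketched as a breadth-first traversal. This is a simplified illustration, not how any particular scanner is implemented: it assumes a caller-supplied fetch function (hypothetical) that returns a page's HTML as a string, it only follows href and src attributes, and it stays on the starting host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the values of href and src attributes on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that returns every same-host URL discovered.

    fetch is an assumed callable: it takes a URL and returns the HTML text.
    At most max_pages pages are fetched.
    """
    host = urlparse(start_url).netloc
    discovered = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        page = queue.popleft()
        fetched += 1
        parser = LinkParser()
        parser.feed(fetch(page))
        for link in parser.links:
            # Resolve relative links and drop the #fragment part.
            url, _fragment = urldefrag(urljoin(page, link))
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                discovered.append(url)
                queue.append(url)
    return discovered
```

Because the sketch only reads static links, it misses pages that appear after submitting a form or clicking a button, which is exactly the gap that manual exploration fills.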



OWASP ZAP

OWASP ZAP (Zed Attack Proxy) is a powerful tool that acts as an intercepting proxy between your web browser and a web server. If a website uses SSL/TLS, the browser establishes the secure connection with ZAP rather than directly with the server, and ZAP opens its own connection to the server, which lets it read the traffic. ZAP has numerous features, and we will explore many of them throughout this course. For now, let's focus on its manual and automatic functions for mapping website structure. To get started, install the program by running the following command:

sudo apt install zaproxy -y

Once it's installed, open the system's main menu and navigate to "03-Web Application Analysis" -> "Zap". Alternatively, search for "ZAP" in the menu's search bar:

Searching and launching ZAP in start menu

Immediately after launching the program, it offers you options for how you want to save your session and results. Here are the three options:

  • By Timestamp: This option will save the session using the current timestamp as the name. This is useful for organizing sessions based on time.
  • Choose Session Name: With this option, you can manually specify a name for the session. This allows you to give the session a more descriptive name.
  • Do Not Save Session: Select this option if you do not wish to save the session currently, or if you intend to save it manually later.

Choose the option that best suits your workflow, and then click "Start" to proceed:

Options of session persist in ZAP

The program offers both automatic and manual methods of investigation. Let's start by selecting the manual method:

View of scanning modes in ZAP

The program can launch a browser, either Chrome or Firefox, that is already configured to send its traffic through ZAP. Choose either one, and then click the "Launch Browser" button to proceed:

ZAP manual mode selection

After opening the browser, enter the address and navigate to the WackoPicko application. Explore the application, fill out some forms, and then switch to ZAP. You will see the following view:

Manual scanning of WackoPicko with ZAP

The program has organized the links into a tree-like structure, and in the lower panel, all the requests made during manual exploration are displayed.

Now, let's initiate the automatic scanning of links. To do this, follow these steps:

  1. Select the root directory of WackoPicko.
  2. Right-click to open the context menu.
  3. Choose "Attack" -> "Spider."

In the dialog box that opens, leave everything as is and simply click "Start Scan." After the scanning process is complete, the directory will be populated with new links. Each link is labeled with either the GET or POST method.

You can choose any link, and then in the right panel, select "Request" to view the request or "Response" to view the response:

Results of website spidering with ZAP

What can you do with the obtained results?

First, you should go through all the links and understand their purposes. Then, make note of those links that use GET/POST/PUT parameters because they are primarily susceptible to attacks like SQL Injection, Command Injection, Cross-Site Scripting, and others. However, this doesn't mean that links without parameters are immune to attacks. Different types of attacks can be targeted at them, and we'll discuss those later.
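
Picking out the links that carry parameters can be partly automated once you have a list of discovered URLs. The sketch below groups query-string parameter names by path; the function name parameter_map and the example URLs are illustrative assumptions. It covers only parameters visible in the URL, so POST and PUT body parameters still need to be read from the request view.

```python
from urllib.parse import parse_qs, urlparse


def parameter_map(urls):
    """Map each path to the sorted query parameter names seen for it."""
    params = {}
    for url in urls:
        parts = urlparse(url)
        # keep_blank_values keeps parameters such as "?q=" that have no value.
        names = parse_qs(parts.query, keep_blank_values=True)
        if names:
            params.setdefault(parts.path, set()).update(names)
    return {path: sorted(names) for path, names in params.items()}


# Hypothetical URLs used only for illustration.
urls = [
    "http://test.local/item.php?id=3",
    "http://test.local/item.php?id=4&sort=asc",
    "http://test.local/about",
    "http://test.local/search?q=",
]
print(parameter_map(urls))
```

Paths that appear in the result are the first candidates for the injection tests mentioned above.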

Scanning directories can also be done with tools similar to ZAP, such as Burp Suite Pro, WebScarab, and others. These tools provide various features and can help you further assess the security of a web application. Understanding the purpose and potential vulnerabilities of each link is crucial for conducting comprehensive security testing and for identifying and addressing security issues in the web application.