How and Why Publicly Available Information Is Used

Author: Bruce R Wilkins, CISA, CRISC, CISM, CGEIT, CISSP
Date Published: 11 August 2021

Tips of the Trade

There are thousands of servers on the Internet ready to satisfy any data request that one might have. Although the Internet is commonly divided into the open web, the deep web and the dark web, only the open and deep webs coexist on the same infrastructure. The dark web, which will be explored in a forthcoming article, exists on an entirely separate network.

The Internet was originally designed and implemented by the US military as a science and technology project; today, the open and deep webs it carries are used every day. Since its inception with 12 routers, the Internet has become incredibly complex. However, it is important to remember that, at its core, it is simply a large wide area network (WAN), with select organizations at the top of the network architecture providing some degree of calm and discipline to the chaos by managing its multiplex layers and Internet Protocol (IP) addresses. People who use the network are hosted in user communities connected to the Internet’s WAN. Such user communities may consist of Internet service providers, cloud providers or other types of enterprises conducting online sales or marketing, or merely informing everyone of their existence. These technologies, along with your connected computer, make up the Internet.

There is a whole community of people who have moved from being users of the Internet to consumers of the Internet. This consumption or harvesting of data has increased in popularity for a wide range of data mining purposes. These efforts are targeted at publicly available information (PAI), which is found on social media platforms and commercial websites. This information can be used to determine everything from a product’s quality and availability to the best times to go to the emergency room to when a government may be overthrown to who likes you—and who does not.

Many technical papers have been written about harvesting the web, or as it is often called, using the web as a sensor. Unfortunately, not every topic can be examined here, but the following are some of the most important things to remember about the subject:

  • PAI exists on more than social media. It is also collected by commercial sites used to sell goods and services, and sites that provide the status of shipping, product inventory, medical beds, oil reserves and more.
  • Social media is a subset of PAI. Most people consider social media to consist of protected sites that require a login. However, all social media sites I am aware of do one of two things (some do both):
    1. They provide programming interfaces that allow one to access all accounts and their associated data, or at least some subset of the accounts without authentication. In some cases, payment is required based on data usage.
    2. They sell user data. Users have no say in what is done with their data, since they have no expectation of privacy. In short, this is the cost of a free service. It is not common for sites to sell the raw data outright because it is viewed as a corporate asset. It is more likely that they will sell statistical data, or second order data, derived from their raw data. Some data could also come back to haunt users in the form of unsolicited advertisements. Data collected from people on the web participating in social media or commerce is far more accurate and complete than the census conducted by the US government.
  • For the most part, individuals are users of PAI. To leverage its potential, one must become a consumer of PAI. This involves some programming, but mostly a better understanding of the characteristics of each type of website (e.g., What types of data are stored? Who visits the site? What is the data life for that site?). It also requires cloud storage and several large servers to hold the quantity of data required to feed the analytics that have been written. In addition, one may need to dive into artificial intelligence (AI), or at least deep learning techniques, to become a PAI consumer.
  • Before trusting a given website, one should research it. What organization owns the website? Where is it physically located? Who has invested in the website? Certain governments do not recognize intellectual property rights, which could impact the ability of, for example, a website located in one country to legally send information to the owners of a website located in another country. These circumstances are especially prevalent with collaboration sites that are physically located in the United States, but owned outside the country.
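
The distinction drawn in the list above between raw data and statistical, or second order, data can be made concrete with a small sketch. The records, field names and the second_order_summary function below are hypothetical and invented purely for illustration; they do not reflect how any particular site stores or sells its data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; field names and values are invented for illustration.
raw_reviews = [
    {"user": "u1", "product": "kettle", "rating": 5},
    {"user": "u2", "product": "kettle", "rating": 3},
    {"user": "u3", "product": "toaster", "rating": 4},
    {"user": "u1", "product": "toaster", "rating": 2},
]


def second_order_summary(records):
    """Reduce raw records to per-product statistics that carry no user identifiers."""
    ratings_by_product = defaultdict(list)
    for record in records:
        ratings_by_product[record["product"]].append(record["rating"])
    return {
        product: {"count": len(ratings), "mean_rating": mean(ratings)}
        for product, ratings in ratings_by_product.items()
    }


summary = second_order_summary(raw_reviews)
for product, stats in summary.items():
    print(product, stats)
```

The summary keeps the counts and averages but drops the individual accounts, which is one plausible reason a site might sell it while holding back the raw records.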

There are hundreds of applications that can be built to capture data and turn them into meaningful information. However, some precautions should be taken. Bots are everywhere. They are based on the latest heuristics and can take a product or perspective and distort it to make it more advantageous or less attractive. A single bot can manage up to a thousand bots, all of them able to publish comments or ratings that greatly distort reality. In some areas of the world, bots have skewed reality to such a degree that PAI cannot be trusted.
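
A rough sketch of the distortion described above, under stated assumptions: the ratings are invented, and the accounts named "bot…" are labeled only so the effect is visible. Real bots are not labeled this conveniently, and this is not a bot detection method.

```python
from statistics import mean, median

# Hypothetical (account_id, rating) pairs on a 1-5 scale.
genuine = [("a1", 2), ("a2", 3), ("a3", 2), ("a4", 3), ("a5", 2)]
# A burst of 20 automated accounts, all posting the top rating.
bot_burst = [(f"bot{i}", 5) for i in range(20)]


def ratings(pairs):
    """Extract just the rating values from (account_id, rating) pairs."""
    return [rating for _, rating in pairs]


mean_before = mean(ratings(genuine))
mean_after = mean(ratings(genuine + bot_burst))
median_before = median(ratings(genuine))
median_after = median(ratings(genuine + bot_burst))

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

In this made-up case the bots outnumber the genuine accounts, so even the median, which normally resists a few outliers, moves to the bots' rating. This is the situation in which an application built on the data reports the bots' view rather than the population's.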

Finally, when evaluating a source of information for accuracy and credibility, consider the physics of a website. It is difficult to convey the meaning of life in 126 characters. People tend to comment on negative events rather than positive ones. Censoring removes a given perspective. So, when an application is written to communicate the sentiment of a nation and its population, or whether a product is good or bad, it is important to remember the source.

Bruce R. Wilkins, CISA, CRISC, CISM, CGEIT, CISSP, is the chief executive officer of TWM Associates Inc. In this capacity, Wilkins provides his customers with secure engineering solutions for innovative technology and cost-reducing approaches to existing security programs.