Frequently Asked Questions
-
What is a "Web Harvest"?
Web harvesting is the process of automatically copying and organizing unstructured information from pages and data on the World Wide Web. It is also known as web mining, web scraping and web crawling. Websites are identified with a "seed list" of URLs which are "harvested" so that content within, or linked to an identified site, is captured and copied.
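As a rough illustration of how a seed list determines what gets captured, the sketch below walks a small in-memory link graph instead of fetching real pages. The `harvest` function, the `links` mapping, and the `.example` URLs are hypothetical, and the scope rule (follow links only from pages on a seed site) is a simplifying assumption, not the actual configuration of the crawlers used for these harvests.

```python
from collections import deque
from urllib.parse import urlparse


def harvest(seeds, links):
    """Return the set of URLs captured when starting from a seed list.

    `links` maps each URL to the URLs it links to. It stands in for
    fetching and parsing real pages (an assumption for this sketch).
    """
    seed_hosts = {urlparse(url).netloc for url in seeds}
    captured = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in captured:
            continue
        captured.add(url)
        # Links are followed only from pages on a seed site, so a page on
        # another site is captured when linked but is not crawled further.
        if urlparse(url).netloc not in seed_hosts:
            continue
        queue.extend(links.get(url, []))
    return captured


# Hypothetical link graph: one seed site, one linked external site,
# and one site that no seed page links to.
links = {
    "https://seed.example/": [
        "https://seed.example/about",
        "https://other.example/report",
    ],
    "https://seed.example/about": [],
    "https://other.example/report": ["https://other.example/deep"],
    "https://unlinked.example/": [],
}

captured = harvest(["https://seed.example/"], links)
```

In this sketch, `captured` holds the two pages on the seed site plus the directly linked report. The page at `https://unlinked.example/` is never reached because nothing in the seed list leads to it.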
-
How accurate is the harvest?
The accuracy of each harvest was affected by these factors:
- The completeness of URL source lists;
- Whether URLs resolved successfully; and
- The capabilities of the crawler tools used (see Heritrix and Brozzler) and of the server environment being crawled. See a report on limitations of capabilities.
NARA has made every reasonable effort to ensure that websites' code and programming were captured accurately. NARA is not responsible for any website's compliance with Federal laws, regulations, and requirements. NARA is responsible for providing public access to these copied websites but is not responsible for maintaining code such as links, accessibility features, search or site maps, or other functionality that may have existed on the sites before they were copied.
-
How often is the harvest conducted?
NARA conducts the Congressional web harvest at the end of each Congress. Crawling begins in September every two years, and continues until January 3 when the new Congress begins.
NARA will determine the regularity of White House harvests once the viability of conducting a web harvest of Presidential Records Act websites is determined.
Agencies should continue to follow current guidance on the scheduling and transfer of permanent web records. See Web Records at NARA for more information.
-
Who conducted the harvest?
NARA is currently under contract with the Internet Archive (IA), a San Francisco nonprofit, to perform the harvest.
-
How large is the collection?
The harvest collection includes 190 terabytes of archived websites. The most recent harvest from the 118th Congress preserved over 400,000,000 URLs totaling over 32 terabytes of web data.
-
Why doesn't form input or streaming video work in the collection?
A harvest engine cannot read or use forms, video, or complex JavaScript. As a result, forms and databases are not active in the harvest, and files that can only be streamed from a website have not been harvested.
-
Can I search the archive?
Yes, by:
- Navigating to the harvest collection you’d like to search;
- Entering a search term in the search bar at the top of the collection, which searches the combined House and Senate harvests; or
- Browsing from the House or Senate home pages.
-
Why isn't the site I'm looking for in the archive?
Sites were not harvested because:
- they were not linked to one of the supplied URLs;
- they were password protected; or
- the harvest engine could not find or access them. (Note: Harvest engines do not capture dynamic web content. See a report on limitations of capabilities.)
-
Does webharvest.gov track usage statistics?
This website uses Google Analytics. Please refer to the following policies on Google's website for more information:
- Google's main privacy policy
- Cookies & Google Analytics on Websites
- Opt out of Google Analytics Cookies
