Document

A document is a full copy of a document found at a source. A full copy consists of the HTML, optionally the plain text, and all images and video files.

OfflineWeb maintains data snapshots from three document sources.

  1. Wikipedia
  2. Gutenberg
  3. YouTube

The application has to maintain multiple types of document, owing to the inherent differences between the sources and the different purposes within the application. The index requires plain text, while the user facing application requires HTML, images and videos.

Document types served to the user (user facing)

  1. HTML pages
  2. Plain texts
  3. Images, mainly JPEGs and PNGs
  4. Videos, mainly MP4s

Document types kept internally (internal)

  1. Plain text.
  2. Index, built to search through the user facing documents.
  3. Persistent database entries, kept to support index and box management duties.

Internal document types are maintained for every source, but external (user facing) types vary by source.

  1. Wikipedia
    • HTML pages
    • Plain texts
    • Images
  2. Gutenberg
    • HTML pages
    • Plain texts
    • Images
  3. YouTube
    • Videos
    • Plain texts (video title, summary)
    • Images (thumbnail)
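
The per-source breakdown above can be written down as a small lookup table. The following is a minimal sketch, not the application's actual code: the names `Source`, `UserFacingType`, `USER_FACING_TYPES` and `serves` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Source(Enum):
    WIKIPEDIA = "WIKIPEDIA"
    GUTENBERG = "GUTENBERG"
    YOUTUBE = "YOUTUBE"


class UserFacingType(Enum):
    # Hypothetical labels for the four user facing document types.
    HTML = "html"
    PLAIN_TEXT = "plain_text"
    IMAGE = "image"
    VIDEO = "video"


# User facing types per source, transcribed from the list above.
USER_FACING_TYPES = {
    Source.WIKIPEDIA: {
        UserFacingType.HTML,
        UserFacingType.PLAIN_TEXT,
        UserFacingType.IMAGE,
    },
    Source.GUTENBERG: {
        UserFacingType.HTML,
        UserFacingType.PLAIN_TEXT,
        UserFacingType.IMAGE,
    },
    Source.YOUTUBE: {
        UserFacingType.VIDEO,
        UserFacingType.PLAIN_TEXT,  # video title, summary
        UserFacingType.IMAGE,       # thumbnail
    },
}


def serves(source: Source, doc_type: UserFacingType) -> bool:
    """Return True if the given source provides the given user facing type."""
    return doc_type in USER_FACING_TYPES[source]
```

A caller could then check, for example, `serves(Source.YOUTUBE, UserFacingType.HTML)` before trying to render an HTML page for a YouTube document.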

Properties of a document

  • Document Id (string, UUID) : Unique, generated id for a document.
  • Doc Id at source (string) : Each source assigns an id to a document. It is needed to track the status of the document at the source, so that update
  • Title (string) : Title of the document.
  • Abstract (string) : First 160 characters of the document.
  • Source (string) : Origin of the document. Can be one of three: WIKIPEDIA, GUTENBERG or YOUTUBE.
  • State (string) : Indicates staleness of the document in our datastore. Can be one of three: FRESH (true copy of the source), STALE (older than the source), or UIP (the document is being updated).
  • Create date (timestamp) : Time of creation.
  • Update date (timestamp) : Last update time.
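
To make the property list concrete, here is a minimal sketch of how such a record might be modelled. It is an illustration under assumptions, not the application's schema: the class and field names (`Document`, `source_doc_id`, `make_document`, and so on) are hypothetical, the choice of UUID version 4 and UTC timestamps is assumed, and starting a new document in the FRESH state is assumed as well.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

ABSTRACT_LENGTH = 160  # "First 160 characters of the document"


class Source(Enum):
    WIKIPEDIA = "WIKIPEDIA"
    GUTENBERG = "GUTENBERG"
    YOUTUBE = "YOUTUBE"


class State(Enum):
    FRESH = "FRESH"  # true copy of the source
    STALE = "STALE"  # older than the source
    UIP = "UIP"      # the document is being updated


def _now() -> datetime:
    # Assumption: timestamps are stored in UTC.
    return datetime.now(timezone.utc)


@dataclass
class Document:
    source_doc_id: str  # id assigned by the source
    title: str
    source: Source
    abstract: str = ""
    state: State = State.FRESH  # assumption: new copies start as FRESH
    # Assumption: a random (version 4) UUID serves as the generated id.
    document_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    create_date: datetime = field(default_factory=_now)
    update_date: datetime = field(default_factory=_now)


def make_document(source: Source, source_doc_id: str,
                  title: str, plain_text: str) -> Document:
    """Build a new record, deriving the abstract from the plain text."""
    now = _now()
    return Document(
        source_doc_id=source_doc_id,
        title=title,
        source=source,
        abstract=plain_text[:ABSTRACT_LENGTH],
        create_date=now,
        update_date=now,
    )
```

For example, `make_document(Source.GUTENBERG, "1342", "Pride and Prejudice", text)` would yield a record whose abstract is the first 160 characters of `text`; the source id and title here are placeholders.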