A document is a full copy of an item found at a source. A full copy consists of the HTML, optionally the plain text, and all images and video files.
OfflineWeb maintains data snapshots from three document sources:
- Wikipedia
- Gutenberg
- YouTube
The application has to maintain multiple types of document, because the sources differ inherently and the application serves several purposes.
The index requires plain text, but the user-facing application requires HTML, images and videos.
Document types served to the user (user-facing):
- HTML pages
- Plain texts
- Images, mainly JPEGs and PNGs
- Videos, mainly MP4s
Document types kept internally (internal):
- Plain text.
- Index, built to search through the user-facing documents.
- Persistent database entries, kept to support index and box management duties.
Internal document types are maintained for every source, but the user-facing types vary by source:
- Wikipedia
  - HTML pages
  - Plain texts
  - Images
- Gutenberg
  - HTML pages
  - Plain texts
  - Images
- YouTube
  - Videos
  - Plain texts (video title, summary)
  - Images (thumbnail)
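The per-source breakdown above can be written down as a small lookup table. The following is only a sketch: the names `Source`, `UserFacingType`, `USER_FACING_TYPES` and `serves` are hypothetical and chosen for illustration, not taken from the OfflineWeb code.

```python
from enum import Enum


class Source(Enum):
    WIKIPEDIA = "WIKIPEDIA"
    GUTENBERG = "GUTENBERG"
    YOUTUBE = "YOUTUBE"


class UserFacingType(Enum):
    # Hypothetical labels for the user-facing document types listed above.
    HTML = "html"
    PLAIN_TEXT = "plain_text"
    IMAGE = "image"
    VIDEO = "video"


# Which user-facing types each source provides, as listed in this section.
USER_FACING_TYPES = {
    Source.WIKIPEDIA: frozenset(
        {UserFacingType.HTML, UserFacingType.PLAIN_TEXT, UserFacingType.IMAGE}
    ),
    Source.GUTENBERG: frozenset(
        {UserFacingType.HTML, UserFacingType.PLAIN_TEXT, UserFacingType.IMAGE}
    ),
    Source.YOUTUBE: frozenset(
        {UserFacingType.VIDEO, UserFacingType.PLAIN_TEXT, UserFacingType.IMAGE}
    ),
}


def serves(source: Source, doc_type: UserFacingType) -> bool:
    """Return True if the given source provides the given user-facing type."""
    return doc_type in USER_FACING_TYPES[source]
```

For example, `serves(Source.YOUTUBE, UserFacingType.VIDEO)` is true, while `serves(Source.WIKIPEDIA, UserFacingType.VIDEO)` is false.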
Properties of a document
- Document Id (string, UUID) : Unique, generated id for a document.
- Doc Id at source (string) : Each source assigns an id to a document. It is needed to track the status of the document at the source, so that our copy can be updated.
- Title (string) : Title of the document.
- Abstract (string) : First 160 characters of the document.
- Source (string) : Origin of the document. Can be one of three: WIKIPEDIA, GUTENBERG, or YOUTUBE.
- State (string) : Indicates staleness of the document in our datastore. Can be one of three:
  - FRESH : true copy of the source.
  - STALE : older than the source.
  - UIP : document is being updated.
- Create date (timestamp) : Time of creation.
- Update date (timestamp) : Last update time.
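As an illustration, the properties above could be modelled as a record like the one below. This is a minimal sketch, not the actual OfflineWeb schema: the class and field names, the helper `make_document`, the choice of UUID version 4, and the choice to start new documents in the FRESH state are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Source(str, Enum):
    WIKIPEDIA = "WIKIPEDIA"
    GUTENBERG = "GUTENBERG"
    YOUTUBE = "YOUTUBE"


class State(str, Enum):
    FRESH = "FRESH"  # true copy of the source
    STALE = "STALE"  # older than the source
    UIP = "UIP"      # document is being updated


ABSTRACT_LENGTH = 160  # the abstract is the first 160 characters


@dataclass
class Document:
    source_doc_id: str   # id assigned by the source
    title: str
    abstract: str
    source: Source
    create_date: datetime
    update_date: datetime
    state: State = State.FRESH  # assumption: new copies start out fresh
    # Assumption: a random (version 4) UUID, stored as a string.
    document_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def make_document(source: Source, source_doc_id: str, title: str, text: str) -> Document:
    """Build a new record; `text` is the plain text of the document."""
    now = datetime.now(timezone.utc)
    return Document(
        source_doc_id=source_doc_id,
        title=title,
        abstract=text[:ABSTRACT_LENGTH],
        source=source,
        create_date=now,
        update_date=now,
    )
```

Storing `source` and `state` as string-valued enums keeps the stored values identical to the strings listed above while rejecting any other value in code.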