Glomming a Database – Part I

The first thing to do when wrapping and glomming a database is to extract the endeme sets.

Part I – Extracting the endeme sets

It would be useful to build a tool that extracts candidate endeme sets from a database in preparation for glomming it. With those sets in hand, you could build an endematic wrapper around the entire database.

Process the items of each column to identify the endeme sets:

  • use a Levenshtein matrix to coalesce similar values.
  • use my ‘endemes for 5000 words’ document to coalesce similar meanings.
  • use other approaches based in data science.
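
As a rough illustration of the first bullet, here is a minimal sketch in Python of coalescing near-duplicate column values by edit distance. The function names, the greedy grouping strategy, the `max_distance=2` threshold, and the sample column are all assumptions made for this example, not part of any existing tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance from the classic Levenshtein matrix, kept one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def coalesce(values, max_distance=2):
    """Greedily group values that are close to the first member of a group.

    Assumption: values are compared after trimming and lower-casing.
    """
    groups = []
    for value in values:
        key = value.strip().lower()
        for group in groups:
            if levenshtein(key, group[0].strip().lower()) <= max_distance:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


# Hypothetical status column with typos and stray whitespace.
column = ["Pending", "pending ", "Pendng", "Shipped", "Shiped", "Cancelled"]
groups = coalesce(column)
for group in groups:
    print(group)
```

Each resulting group is one candidate item; the list of groups is a candidate endeme set for that column. Comparing only against a group's first member is a simplification; a real tool would probably want proper clustering.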

Lookup tables may be implicit or explicit.

Table Column analyses

Column analyses:

  • general content/data columns
    – mostly unique items columns [U] – no endeme sets extracted
  • id/plumbing column
  • context column
  • lookup id column
  • implicit lookup table columns
    • Bits – multiple bit columns treated as one ‘column’: process the items of multiple columns of the same type (bit) together [B] – 1 set extracted
    • Conflated – two or more endeme sets [C] – 2+ sets extracted
    • Denormalized – zero normal form column [D] – 1 or 2 sets extracted
    • Endematic – endematic range – 16-32 rows [E] – 1 set extracted
    • Few – few rows – under 8 [F] – a fraction of a set extracted
    • Many – many different items in the column [M] – 1 set extracted
  • Freetext
    – R 1+  freetext column with repeating words
    – R 1+  freetext column with repeating concepts
  • Look for concept sets, concepts have additional structure that endemes do not have
    – concepts are generally characterized by two, three, or four endeme sets in a row or chain
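
The size-based categories above can be sketched as a small classifier over a column's values. This is only an illustration in Python: `classify_column` is a hypothetical name, the 0.9 cutoff for "mostly unique" is an assumption, and the "over 64" cutoff for [M] is taken from the lookup table list further down. The outline does not say what happens to columns with 8 to 15 or 33 to 64 distinct items, so the sketch returns `None` for those rather than guessing.

```python
def classify_column(values, unique_ratio=0.9):
    """Return 'U', 'F', 'E', 'M', or None based on the distinct values in a column.

    Assumptions: NULLs (None) are ignored, and a column counts as "mostly
    unique" when at least unique_ratio of its non-null values are distinct.
    """
    values = [v for v in values if v is not None]
    if not values:
        return None
    distinct = len(set(values))
    if distinct / len(values) >= unique_ratio:
        return "U"   # mostly unique items: no endeme sets extracted
    if distinct < 8:
        return "F"   # few: a fraction of a set
    if 16 <= distinct <= 32:
        return "E"   # endematic range: 1 set
    if distinct > 64:
        return "M"   # many different items: 1 set
    return None      # 8-15 or 33-64 distinct items: not covered by the outline


# Hypothetical columns, each with 1000 rows.
print(classify_column(["red", "green", "blue"] * 100))  # few distinct items
print(classify_column([i % 20 for i in range(1000)]))   # endematic range
print(classify_column(list(range(1000))))               # mostly unique
```

The Bits, Conflated, and Denormalized categories need more than counting (column types and the internal structure of values), so they are left out of this sketch.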

Lookup table analyses

Explicit lookup table analyses: process the lookup ids in a database to build endeme sets

  • explicit lookup table content column
    • Bits – multiple bit columns treated as one ‘column’: process the items of multiple columns of the same type (bit) together [B] – 1 set extracted
    • Conflated – two or more endeme sets [C] – 2+ sets extracted
    • Denormalized – zero normal form column [D] – 1 or 2 sets extracted
    • Endematic – endematic range – 16-32 rows [E] – 1 set extracted
    • Few – few rows – under 8 [F] – a fraction of a set extracted
    • Many – many rows – over 64 [M] – 1 set extracted
    • Unique – most rows are unique – no endeme sets extracted
  • where it’s used as context
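
For the Bits case, one way to read "1 set extracted" is that the bit columns themselves become the items of a single candidate set, and each row is described by the flags that are switched on. The sketch below assumes that reading; the function name, the row format (one dict per row), and the column names are all made up for the example.

```python
def bit_columns_to_set(rows, bit_columns):
    """Treat several bit columns as one candidate set.

    Returns the candidate set (the column names) and, for each row,
    the members of that set whose bit is on.
    """
    candidate_set = list(bit_columns)
    memberships = [[name for name in bit_columns if row.get(name)] for row in rows]
    return candidate_set, memberships


# Hypothetical rows from a table with three bit columns.
rows = [
    {"id": 1, "is_active": 1, "is_admin": 0, "is_verified": 1},
    {"id": 2, "is_active": 0, "is_admin": 0, "is_verified": 1},
]
candidate_set, memberships = bit_columns_to_set(
    rows, ["is_active", "is_admin", "is_verified"]
)
print(candidate_set)
print(memberships)
```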

Junction table analyses

  • there’s got to be something I can do with junction tables.

Views and Reports

Endemizing reports, views, and stored procedures that return data sets (‘report’ and ‘get’ sp’s)

  • Context profiles?
  • Endematic metadata – reports mostly show numbers, endemes can store relative values
  • The row is the endeme item, the endeme indicates how it ‘compares’/‘relates’ to other items

Other stored procedures (and inline SQL code)

These may be used to identify contexts and relationships between tables and columns.
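
A very rough way to start on this is to scan the SQL text for tables that are referenced together. The Python sketch below is an assumption-heavy illustration: it uses a simple regular expression rather than a SQL parser, only looks at names following FROM and JOIN, does not handle bracketed or quoted identifiers, and the sample procedure bodies are invented.

```python
import re
from collections import Counter
from itertools import combinations

# Matches the identifier that follows FROM or JOIN (no brackets or quotes).
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def co_referenced_tables(sql_texts):
    """Count how often each pair of tables appears in the same SQL statement."""
    pairs = Counter()
    for sql in sql_texts:
        tables = sorted({name.lower() for name in TABLE_REF.findall(sql)})
        pairs.update(combinations(tables, 2))
    return pairs


# Hypothetical stored procedure bodies.
procedures = [
    "SELECT o.id FROM Orders o JOIN Customers c ON c.id = o.customer_id",
    "SELECT * FROM orders JOIN OrderStatus s ON s.id = orders.status_id",
    "SELECT c.name FROM Customers c",
]
pairs = co_referenced_tables(procedures)
for (left, right), count in pairs.items():
    print(left, right, count)
```

Pairs that show up together often are candidates for a relationship or a shared context; the ON clauses would be the next thing to mine for the columns involved.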

 

I wonder if I can do the same thing with code?
