Re: what is a data warehouse?
- From: "Dave Wickert [MSFT]" <dwickert@xxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 11 Sep 2005 21:36:18 -0700
There are actually two major approaches to data warehousing. Ralph Kimball's
dimensional modeling approach is one; the other is enterprise modeling. For
that, look at books by Bill Inmon and/or Claudia Imhoff, such as:
http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0471399612
or
http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0471081302
Both approaches are called "data warehouing" although they use *very*
different approaches, which is one of the causes of the misunderstandings
and challenges to understand the technology that you've mentioned below.
Before you fall in love with Kimball's dimensional modeling (which is also
my favorate), you should at least be aware of Inmon's and Imhoff's approach.
--
Dave Wickert [MSFT]
dwickert@xxxxxxxxxxxxxxxxxxxx
Program Manager
BI Systems Team
SQL BI Product Unit (Analysis Services)
--
This posting is provided "AS IS" with no warranties, and confers no rights.
"Ian Boyd" <admin@xxxxxxxxxxx> wrote in message
news:eqvJgT0tFHA.904@xxxxxxxxxxxxxxxxxxxxxxx
>i asked this same question about a year ago. i kept reading about data
>warehouses, but they would only use made up terms to describe data
>warehouses. i had to way "in" to understand what they were talking about.
>
> Someone then recommended "The Data Warehouse Toolkit" by Ralph Kimball.
> And now i understand; and i recommend the same book.
> Some of the concepts are so foreign and alien to someone who has only ever
> done "normal" databases, that i wouldn't have understood it if i didn't
> see his examples in the book.
>
> The fundamental goal of a data warehouse is to arrange your data in a way
> that normal non-technical end-users (i.e. managers) can peek into the
> database themselves, and assemble whatever queries they want. The
> fundamental goal is to let people "get at the data".
>
> One major design difference between regular databases (your transactional
> system), and a data warehouse database is denormalized "dimension" tables.
> i'll give a limited example, that really hit me over the head with how
> different data warehouse database design is from regular database design.
> This example comes from the book (but the book is much more
> comprehensive).
>
> Take a grocery store's database, that has a Transactions table. This table
> holds all the items on customers receipts. In a regular database, you
> would probably see a design like:
>
> CREATE TABLE Transactions (
> Date datetime,
> ProductID int,
> StoreID int,
> TransactionNumber int,
> UnitCost money
> Quantity float,
> Amount money)
>
> The first change that is made when we copy this data into our warehouse,
> is to make the "Date" column more useful. What if a manager wants to,
> group sales by month, or by day of the week, or quarter. The only way you
> could do would be to start assembling queries that involves clauses like:
>
> GROUP BY DatePart('month', Date)
> or
> SELECT DatePart('dddd', Date)
> or
> SELECT DatePart('weekday', Date)
>
> But DatePart("weekday") returns a number. If they want the report to
> say"Monday", "Tuesday", ..., "Saturday", "Sunday" they'll have to do some
> extra work. Besides, is 2 a Tuesday? Do the numbers start at 1 or zero? Is
> Monday the first day of the week, or is it Sunday. Managers aren't going
> to know these things. Plus, if you want to have SQL Server doing some
> grouping based on parts of dates, the performance drops a lot.
>
> The solution, is to change the date column in your transactions table
> into:
>
> CREATE TABLE Transactions (
> DateID int,
> ...)
>
>
> And now you create a Date "dimension"
>
> CREATE TABLE Dates (
> DateID int,
> Date datetime,
> FullDateDescription varchar(200), --i.e. Saturday, February 14th, 1998
> DayOfWeek varchar(50), --i.e. "Saturday"
> DayNumberInEpoch int,
> WeekNumberInEpoch int,
> MonthNumberInEpoch int,
> DayNumberInCalendarMonth int, --i.e. 14
> DayNumberInCalendarYear int, --i.e. 45
> DayNumberInFiscalMonth int,
> DayNumberInFiscalYear int,
> LastDayInWeekIndicator varchar(50), --("Last Day In Week", "Not Last
> Day In Week")
> LastDayInMonthIndicator varchar(50), --("Last Day In Month", "Not Last
> Day In Month")
> CalendarWeekEndingDate datetime, i.e. 2-15-1998
> CalendarWeekNumberInYear int,
> CalendarMonthName varchar(50), --i.e. "February"
> CalendarYearMonth varchar(100), --i.e. "1997-02"
> CalendarQuarter varchar(2), --i.e. "Q1"
> CalendarYearQuarter varchar(10), --i.e. "1997-01"
> CalendarHalfYear varchar(2), --"1H"
> CalendarYear varchar(4), "1998"
> FiscalWeek varchar(3), "F06"
> FiscalWeekNumberInYear int,
> FiscalMonth varchar(20), --i.e. "February"
> FiscalMonthNumberInYear int,
> FiscalYearMonth varchar(50), --i.e. "F1998-06"
> FiscalQuarter varchar(4), --i.e. "FQ01"
> FiscalYearQuarter varchar(10), --i.e. "F1998-Q01"
> FiscalHalfYear varchar(3), --"F1H"
> FiscalYear int, --i.e. 1998
> SellingSeason varchar(100), --("New Years", "Valentine's Day", "St.
> Patrick's Day", "Easter", "Summer", "4th of July", "Memorial Day", "Labor
> Day", "Back to School", "Halloween", "Thanksgiving", "President's Day",
> "Christmas", "None")
> MajorEvent varchar(100), (Similar to above, but also things like "Super
> Bowl Sunday" or "Labor Strike")
> HolidayIndicator varchar(50), --("Holiday", "Non-holiday")
> WeekdayInsicator varchar(50), --("Weekday", "Weekend")
> and more!
>
> i'll add a few quotes from the book:
>
> "we can cover the history we have stored, as well as several years in the
> future. Even 10 years' worth of days is only about 3,650 rwos, which is a
> relativly small dimension table"
>
> "The holiday indicator takes on the values of Holiday or Nonholiday.
> Remember that the dimension talb eattributes serve as report labels.
> Simply populating the hiliday Insicator with a Y or N woudl be far less
> useful. Imagine a report whre we're comparing hiliday sales for a given
> product verses non-holiday sales. Obviously, it would be helpful if the
> columns has meaningful values such as Holiday/Nonholiday verses a cryptic
> Y/N. Rather than decoding crypting flags into understandable labels in a
> report application, we prefer that the decode be stored in the database so
> that a consitent value is available to all users regardless of their
> reporting environment.
>
> "Some designers pause at this point to ask why an explicit date dimension
> talbe is needed. They reason that if the date key in the fact talbe is a
> date-type field, then any SQL query can directly contrain on the fact
> table date key and use natural SQL date semantics to filter on month or
> year while avoiding a supposedly expensive join. This reasoning falls
> apart for several reasons. First of all, if our relational database can't
> handle an efficient join to the date dimension talbe, we're already in
> deep trouble. Most database optimizers are quite efficient at resolving
> dimensional queries; it is not necessary to avoid joins like the plague.
> Also, on the performance fornt, most databases don't index SQL date
> calculations, so queries contraining on an SQL-calculated field wouldn't
> take advantage of an index.
>
> In terms of usability, the typical business users is not versed in SQL
> date semantics, so he or she would be unable to directly leverage inherent
> capabilities associated with a date data type. SQL date functions do not
> support filtering by attributes such as weekdays verses weekends,
> holidays, fiscal periods, seasons, or major events. Presuming that the
> business needs to slice data by these nonstandard date attributes, then an
> explicit date dimension table is essential. At the bottom line, calendar
> logic belongs in a dimension table, not in the application code. Finally,
> we're going to suggest that the date key is an integer rather than a date
> data type anyway. An SQL-based date key typically is 8 bytes, so you're
> wasting 4 bytes in the fact table for every date key in every row. "
>
> He talks about how 10 years of dates is only a few thousand rows, so we
> can afford the extra database space with all these denormalized fields. On
> the other hand the "fact" table (in this example the Transactions table)
> is millions or billions of rows. We want that table to be as narrow as
> possible. One extra byte per row in a billion row table is an extra
> gigabyte of database size.
>
>
> Another example from the book, that helps to reinforce the point. When
> talking about the "Product" dimension table:
>
> Products
> Product Key (Primary Key)
> Product Description
> SKU Number (Natural Key)
> Brand Description
> Category Description
> Department Description (i.e. "Frozen Foods", "Pet Food", "Dairy",
> "Meat", etc)
> Package Type Description
> Package Size
> Fat Content
> Diet Type
> Weight
> Weight Units Of Measure
> Storage Type
> Shelf Life Type
> Shelf Width
> Shelf Height
> Shelf Depth
> ...and more
>
>
> "A reasonable product dimension table would have 50 or more descriptive
> attributes. Each attribute is a rich source for constraining and
> constructing row headers."
>
> In our normal database thinking, would would define a Departments table,
> and put a key in the products table to departments. But we don't do that
> in a warehoue. Imagine there are 150,000 products and only about 50
> departments,
>
> "Thus, on average, there are 3,000 repetitions of each unique value in
> department attribute. This is all right! We do not need to separate these
> repeated values into a second normalized table to save space. Remember
> that dimension table space requirements pale in comparison with fact table
> space considerations"
>
>
> Data for the warehouse is usually extracted nightly from the live
> transactional system, transformed into a format suitable for the
> warehouse, and then loaded into the data warehouse; ready for use by
> managers come morning. This process has a special name, ETL (Extract,
> Transform, Load).
>
> You're already seen dimension tables; relativly short and very wide (large
> number of columns). Dimension tables usually don't measure things that
> "happen"; they usually measure things that are. (Customers, Patrons,
> Airplanes, Products, Dates, Companies, Diseases, etc).
>
> The central table, in this example the Transactions table, is known as a
> "Fact" table. Facts are things that happen, or take place, or get measured
> (items on a customer's bill, a line item on a doctor's bill, a boarding
> pass to get on an airplace, a daily snapshot of inventory levels, a
> monthly snapshot for each bank account).
>
>
> To sum up:
> A data warehouse is a specially designed database system, separate from
> the "live" system, that is organized in such as way that makes it easy for
> managers to get at the data.
>
> So, have you ever used a data warehouse? It is not magically different
> from any other database - only the way it is structured is special.
>
>
>
>
> "Gen" <me@xxxxxxxxxxxxx> wrote in message
> news:%236Aewx%23sFHA.3548@xxxxxxxxxxxxxxxxxxxxxxx
>> I'm thinking I may have worked with a data warehouse before, but how can
>> I be sure? I am applying for jobs that ask for an understanding of
>> multiple data warehouses but the definition for data warehouse to me is
>> unclear since i'm not sure if i've seen one (or worked with one). Would
>> it be the same as an adhoc database (many views coming from different
>> databases)?
>>
>> If you answer, can you give me a basic example?
>>
>> thanks.
>>
>
>
.
- Follow-Ups:
- Re: what is a data warehouse?
- From: Ian Boyd
- Re: what is a data warehouse?
- References:
- what is a data warehouse?
- From: Gen
- Re: what is a data warehouse?
- From: Ian Boyd
- what is a data warehouse?
- Prev by Date: Re: what is a data warehouse?
- Next by Date: Re: End-user software to query a warehouse?
- Previous by thread: Re: what is a data warehouse?
- Next by thread: Re: what is a data warehouse?
- Index(es):
Relevant Pages
|