Data mining is the discovery of knowledge through the analysis of enormous sets
of data: extracting the meaning of the data and then predicting future trends.
Data mining helps us uncover hidden information in large databases, and it helps
companies make sound decisions based on knowledge and information.
If we look closely at any data-mining tool, we can see that there is some common
core logic that is independent of the data and the application. Most existing
implementations ignore that fact and concentrate on a specific problem, so the
tool becomes limited to a particular set of data for a specific application.
Data mining is also about finding interesting patterns in data. The main
challenge for any data-mining engine is how to apply different algorithms or
techniques to different sets of data in order to find interesting patterns that
are useful to the business. It is extremely difficult to come up with a standard
way of analyzing the data. The enormous volume and complexity of the data make
it impossible to run the same algorithms on different datasets. Several vendors
are trying to solve this problem, but most of them support only a subset of the
available algorithms. None of them has come up with a stable engine that can
work on any data set and in any domain.
In the last decade, improvements in storage and CPU speed have created a huge
opportunity for data mining applications, ranging from CRM to medical and health
care applications. The evolution of data mining is shown in Table 1.
It is now very difficult to develop a single application that can take care of
all of these problems. An application that can iterate through any data and find
patterns is still only a dream. Data mining also deals with useful patterns, not
just patterns, and whether a pattern is useful depends on the context in which
it is applied. Present-day tools depend solely on the expert to decide which
algorithms to apply and how to analyze the output, because most of them are
generic and no context-specific logic is attached to the application.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England, last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England, last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |
Table 1. Steps in the Evolution of Data Mining [12].
Here is a summary of the problems that we face today with existing data mining
tools:
1. Difficult to use: Existing data mining tools try to cover all the different
data mining applications, so they become very difficult to configure and run.
2. Needs an expert to run the tool: No domain- or problem-specific logic is tied
to the tool, so an expert is needed to run the tool and analyze the results.
3. Difficult to add new functionality: Because of the size and complexity of
each tool, it is very difficult to add any new feature.
4. Difficult to interface: Algorithms developed by other companies cannot be
integrated with the tool easily.
5. Short lifetime: There is no stable component in the tool, so the tool becomes
obsolete over time. As new tools take over the market, changing the existing
tool to incorporate new features is difficult and requires a lot of changes.
6. Limited number of algorithms: Existing tools provide only a limited number of
algorithms, and support for combining multiple algorithms is often very limited.
7. Needs a lot of resources: Existing tools are not optimized for any specific
application, so they need a lot of resources, such as runtime memory and hard
disk space.
Thus, this workshop is driven by three main questions. First, “how can we
develop a unified data mining engine (UDME)?” Second, “what kinds of
technologies and tools are needed to build such an engine?” And third, “how can
we overcome the existing problems?”
Building such an engine is not an easy exercise, especially when several
factors can undermine its quality and success, such as cost, time, and the lack
of systematic approaches. We would like to architect and develop a Unified Data
Mining Engine (UDME) that has some or all of the following properties:
1. Ease of use: Multiple tools can be developed easily by focusing on specific
problems, because they can all share the core services provided by the UDME.
2. No need for an expert to run the tool: Domain-specific knowledge, such as
verification and the selection of tools, can be built into the tool itself while
it is being developed.
3. Easy to add new functionality: The application-specific logic should be
separate from the core logic, so new application-specific functionality can be
added easily without any change to the core logic.
4. Easy to interface: The design should be based on a system of independent
patterns, which can be developed by third-party vendors.
5. Long lifetime: The engine should be based on stable core logic that has a
long lifetime; the application logic should be loosely connected to it and can
change over time.
6. Multiple algorithms: The engine must support any number of algorithms.
7. Fewer resources: The proposed engine should be developed by connecting
several patterns or components. Depending on the application and the domain, the
engine can use only the patterns or components that are necessary, so it needs
fewer resources than existing tools.
8. Stable: The engine should be stable over time and provide a simple way to
apply different data mining and data analysis algorithms to different sets of
data in any domain.
9. Isolation of application logic: The stable knowledge must be isolated from
any application-specific logic, so that different applications can use the same
core knowledge, which need not be changed.
10. Minimum maintenance cost: The maintenance cost of such an engine should be
minimal.
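To make properties 3, 4, 6, and 9 more concrete, the following is a minimal
sketch of one way a stable core could be kept separate from pluggable algorithms
and from application-specific logic. It is only an illustration under stated
assumptions, not a proposed UDME design: the names MiningEngine, SalesTool,
register, run, mean, and value_range are all hypothetical and do not come from
any existing tool.

```python
from typing import Callable, Dict, Iterable, List

# All names below are hypothetical and chosen only for illustration.
Algorithm = Callable[[List[float]], float]


class MiningEngine:
    """Stable core: it only registers and runs algorithms."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Algorithm] = {}

    def register(self, name: str, algorithm: Algorithm) -> None:
        # Third-party algorithms plug in here; the core itself does not change.
        self._algorithms[name] = algorithm

    def available(self) -> List[str]:
        # Names of all registered algorithms, in sorted order.
        return sorted(self._algorithms)

    def run(self, name: str, data: Iterable[float]) -> float:
        if name not in self._algorithms:
            raise KeyError(f"no algorithm registered under {name!r}")
        return self._algorithms[name](list(data))


def mean(values: List[float]) -> float:
    """A trivial stand-in for a real mining algorithm."""
    return sum(values) / len(values)


def value_range(values: List[float]) -> float:
    """A second stand-in, to show that any number can be registered."""
    return max(values) - min(values)


class SalesTool:
    """Application-specific layer: it selects the algorithm for the user."""

    def __init__(self, engine: MiningEngine) -> None:
        self._engine = engine

    def typical_monthly_sales(self, monthly_units: List[float]) -> float:
        # The domain knowledge (which algorithm fits this question) lives
        # here, not in the core engine.
        return self._engine.run("mean", monthly_units)


engine = MiningEngine()
engine.register("mean", mean)
engine.register("range", value_range)
tool = SalesTool(engine)
```

In this sketch, adding a new algorithm is a single call to register, and a new
application is a new class like SalesTool; in neither case is MiningEngine
itself edited.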
The workshop will address the challenges of building a unified data mining
engine. We invite researchers, framework developers, and application developers
to discuss and debate the following questions:
I. UDME Architecture
a. What is the best approach for building such an engine?
b. What is the basis for creating the engine architecture?
c. Are there any guidelines, methodologies, and/or processes for creating and
developing an engine architecture?
d. What are the components of the unified data mining engine architecture?
e. What kinds of patterns or components appear in the UDME?
f. Show how your engine architecture meets the above UDME properties.
II. UDME Development
a. What is the best way to develop such an engine?
b. What are the techniques and tools for developing such an engine?
c. Show how to extend your engine to new application logic.
More information will be available at:
http://www.oopsla.org/oopsla2007 (OOPSLA 2007 Link)
http://www.oopsla.org/oopsla2007/index.php?page=sub/&id=160 (Workshop Link 1)
http://www.engr.sjsu.edu/~fayad/workshops/UDME07 (Workshop Link 2)
http://www.vrlsoft.com/workshops/UDME07 (Workshop Link 3)
Detailed instructions for electronic paper submission and the review process can
be found at http://www.compsac.org/. Developers and programmers who are
interested in participating in the workshop are requested to submit a short
position paper (3-5 pages) or a regular workshop paper (6-15 pages,
double-spaced, including figures) presenting views and experiences that are
relevant to the discussion topic. The title page must include an abstract of at
most 150 words, five keywords, full mailing address, e-mail address, phone
number, fax number, and a designated contact author. Workshop papers will be
selected on the basis of their originality, quality, and relevance to the
workshop. All submitted papers will also be evaluated according to their
originality, significance, correctness, presentation, and relevance. Papers
should be submitted electronically to the chair. Please follow the instructions
provided on the web page. Camera-ready manuscripts must follow the ACM SIGPLAN
conference proceedings style and guidelines. We also encourage authors to
present novel and fresh ideas, critiques of existing work, and practical studies.
Each accepted workshop paper must be presented in person, either by the author
or by one of the co-authors. To foster lively discussion, authors are encouraged
to present open-ended questions and one or two main statements for discussion at
the workshop. Submissions must be made in either MS-Word or RTF format (please
DO NOT compress files).
Depending on the total number and spread of contributions, the scope may be
narrowed further to ensure an effective communication and information-sharing
session. Accepted position papers will be distributed to the participants just
before the workshop and will be made generally available through the WWW and
FTP. Accepted papers will also be published in the Workshop Proceedings. At
least one of the authors of each accepted paper must register as a full delegate
for the workshop. Selected papers will be published in a future issue of the
online International Journal of Patterns (IJOP), www.ijop.org, and/or the
International Journal of Software Architectures (IJSA), www.ijsa.net.
People who are interested in participating in the workshop without making a
submission are requested to fill out the participation form and e-mail it to any
of the workshop chairs.
-------------------------------------------------
PARTICIPATION FORM:
Name and Affiliation:
Position:
Address:
E-mail:
URL:
Areas of interest:
Reasons for Attending?
-------------------------------------------------
Please note that registration is mandatory in order to participate in the
workshop. An early registration discount is available to all prospective
participants. An overhead projector and a flipchart will also be made available
to all participants.
For further questions, contact the organizers by e-mail or by phone.
1. Welcome and introduction of participants. The organizers will first provide a
short overview of the open issues and of the main arguments arising from the
position papers. (Estimated time: 20-30 minutes)
2. Selected authors (representing the main trends) will each be allotted 20
minutes to explain how their position relates to other positions and what they
see as the three major issues. We expect about 5-10 position papers in this
session. (Estimated time: 120-130 minutes)
3. The organizers will propose a process for identifying the major issues, and
the participants will then discuss and select what they perceive to be the
hottest issues to examine and analyze. (Estimated time: 10-15 minutes)
4. The participants will work for 70-95 minutes in small groups, each led by a
designated moderator. Each group will deal with two different identified hot
issues and will produce a summary note in the form of points and counterpoints,
showing either how several views are irreducibly opposed or how they are
complementary. The number of groups will depend mainly on the number of
participants and issues selected; ideally there should be 3-5 people in each
group. (Estimated time: 60-70 minutes)
5. Each group will be given 10-15 minutes to present its findings and inferences
to the workshop. A closing discussion will follow. The workshop report will be
composed on the basis of these findings and will include a clear-cut agenda for
future exploration and cooperation; it will be made available through the WWW
and FTP. (Estimated time: 50-60 minutes for five teams)
(Total estimated time: 285-315 minutes, i.e. about five hours +/- 15 minutes;
lunch and breaks are not included.)
IMPORTANT DATES -- Will be updated based on the acceptance process.
Submission deadline: September 14, 2007
Acceptance notification: September 30, 2007
Camera-ready paper due: October 10, 2007
Workshop date: October 22, 2007 (starts at 8:30 a.m.)
Dr. M.E. Fayad (Chair)
Professor of Computer Engineering
Computer Engineering Dept., College of Engineering
San José State University
One Washington Square, San José, CA 95192-0180
Ph: (408) 924-7364, Fax: (408) 924-4153
E-mail: m.fayad@sjsu.edu, mefayad@gmail.com
http://www.engr.sjsu.edu/fayad

Dr. Tarek Helmy (Co-Chair)
College of Computer Science and Engineering,
Department of Information and Computer Science,
King Fahd University of Petroleum and Minerals,
Dhahran 31261, Mail Box. 413, Saudi Arabia
Ph: 9663-860-1967 (Office)
E-mail: helmy@ccse.kfupm.edu.sa

Dr. Rami Bahsoon (Co-Chair)
School of Engineering and Applied Science
Aston University in Birmingham, Birmingham B4 7ET, United Kingdom
Office: Main Building, Second Floor, MB 213E
Ph: +44 (0) 121 204 3464
Fax: +44 (0) 121 204 3681
URL: http://www-users.aston.ac.uk/~bahsoonr/index.htm

Professor Dilip Patel (Co-Chair)
Faculty of Business, Computing and Information Management
London South Bank University
103 Borough Road
London SE1 0AA, United Kingdom
Tel: +44 (0)20 7815 7429

Somenath Das (Co-Chair)
eBay, Inc.
2211 North First Street
San Jose, CA 95131, USA
Ph: 408 967 4151
E-mail: sodas@ebay.com

Eduardo M. Segura (Co-Chair)
vrlSoft, Inc.
2065 Martin Ave., Suite 103
Santa Clara, CA 95050-2707
Phone/Fax: (408) 654-8972
E-mail: esegura@vrlsoft.com, eduardo.segura@sjsu.edu
http://www.vrlsoft.com

Rami Bahsoon, Aston University in Birmingham, United Kingdom
Rogerio Atem de Carvalho, Federal Center for Technological Education of Campos, Brazil
Chia-Chu Chiang, University of Arkansas, Little Rock, USA
Issam Wajih Damaj, Dhofar University, Salalah, Sultanate of Oman
Somenath Das, eBay, Inc., USA
Dilip Patel, London South Bank University, United Kingdom
Jurgen Dix, Clausthal University of Technology, Germany
M.E. Fayad, San Jose State University and vrlSoft, Inc, Silicon Valley, USA
Jaafar Gaber, Université de Technologie de Belfort-Montbéliard, France
Rosario Girardi, Federal University of Maranhão, São Luís, Brazil
Dr. Tarek Helmy, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
Hoda Hosny, The American University in Cairo, Egypt
A. Kannammal, Coimbatore Institute of Technology, Tamil Nadu, India
Mohamed-Khireddine Kholladi, University of Constantine, France
Dae-Kyoo Kim, Oakland University, USA
Roger (Buzz) King, University of Colorado, Boulder CO, USA
Jianzhi Li, De Montfort University, United Kingdom
Nashat Mansour, Lebanese American University, Lebanon
Tokuro Matsuo, Yamagata University, Japan
Srini Ramaswamy, University of Arkansas, Little Rock, USA
Miguel Garre Rubio, Universidad de Alcalá, Madrid, Spain
Eduardo M. Segura, San Jose State University and vrlSoft, Inc, Silicon Valley, USA
Jaroslav Zendulka, Brno University of Technology, Czech Republic