TY - GEN
T1 - A new paradigm in data intensive computing
T2 - Challenges of Large Applications in Distributed Environments, CLADE 2006
AU - Kosar, Tevfik
PY - 2006
Y1 - 2006
AB - The unbounded increase in the computation and data requirements of scientific applications has necessitated the use of widely distributed compute and storage resources to meet the demand. In a widely distributed environment, data is no longer locally accessible and must therefore be retrieved and stored remotely. Efficient and reliable access to data sources and archiving destinations in such an environment brings new challenges. Placing data on temporary local storage devices offers many advantages, but such "data placements" also require careful management of storage resources and data movement, i.e., allocating storage space, staging in input data, staging out generated data, and de-allocating local storage after the data is safely stored at the destination. Traditional systems closely couple data placement and computation, and treat data placement as a side effect of computation. Data placement is either embedded in the computation, delaying it, or performed by simple scripts that do not have the privileges of a job. The inadequacy of traditional systems and existing CPU-oriented schedulers in dealing with the complex data handling problem has ushered in a new era: that of data-aware schedulers. One of the first examples of such schedulers is the Stork data placement scheduler. In this paper, we discuss the limitations of traditional schedulers in handling the challenging data scheduling problem of large-scale distributed applications, give our vision for the new paradigm in data-intensive scheduling, and elaborate on our case study: the Stork data placement scheduler.
KW - Data placement
KW - Data-aware
KW - Data-intensive
KW - Grid
KW - Scheduling
KW - Staging
KW - Storage management
KW - Stork
UR - https://www.scopus.com/pages/publications/33845932640
M3 - Conference contribution
AN - SCOPUS:33845932640
SN - 1424404207
SN - 9781424404209
T3 - Proceedings - Challenges of Large Applications in Distributed Environments, CLADE 2006
SP - 5
EP - 12
BT - Proceedings - Challenges of Large Applications in Distributed Environments, CLADE 2006
Y2 - 19 June 2006
ER -