Cheap and Secure Web Hosting Provider : See Now

# [Solved]: Scheduling Algorithm with limitations

, ,
Problem Detail:

I have a list of webpages and I must download them frequently, each webpage got a different download frequency. Based on this frequency we group the webpages in 5 groups:

Items in group 1 are downloaded once per 1 hour items in group 2 once per 2 hours items in group 3 once per 3 hours items in group 4 once per 12 hours items in group 5 once per 24 hours 

This means, we must download all the group 1 webpages in 1 hour, all the group 2 in 2 hours etc.

I am trying to make an algorithm. As input, I have:

a) DATA_ARR = one array with 5 numbers. Each number represents the number of items in this group.

b) TIME_ARR = one array with 5 numbers (1, 2, 3, 12, 24) representing how often the items will be downloaded.

b) X = the total number of webpages to download per hour. This is calculated using items_in_group/download_frequently and rounded upwards. If we have 15 items in group 5, and 3 items in group 4, this will be 15/24 + 3/12 = 0.875 and rounded is 1.

Every hour my program must download at max X sites. I expect the algorithm to output something like:

Hour 1: A1 B0 C4 D5 Hour 2: A2 B1 C2 D2 ... 

A1 = 2nd item of 1st group
C0 = 1st item of 3rd group

My algorithm must be as efficient as possible. This means that I should never download items more often than the update frequency of their group (unless I have absolutely no other choice)(see example)

Example:

group 1: 0 items | once per 1 hour group 2: 3 items | once per 2 hours group 3: 4 items | once per 3 hours group 4: 0 items | once per 12 hours group 5: 0 items | once per 24 hours 

We calculate the number of items we can take per hour: 3/2+4/3 = 2.83. We round this upwards and it's 3.

Using pencil and paper, we can found the following solution:

Hour 1: B0 C0 B1 Hour 2: C1 B2 C2 Hour 3: B0 C3 B1 Hour 4: C0 B2 C2 Hour 5: B0 C1 B1 Hour 6: C3 B2 C2 and repeat the above. 

We take C0, C1 and C3 once every 3 hours. We also take B0, B1 and B2 once every 2 hours.

We take C2 once every 2 hours. This is more often than needed, wasting bandwidth, but if there is no other way around (like here), we will take it every 2 hours. My question on SO here and Math Overflow here made me understand that sometimes you are forced to download items more often than the absolutely minimum.

Question: Please, explain to me, how to design an algorithm able to download the items, while using the absolute minimum number of downloads? Brute force is NOT a solution and the algorithm must be efficient CPU wise because the number of elements can be huge.

Here is a very simple algorithm that will probably suffice in practice.

For each site, keep track of when you last visited, and how long you're supposed to go between downloads. This lets you tell how "overdue" each site is (the difference between those two numbers).

Now at any point in time, you are permitted to download $X$ sites. So, I suggest that you download the $X$ sites that are most overdue.

(Note that if $X$ is large enough, this rule may tell you to download some sites before they are due to be downloaded again. You can either download them now earlier than necessary, or decide to skip downloading them at all at this point and wait for later, depending upon what your goal is.)

If for some reason you absolutely need the optimal solution (not one that is good enough), then you could formulate this as an integer linear program and try solving it using an off-the-shelf ILP solver. I am very skeptical of the request for an absolutely optimal solution; I think that is poorly motivated and I doubt it will be necessary in most practical solutions. In practice, the loss from other factors not considered in this simplified model will almost certainly dominate any slight loss of optimality from the simple algorithm above.

But if you insist you absolutely must have the optimal, periodic solution and you're certain you know what you need, then you can formulate this as an ILP problem. Suppose the period is $p$ hours. Introduce integer variables $x_{t,i}$, with the intended meaning that $x_{t,i}=1$ means that the $i$th website is downloaded at time $t$, and $x_{t,i}=0$ means that it isn't. (Here $t$ ranges over $1,2,\dots,p$ and $i$ ranges over the number of websites you have.) Introduce the constraints $0 \le x_{t,i} \le 1$, to force each $x_{t,i}$ to be $0$ or $1$. Since you're only allowed to download at most $X$ websites per hour, introduce the constraint

$$\sum_{i} x_{t,i} \le X \text{ for } t=1,2,\dots,p.$$

If you want to ensure that item $i$ is downloaded at least $n_i$ times every $p$ hours, then introduce the constraint

$$\sum_{t} x_{t,i} \ge n_i \text{ for all } i.$$

(You can derive the desired value of $n_i$ from the required frequency for downloading item $i$; e.g., if the desired frequency is $f_i$, you can set $n_i=\lceil f_i p \rceil$. If you prefer to set a smaller window, say length $\ell$, and require that every window of length $\ell$ download site $i$ at least $n_i= \lceil f_i \ell \rceil$ times, you can do that instead; that will give you a more stringent requirement -- just remember to treat time as periodic, when you construct your windows.) Now check whether this ILP has a feasible solution. If it does, you've found a periodic solution of period $p$. Repeat this for each value of $p$ from one to, say, a thousand or so (solving one ILP per possible value of $p$), and keep the best solution you've found. It's unlikely this will yield much improvement over the simple algorithm, but hey, there you go.