
Training Deep Neural Networks to Solve Deterministic Optimisation Problems



1. Introduction


1.1 Background


Optimisation problems lie at the heart of decision-making in many fields, such as logistics, finance, power systems, and AI [?]. Deterministic optimisation problems, where the relationship between input and output is well-defined and all parameters are known, have typically been solved using classical mathematical techniques like linear programming, dynamic programming, and branch and bound. Though these methods provide exact solutions, their ineffectiveness on large-scale or complex instances has led to the investigation of other techniques [?].


Deep learning, likewise, has transformed AI by providing tools that approximate complex mappings and extract patterns from data. Deep Neural Networks (DNNs), the basic deep learning model [?], have achieved significant accomplishments across a broad spectrum of applications, solving challenges like image classification and language processing. More recently, researchers have started exploring the application of DNNs to optimisation problems, given their ability to identify patterns from data, minimize computational cost, and solve non-linear problems [?].


In this chapter, you will be introduced to the notion of training DNNs for deterministic optimisation tasks and provided with an overview of the rationale behind its advantages and limitations.


1.2 Problem Statement


The emerging integration of Deep Neural Networks (DNNs) into deterministic optimisation is a breakthrough that introduces new ideas for solving optimisation problems [?]. However, it raises several essential issues that require consideration:



  • The performance of DNNs must be investigated to understand how efficiently they can solve deterministic optimisation problems compared to classical methods.

  • The generalization capability of these networks, especially their capacity to solve a wide variety of problems, is questionable.

  • Some work must be done to evaluate the applicability of DNNs to real-life problems, where their usage might not be more efficient than traditional methods.

  • The training of DNNs to provide accurate and reliable solutions for deterministic optimisation problems remains a significant challenge [?].


This research seeks to address these challenges by systematically mapping out the methodologies and techniques required to train DNNs for optimisation tasks, thereby enhancing the existing knowledge in the field.



1.3 Objectives of the Study


The primary objectives of this study are:



  • To understand the theoretical foundations and principles of deterministic optimisation and deep learning.

  • To develop and train deep neural networks capable of solving specific deterministic optimisation problems.

  • To evaluate the performance of DNN-based solutions against classical optimisation methods in terms of accuracy, computational efficiency, and scalability.

  • To identify and mitigate the challenges encountered in training DNNs for optimisation tasks.


By achieving these objectives, this study aims to bridge the gap between optimisation theory and neural network methodologies, contributing to both fields.


1.4 Motivation and Significance


The motivation behind the integration of DNNs with optimisation comes from the opportunity to provide significant enhancements in several areas of interest. Traditionally, linear programming, mixed-integer programming, and dynamic programming are used for finding optimal solutions, as they are based on precise mathematical models. The incentive to use DNNs here is that they can approximate complicated functions and therefore tackle a broader class of optimisation problems faster through data-centric methods. An important benefit is the relatively short time that it takes for a trained DNN to produce a response, which is especially valuable in areas such as supply chain management and financial stock trading, where decisions must be made quickly. Moreover, DNNs are well suited to complex problem structures because they do not impose assumptions such as linearity, which classical methods often require. This makes them well-suited for nonlinear and high-dimensional problems. Additionally, DNNs can benefit from data analysis since, when trained on previous data, they adapt to practical trends and peculiarities that a deterministic model cannot anticipate.


The main challenge preventing the straightforward use of DNNs for deterministic optimisation problems is the training process itself. Training requires extensive clean data, and the model architecture must be adapted to preserve the structural and functional characteristics of the problem. This work primarily aims to provide a strong basis for the use of DNNs in the deterministic optimisation context by focusing on these aspects, with the goal of more efficient optimisation.


By incorporating traditional optimisation paradigms with data-centric approaches employing DNNs, this work fills a gap at the frontier of existing practices. This study aims to amplify knowledge in this sector by:



  • Finding ways of creating and utilizing training data effectively to train DNNs for optimisation problems.

  • Pointing out practical problems and applications related to DNNs in real deterministic optimisation tasks.

  • Providing a stepwise analysis comparing DNN-centered methods against classical optimisation techniques.



1.5 Scope of the Study


The scope of this research is specifically delimited to analyzing the capability of Deep Neural Networks (DNNs) in the context of deterministic optimisation problems. In particular, it concerns linear programming, integer programming, and quadratic programming problems, which are central problem classes in optimisation theory and applications. The method trains DNNs within a supervised learning framework, where reference solutions produced by traditional optimisation algorithms form the training data. The performance of these networks will be evaluated quantitatively by solution accuracy, computational time, and the networks' ability to generalize, among other performance measures.


Today, deterministic optimisation approaches appear most developed in supply chain management, portfolio selection, and network design problems. An example within supply chain management is Support Vector Regression for non-linear demand forecasting. Support Vector Regression (SVR) is a kernel-based optimisation technique that predicts demand when the relationships between variables are non-linear. If we were to do this via DNNs, we could structure a test using inputs that are problem-specific parameters and features that define the structure and constraints of the supply chain system.


For demand forecasting with SVR (or a DNN emulating it), the inputs to consider are:



  • Historical demand data: Time-series data of past demand for different products.

  • Product features: Characteristics of the products, such as size, type, or price.

  • Seasonality trends: Features representing the seasonal or cyclical nature of demand.

  • Economic indicators: Factors like GDP, inflation, or customer spending behavior that influence demand.

  • Lead times: Delivery or production times that affect replenishment planning.

  • Cost parameters: Costs associated with production, inventory holding, or transportation.

  • External variables: Weather patterns, promotional activities, or competitor actions that may impact demand.



The outputs of the kernel-based model represent the predicted solutions, which serve as decision-support tools for the supply chain manager. For SVR in demand forecasting, the outputs to consider are:



  • Forecasted demand: Quantitative predictions of demand for specific products over a time horizon (e.g., next week, month, or quarter).

  • Optimal inventory levels: Suggested inventory quantities based on forecasted demand and existing supply chain constraints.

  • Delivery schedules: Optimised timelines for replenishing stock to meet demand with minimal delays.

  • Capacity utilization insights: Recommendations for production or warehouse usage based on forecasted demand.

  • Cost-effective solutions: Outputs that balance costs, such as holding costs, backorder penalties, and production costs.


The test architecture for SVR through DNNs would be structured as follows:



  • Input Layer: Encodes the input features (historical demand, product data, seasonality, external factors) as a structured feature vector. Each input is mapped into a higher-dimensional space using a kernel function (e.g., Radial Basis Function) to handle non-linear relationships.

  • Hidden Layers: SVR implicitly applies kernel transformations to capture the complex and non-linear relationships between the input parameters and demand. Unlike neural networks, SVR does not explicitly require multiple hidden layers, as the kernel trick operates in the transformed feature space.

  • Output Layer: Delivers predicted demand values (continuous outputs). The output can also be post-processed to ensure feasibility, such as rounding to integer values for product quantities or applying business constraints like capacity limits.



Other factors to consider/add-on:



  • Training Data: Historical data from traditional demand forecasting methods (e.g., moving averages, exponential smoothing) can serve as a benchmark for supervised training.

  • Data Augmentation: Variations of supply chain scenarios (e.g., introducing seasonal peaks, unexpected demand shocks, or varying cost constraints) will improve the model's robustness and generalization capabilities.

  • Performance Metrics: Forecast accuracy (e.g., Mean Squared Error, Mean Absolute Percentage Error), computation time for generating forecasts, and solution feasibility in terms of meeting real-world constraints (e.g., inventory holding limits or production capacity).



By using DNNs, we can integrate SVR into supply chain management in a potentially more efficient way. This efficiency can help organizations improve demand predictions and optimise decision-making processes, leading to cost reductions, efficient resource allocation, and improved service levels.
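To make this test concrete, the sketch below fits an SVR demand-forecasting model on synthetic data. The three feature columns (lagged demand, price, a seasonality index) stand in for the input categories listed above; the data, hyperparameters, and train/test split are hypothetical placeholders, not values from this study.

```python
from sklearn.svm import SVR
import numpy as np

# Hypothetical sketch of the SVR demand-forecasting test described above,
# on synthetic data. All figures are made up for illustration.
rng = np.random.default_rng(0)
n = 200
lagged_demand = rng.uniform(50, 150, n)
price = rng.uniform(5, 20, n)
season = np.sin(np.linspace(0, 8 * np.pi, n))    # cyclical demand pattern
X = np.column_stack([lagged_demand, price, season])
y = 0.7 * lagged_demand - 3 * price + 20 * season + rng.normal(0, 5, n)

model = SVR(kernel="rbf", C=100.0, epsilon=1.0)  # RBF kernel, as in the text
model.fit(X[:150], y[:150])                      # train on the first 150 periods
forecast = model.predict(X[150:])                # forecast the remainder
print(forecast[:5])
```

A DNN-based variant of this test would keep the same inputs and outputs but replace the kernel model with a trained feedforward network, enabling the head-to-head comparison this study proposes.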


What this study aims to evaluate across numerous practical applications is the extent of this increased efficiency, if present, as well as its accuracy. Although the study focuses only on deterministic problems, generalizing the results and extending them to stochastic and real-time optimisation is of broad importance for future work in optimisation.


1.6 Semester Plan and Project Summary


I will concentrate on solving deterministic optimisation problems and the design of algorithms that employ Deep Neural Networks (DNNs). The semester will therefore be underpinned by a schedule that follows a defined roadmap of milestones. During the first two weeks, I aim to thoroughly understand the theoretical backgrounds of deterministic optimisation and deep learning. I will review the appropriate literature to determine the principles and methods of optimisation with application to DNN models, as well as the training of appropriately customized models for given problems. The performance of the proposed models will be evaluated against these traditional optimisation methods in terms of solution accuracy, computational complexity, and scalability, which will be analyzed in detail.


To overcome the current difficulties associated with training DNNs for optimisation problems, I will also discuss methods of dealing with these issues, such as hyperparameter optimisation, changes in the network architecture, and other training techniques. In the closing stage of the semester, I shall consolidate the collected data into a report, present the lessons learned, and scope the findings for future research.


Timeline of Plan for Semester 2



  • Week 1: Review basic principles in the area of deterministic optimisation and deep learning.



  • Weeks 2-4: Review the literature on the relation between optimisation and neural networks and define important open research problems.



  • Weeks 5-9: Implement and test DNN models on standard optimisation problems and evaluate their performance against traditional methods.



  • Weeks 10-12: Discussion of findings, integration of results, and writing the final report.


Pursuant to this plan, the study seeks to identify important insights and add new dimensions to the development of optimisation theory and its practice.



2. Foundations of Deterministic Optimisation


Optimisation is the science of making the best possible decisions.


~ Stephen Boyd


2.1 Introduction to Deterministic Optimisation


2.1.1 General Introduction


At its core, deterministic optimisation is the art of making definitive, provably optimal decisions in a world stripped of ambiguity. Every parameter, from costs to constraints, is a fixed and known quantity. As all the parameters are known with certainty, there is no randomness involved. So, in essence, there are no what-ifs, no probabilities, no dice rolls when it comes to optimisation being deterministic in nature. This clarity makes deterministic models the bedrock of industries where precision is non-negotiable: scheduling hospital surgeries, designing aircraft wings, or allocating emergency relief supplies after a disaster [?].


However, this precision comes at a cost. Deterministic frameworks assume we live in a perfectly predictable universe, a luxury rarely afforded in reality. Why, then, do we still cling to them? Because they provide something invaluable: a mathematical sandbox. By removing randomness, we isolate the structural essence of a problem, allowing us to dissect it with surgical precision [?].


2.1.2 The Anatomy of a Deterministic Problem


Consider a logistics manager tasked with delivering vaccines to 100 clinics using 10 trucks. Each truck has a fixed capacity, each clinic a fixed demand, and each route a fixed cost. The goal? Minimize total delivery cost. This is a classic deterministic optimisation problem. Let's break it down:



  1. Variables: Which clinics does each truck visit? (Let $x_{ij} = 1$ if truck $i$ serves clinic $j$, else $0$.)

  2. Objective: Minimize total cost $\sum_{i=1}^{10} \sum_{j=1}^{100} c_{ij} x_{ij}$, where $c_{ij}$ is the cost of truck $i$ visiting clinic $j$.

  3. Constraints:

    • Each clinic gets exactly one truck: $\sum_{i=1}^{10} x_{ij} = 1$ for all $j$.

    • Truck capacities: $\sum_{j=1}^{100} d_j x_{ij} \le C_i$ for all $i$, where $d_j$ is clinic $j$'s demand and $C_i$ is truck $i$'s capacity.


This is a classic optimisation problem, but what makes it deterministic is that all parameters $(c_{ij}, d_j, C_i)$ are known before decisions are made. There's no ambiguity about tomorrow's fuel prices or sudden road closures [?].
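For intuition, here is a minimal sketch of the same deterministic model on a toy instance (3 trucks, 4 clinics): because every parameter is fixed, even brute-force enumeration recovers the provably optimal assignment. All numbers are hypothetical.

```python
import itertools
import numpy as np

# Toy instance of the vaccine-routing model above; all data are made up.
rng = np.random.default_rng(0)
n_trucks, n_clinics = 3, 4
cost = rng.integers(1, 10, size=(n_trucks, n_clinics))   # c_ij
demand = rng.integers(1, 5, size=n_clinics)              # d_j
capacity = np.array([6, 6, 6])                           # C_i

best_cost, best_assign = float("inf"), None
# Enumerate every assignment of clinics to trucks (each clinic -> one truck,
# which enforces the "exactly one truck per clinic" constraint by design).
for assign in itertools.product(range(n_trucks), repeat=n_clinics):
    load = np.zeros(n_trucks)
    for j, i in enumerate(assign):
        load[i] += demand[j]
    if np.all(load <= capacity):                         # capacity constraint
        total = sum(cost[i, j] for j, i in enumerate(assign))
        if total < best_cost:
            best_cost, best_assign = total, assign

print("optimal assignment:", best_assign, "cost:", best_cost)
```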


2.1.3 Why Determinism Still Matters in an Uncertain World


Critics argue that deterministic models are naively rigid. After all, the real world is messy: demand fluctuates, machines break down, pandemics erupt. But dismissing determinism is like dismissing a sculptor's chisel because it can't paint. Deterministic models excel in three key scenarios:



  1. Short-Term Decision Horizons: When outcomes unfold over hours or days (e.g., surgery scheduling), uncertainty is minimal [?].

  2. Controlled Environments: Manufacturing plants, data centers, and power grids are engineered to minimize surprises [?].

  3. Benchmarking: Deterministic solutions provide a gold standard to measure stochastic or heuristic approaches against [?].


Take the work of Leonid Kantorovich, who pioneered linear programming in 1939 and used it to optimise Soviet production schedules during WWII. His models assumed fixed resource allocations and factory outputs, a stark simplification. Yet they cut costs by 30% and earned him a Nobel Prize [?, ?]. Determinism, when applied judiciously, works.


2.1.4 The Price of Perfection


Deterministic optimisation isn't a free lunch. Its rigor demands three sacrifices:



  1. Computational Complexity: Many deterministic problems (e.g., integer programming) are NP-hard. Finding an exact solution for 100 clinics might take seconds; for 10,000 clinics, it could take until the heat death of the universe [?].

  2. Brittleness: A solution optimal for fixed parameters can catastrophically fail if those parameters shift. (Imagine your vaccine delivery plan collapsing because one truck breaks down [?].)

  3. Data Hunger: Deterministic models require precise data. Guesswork or rounding can render solutions useless, or worse, harmful [?].



These limitations set the stage for modern innovations: robust optimisation, stochastic programming, and yes, deep learning. But before we hybridize, we must master the pure form [11].



2.2 Classification of Deterministic Optimisation Problems


2.2.1 Linear Programming (LP)


Formulation and Scope: Linear Programming (LP) seeks to optimise a linear objective function subject to linear equality or inequality constraints. Its canonical form is:

$$\min_x \; c^T x \quad \text{subject to } Ax \le b, \;\; x \ge 0$$

where $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. The feasible region $\mathcal{X} = \{x \in \mathbb{R}^n \mid Ax \le b, \; x \ge 0\}$ is a convex polyhedron, and the optimal solution (if it exists) lies at a vertex of this polyhedron [?].


Challenges



  • Curse of Dimensionality: While the simplex method efficiently navigates polyhedral vertices, problems with $n > 10^6$ variables strain even state-of-the-art solvers due to memory and computational limits [?].

  • Sensitivity to Data: LP solutions are highly sensitive to perturbations in $c$, $A$, or $b$. A small change in production costs or resource availability can render a solution infeasible or suboptimal, necessitating techniques like post-optimality analysis [11].

  • Interpretability: While LP solutions are mathematically precise, translating them into actionable business decisions (e.g., explaining why a factory should produce 7 units) often requires rounding or heuristic adjustments [?].


Applications



  • Supply Chain Optimisation: Unilever uses LP to minimize transportation costs across 300 factories and 500 warehouses while meeting regional demand [?].

  • Energy Grid Management: California's CAISO grid operator employs LP to allocate electricity generation across solar, wind, and fossil fuels while satisfying transmission constraints [?].


Example: Oil Refinery Blending A refinery produces gasoline ($x_1$) and diesel ($x_2$) from crude oil. Each barrel of gasoline yields $40 profit, diesel $30. The refinery has 1000 barrels of crude, with blending ratios:

  • Gasoline requires 0.3 barrels of crude A and 0.7 barrels of crude B.

  • Diesel requires 0.6 barrels of crude A and 0.4 barrels of crude B.

Crude A availability: 400 barrels; Crude B: 600 barrels.


The LP formulation becomes:


$$\max_{x_1, x_2} \; 40x_1 + 30x_2$$

$$\text{s.t. } 0.3x_1 + 0.6x_2 \le 400 \;\; \text{(Crude A)}, \qquad 0.7x_1 + 0.4x_2 \le 600 \;\; \text{(Crude B)}, \qquad x_1, x_2 \ge 0$$

2.2.2 Integer Programming (IP)


Formulation and Scope Integer Programming restricts decision variables to integers, modeling discrete decisions like yes/no, counts, or sequences:


$$\min_x \; c^T x \quad \text{subject to } Ax \le b, \;\; x \in \mathbb{Z}^n$$

Binary variables ($x_i \in \{0, 1\}$) enable logical conditions (e.g., if warehouse $i$ is built, $x_i = 1$) [86].


Challenges



  • Combinatorial Explosion: Feasible solutions grow exponentially with $n$. A traveling salesman problem with 20 cities has $19! \approx 1.2 \times 10^{17}$ routes, making brute-force search impossible [?].

  • Relaxation Gaps: LP relaxations (ignoring integer constraints) often yield poor bounds. For example, relaxing $x \in \{0, 1\}$ to $0 \le x \le 1$ may produce fractional solutions (e.g., $x = 0.5$) that offer no actionable insight [?].

  • Symmetry: Identical subproblems (e.g., multiple indistinguishable machines in scheduling) bloat branch-and-bound trees, requiring symmetry-breaking constraints [86].



Applications



  • Airlines Crew Scheduling: Delta Airlines saves $100M annually using IP to assign 30,000 crew members to flights while complying with labor laws [?].

  • Chip Design: Intel uses IP to place billions of transistors on silicon wafers, minimizing signal delay and power consumption [?].


Example: Facility Location A retail chain must choose 3 warehouses from 10 candidates to serve 50 stores. Let $x_i = 1$ if warehouse $i$ is built, $y_{ij} = 1$ if store $j$ is served by $i$. The IP minimizes total cost:

$$\min_{x, y} \; \sum_{i=1}^{10} c_i x_i + \sum_{i=1}^{10} \sum_{j=1}^{50} d_{ij} y_{ij}$$

subject to

$$\sum_{i=1}^{10} y_{ij} = 1 \;\; \forall j \quad \text{(each store served)}$$

$$y_{ij} \le x_i \;\; \forall i, j \quad \text{(no service from unbuilt warehouses)}$$

$$\sum_{i=1}^{10} x_i = 3 \quad \text{(build 3 warehouses)}$$



2.2.3 Quadratic Programming (QP)


Formulation and Scope Quadratic Programming generalizes LP with a quadratic objective:



$$\min_x \; \tfrac{1}{2} x^T Q x + c^T x \quad \text{subject to } Ax \le b, \;\; x \ge 0$$

where $Q \in \mathbb{R}^{n \times n}$ is symmetric. Convex QP ($Q \succeq 0$) guarantees global minima; non-convex QP ($Q \not\succeq 0$) may have multiple local minima [11].



Challenges



  • Positive Semi-Definiteness: Non-convex QP problems are NP-hard, limiting solvers to local minima. For example, training a neural network involves non-convex loss landscapes with saddle points [49].

  • Ill-Conditioning: Poorly scaled $Q$ matrices (e.g., eigenvalues spanning $10^{-6}$ to $10^6$) cause numerical instability in gradient-based methods [?].

  • Sparse vs. Dense $Q$: Portfolio optimisation with dense covariance matrices ($\Sigma$) requires $O(n^2)$ storage, infeasible for $n > 10^4$ [?].


Applications



  • Robotics Trajectory Planning: QP optimises joint angles to minimize jerk (the third derivative of position) while avoiding obstacles [?].

  • Financial Risk Management: BlackRock uses QP to balance asset portfolios, minimizing variance $x^T \Sigma x$ subject to a target return $\mu^T x \ge R$ [?].
















Example: Support Vector Machines (SVMs) Given labeled data $(a_i, y_i)$ where $y_i \in \{-1, 1\}$, SVMs find the hyperplane $w^T a + b = 0$ that maximizes the margin between classes:

$$\min_{w, b} \; \tfrac{1}{2} \|w\|^2 \quad \text{s.t. } y_i(w^T a_i + b) \ge 1 \;\; \forall i$$

This convex QP problem is solved via Lagrange duality, with kernel tricks extending it to non-linear boundaries [?].


2.2.4 Nonlinear Programming (NLP)


Formulation and Scope Nonlinear Programming generalizes optimisation to non-linear objectives and constraints:



$$\min_x \; f(x) \quad \text{subject to } g_i(x) \le 0, \; i = 1, \dots, m; \qquad h_j(x) = 0, \; j = 1, \dots, p$$

where $f$, $g_i$, or $h_j$ are non-linear. Examples include chemical reactor design and neural network training [?].


Challenges



  • Global vs. Local Optima: Gradient descent may converge to local minima, e.g., in clustering with non-convex objectives [11].

  • Computational Cost: Second-order methods (Newton-Raphson) require $O(n^3)$ Hessian inversions, impractical for $n > 10^4$ [62].

  • Constraint Handling: Penalty and barrier methods struggle with non-convex or disjoint feasible regions [?].


Applications



  • Chemical Engineering: Optimising reactor temperature profiles to maximize yield while avoiding runaway reactions [?].

  • Computer Vision: Bundle adjustment in 3D reconstruction minimizes reprojection error, a non-convex NLP [?].


2.3 Classical Algorithms for Deterministic Optimisation


2.3.1 The Simplex Method


Mechanics and Intuition Developed by George Dantzig in 1947, the Simplex Method solves linear programming (LP) problems by traversing the vertices of the feasible polyhedron. It operates on the principle that the optimal solution lies at a vertex of the feasible region [?].


Given an LP in standard form:



$$\min_x \; c^T x \quad \text{s.t. } Ax = b, \;\; x \ge 0$$


the algorithm:



  1. Initialization: Start at a feasible vertex (a basic feasible solution).

  2. Pivoting: Move to an adjacent vertex with a lower objective value.

  3. Termination: Stop when no improving adjacent vertex exists.


Challenges



  • Exponential Worst-Case Complexity: Klee and Minty (1972) constructed problems where Simplex visits all $2^n$ vertices, though such cases are rare in practice [?].

  • Degeneracy: Cycling between vertices with the same objective value stalls progress, requiring anti-cycling rules (e.g., Bland's rule) [?].

  • Sparse vs. Dense Matrices: Performance plummets for dense constraint matrices $A$, as pivot operations scale with $O(m^3)$ [?].


Applications



  • Airlines Revenue Management: American Airlines uses Simplex to allocate seats across fare classes, boosting annual revenue by $4B [?].

  • Oil Refinery Scheduling: ExxonMobil solves million-variable LPs to optimise crude blending and distillation [?].



Example: Solve the LP:



$$\max_{x_1, x_2} \; 3x_1 + 5x_2 \quad \text{s.t. } x_1 + 2x_2 \le 10, \;\; 3x_1 + x_2 \le 12, \;\; x_1, x_2 \ge 0$$

The Simplex Method identifies the optimal vertex at $(x_1, x_2) = (2.8, 3.6)$, where both constraints are binding, yielding a maximal objective value of 26.4 [?].



2.3.2 Branch-and-Bound for Integer Programming


Mechanics and Intuition Branch-and-Bound (B&B) solves integer programming (IP) problems by recursively partitioning the feasible region (branching) and pruning suboptimal regions (bounding). For a binary IP:

















$$\min_x \; c^T x \quad \text{s.t. } Ax \le b, \;\; x \in \{0, 1\}^n$$




the algorithm:



  1. Relaxation: Solve the LP relaxation ($0 \le x \le 1$).

  2. Branching: Split the problem into subproblems by fixing $x_i = 0$ or $x_i = 1$.

  3. Bounding: Discard subproblems whose relaxed objective exceeds the current best integer solution.


Challenges



  • Tree Size Explosion: For $n$ variables, the B&B tree can have $O(2^n)$ nodes, necessitating heuristic variable selection (e.g., branching on the most fractional variable) [86].

















  • Weak LP Bounds: Poor relaxations, e.g., ones whose optimum sits at a highly fractional point such as $x = 0.5$ for a binary variable $x \in \{0, 1\}$, fail to prune subproblems, forcing exhaustive search [?].

  • Parallelization Overhead: Distributing subproblems across CPUs in- troduces communication latency, limiting scalability [?].


Applications



  • Telecom Network Design: Verizon uses B&B to optimise fiber-optic routes, reducing infrastructure costs by 25% [?].

  • Pharmaceutical Scheduling: Pfizer schedules drug trials by solving IPs with B&B, minimizing patient wait times [?].


Example: Solve the knapsack problem:




$$\max_x \; 10x_1 + 15x_2 + 5x_3 \quad \text{s.t. } 2x_1 + 3x_2 + x_3 \le 4, \;\; x_i \in \{0, 1\}$$
















B&B prunes subproblems whose LP bound cannot beat the incumbent and finds the optimum $x = (0, 1, 1)$ with value 20; note that the seemingly attractive $x = (1, 1, 0)$ is infeasible, since its weight $2 + 3$ exceeds the capacity 4 [?].
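A compact sketch of the branch-and-bound recipe on this knapsack instance follows, using the standard fractional (LP-relaxation) bound. It is illustrative, not a production solver.

```python
# Minimal branch-and-bound for the 0/1 knapsack above; the bound is the
# classic fractional-knapsack LP relaxation.
values, weights, cap = [10, 15, 5], [2, 3, 1], 4

def bound(i, w, v):
    """Optimistic value: fill remaining capacity fractionally from item i on."""
    for vi, wi in sorted(zip(values[i:], weights[i:]),
                         key=lambda t: t[0] / t[1], reverse=True):
        take = min(wi, cap - w)
        v += vi * take / wi
        w += take
    return v

best = [0]
def branch(i, w, v):
    if w > cap or bound(i, w, v) <= best[0]:
        return                                     # prune: infeasible or dominated
    if i == len(values):
        best[0] = max(best[0], v)                  # feasible leaf: update incumbent
        return
    branch(i + 1, w + weights[i], v + values[i])   # branch on x_i = 1
    branch(i + 1, w, v)                            # branch on x_i = 0

branch(0, 0, 0)
print("optimal value:", best[0])                   # 20, at x = (0, 1, 1)
```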


2.3.3 Gradient Descent for Nonlinear Optimisation


Mechanics and Intuition Gradient Descent (GD) minimizes differentiable functions $f(x)$ by iteratively moving in the direction of steepest descent:

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where $\alpha$ is the learning rate. For constrained problems, projected GD modifies updates to maintain feasibility:

$$x_{k+1} = \Pi_X\!\left(x_k - \alpha \nabla f(x_k)\right)$$

where $\Pi_X$ projects $x$ onto the feasible set $X$.


Challenges



  • Local Minima: Non-convex landscapes (e.g., neural network training) trap GD in suboptimal solutions [31].

  • Step Size Tuning: Poor $\alpha$ choices cause oscillations (too large) or stagnation (too small). Adaptive methods (e.g., Adam) mitigate this [46].

  • Curvature Ignorance: GD struggles with ill-conditioned Hessians (e.g., valleys with steep walls and flat floors), requiring preconditioning [?].



Applications



  • Machine Learning: Training deep neural networks via SGD (stochastic GD) on losses like cross-entropy [31].

  • Power Systems: GD optimises voltage control in grids to minimize transmission losses [?].


Example: Minimize $f(x) = x^4 - 3x^3 + 2$ with GD. Starting at $x_0 = 0$, $\alpha = 0.1$:

$$x_1 = 0 - 0.1 \cdot (0) = 0 \quad \text{(stationary point)}$$

This fails due to the zero gradient, highlighting the need for second-order methods or randomized initialization [?].
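The sketch below reproduces this failure mode numerically and shows that a small non-zero starting point, a cheap stand-in for the randomized initialization mentioned above, escapes the stationary point.

```python
# Gradient descent on f(x) = x^4 - 3x^3 + 2, whose derivative is
# f'(x) = 4x^3 - 9x^2 = x^2 (4x - 9): zero at x = 0 (flat) and x = 2.25 (minimum).
f_prime = lambda x: 4 * x**3 - 9 * x**2

def gradient_descent(x, alpha=0.01, steps=5000):
    for _ in range(steps):
        x -= alpha * f_prime(x)
    return x

print(gradient_descent(0.0))   # stays at 0.0: the zero-gradient trap
print(gradient_descent(0.1))   # escapes and converges near the minimizer 2.25
```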


2.3.4 Interior-Point Methods


Mechanics and Intuition Interior-Point Methods (IPMs) solve LPs and convex QPs by traversing the interior of the feasible region. For an LP:

$$\min_x \; c^T x \quad \text{s.t. } Ax = b, \;\; x \ge 0$$

IPMs replace inequalities with logarithmic barriers:

$$\min_x \; c^T x - \mu \sum_{i=1}^{n} \ln x_i \quad \text{s.t. } Ax = b$$

As $\mu \to 0$, the solution converges to the LP optimum [?].



Challenges



  • Newton Step Cost: Each iteration solves $O(n^3)$ linear systems, impractical for $n > 10^6$ [?].

  • Central Path Sensitivity: Poorly chosen barrier-parameter sequences cause oscillations or slow convergence [?].

  • Non-Convexity: IPMs fail for non-convex QPs, as the barrier function may introduce local minima [?].



Applications



  • High-Frequency Trading: Goldman Sachs uses IPMs to solve portfolio QPs in milliseconds [11].

  • Structural Engineering: Boeing optimises wing designs with IPMs, balancing stress and weight [?].


Example: Consider a portfolio manager who wants to allocate funds across $n$ assets to minimize risk (variance) while achieving a target return $R$. Let $x_i$ be the fraction of funds allocated to asset $i$, $\mu_i$ the expected return of asset $i$, and $\Sigma$ the covariance matrix of returns. The QP formulation is:

$$\min_x \; \tfrac{1}{2} x^T \Sigma x \quad \text{s.t. } \sum_{i=1}^{n} \mu_i x_i \ge R, \;\; \sum_{i=1}^{n} x_i = 1, \;\; x_i \ge 0 \;\; \forall i \in \{1, \dots, n\}$$

















To solve this using an IPM, we introduce a barrier function for the non-negativity constraints:

$$\min_x \; \tfrac{1}{2} x^T \Sigma x - \mu \sum_{i=1}^{n} \ln x_i \quad \text{s.t. } \sum_{i=1}^{n} \mu_i x_i \ge R, \;\; \sum_{i=1}^{n} x_i = 1$$




The algorithm proceeds as follows:

  1. Initialization: Start with a feasible $x_0$ and $\mu_0 > 0$.

  2. Newton Step: At each iteration, solve the barrier problem for the current $\mu_k$ using Newton's method.

  3. Update: Reduce $\mu_k$ and repeat until convergence [?].
















For instance, with $n = 3$, $\mu = [0.1, 0.2, 0.3]$, and

$$\Sigma = \begin{pmatrix} 0.1 & 0.02 & 0.01 \\ 0.02 & 0.2 & 0.03 \\ 0.01 & 0.03 & 0.3 \end{pmatrix},$$

the IPM converges to $x = [0.4, 0.3, 0.3]$, achieving a risk of 0.12 for a target return of 0.15 [?].
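The following sketch mimics the barrier scheme with off-the-shelf tools: each outer iteration solves the barrier subproblem with SciPy's general-purpose SLSQP solver rather than a hand-rolled Newton step, and the barrier parameter is shrunk geometrically. It illustrates the idea under those substitutions; it is not the exact IPM described above.

```python
import numpy as np
from scipy.optimize import minimize

# Log-barrier portfolio sketch, reusing the 3-asset data from the example.
mu = np.array([0.1, 0.2, 0.3])
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.20, 0.03],
                  [0.01, 0.03, 0.30]])
R = 0.15

def barrier_obj(x, t):
    # 0.5 x' Sigma x  minus the log-barrier on x >= 0
    return 0.5 * x @ Sigma @ x - t * np.sum(np.log(x))

cons = [{"type": "eq",   "fun": lambda x: np.sum(x) - 1},   # budget
        {"type": "ineq", "fun": lambda x: mu @ x - R}]      # target return

x = np.ones(3) / 3                        # feasible starting point
for t in [1e-1, 1e-2, 1e-3, 1e-4]:        # shrink the barrier parameter
    res = minimize(barrier_obj, x, args=(t,), constraints=cons,
                   bounds=[(1e-9, None)] * 3, method="SLSQP")
    x = res.x

print(np.round(x, 3), "variance/2:", round(x @ Sigma @ x / 2, 4))
```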



2.4 Mathematical Framework


2.4.1 The Optimisation Problem: A Formal Blueprint


Every deterministic optimisation problem is a battle between ambition and limitation. Ambition is captured by the objective function, a mathematical expression of what we want to achieve. Limitation is enforced by constraints, which carve out the realm of possibility. Together, they form a structured dialogue between desire and reality [?].

















Formal Definition: A deterministic optimisation problem seeks a vector $x \in \mathbb{R}^n$ that solves:

$$\min_x \; f(x) \quad \text{subject to } g_i(x) \le 0, \; i = 1, \dots, m; \qquad h_j(x) = 0, \; j = 1, \dots, p; \qquad x \in X \subseteq \mathbb{R}^n$$



  • Decision Variables ($x$): The levers we pull. For example, $x_1$ could represent the number of vaccines sent to Clinic A, $x_2$ the budget allocated to Truck B [?].

  • Objective Function ($f(x)$): The scorecard. Minimizing cost, maximizing profit, or balancing risk, all are codified here [?].

















  • Inequality Constraints ($g_i(x) \le 0$): Hard boundaries. Do not exceed the truck's capacity. Keep production costs under $1M [11].

  • Equality Constraints ($h_j(x) = 0$): Perfect balances. Exactly 100% of the budget must be spent. Supply must meet demand [?].

  • Feasible Set ($X$): The universe of allowable solutions. It's the intersection of all constraints, a geometric region shaped by hyperplanes, curves, and inequalities [?].





Example: The Diet Problem A classic LP problem: Minimize the cost of a diet while meeting nutritional needs [?].
















  • Variables: $x_i$ = ounces of food $i$ (e.g., carrots, chicken).

  • Objective: Cost $= \sum_i c_i x_i$, where $c_i$ = cost per ounce of food $i$.

  • Constraints:

    • $\sum_i a_{ji} x_i \ge b_j$ for each nutrient $j$ (e.g., vitamin A $\ge 5000$ IU).

    • $x_i \ge 0$ (you can't eat negative carrots).

This problem, first studied by George Stigler in 1945, illustrates how optimisation translates real-world trade-offs into mathematics [?].



2.4.2 Convexity: The Geometry of Efficiency


Convexity is the unsung hero of optimisation. It transforms nightmares into tractable puzzles [?].




Convex Set: A set $X \subseteq \mathbb{R}^n$ is convex if, for any two points $x, y \in X$, the entire line segment between them lies in $X$:

$$\lambda x + (1 - \lambda) y \in X \quad \forall \lambda \in [0, 1].$$




Why it matters: Convex feasible sets ensure that mixing two feasible solutions yields another feasible solution. Imagine blending two vaccine delivery plans: if both respect truck capacities, so does their weighted average [11].



Convex Function: A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \forall x, y \in \mathbb{R}^n, \; \lambda \in [0, 1].$$

Visual intuition: The line segment between any two points on the graph of $f$ lies above the graph. Convex functions have no hidden valleys, only a single, global minimum [?].



Example:



  • Convex: $f(x) = x^2$, $f(x) = \|x\|$ (norm functions).

  • Non-convex: $f(x) = \sin(x)$, $f(x) = x^3$.



The Power of Convexity:



  • Guaranteed Global Optima: Gradient descent on a convex f will always find the lowest valley [?].

  • Efficient Algorithms: Convex problems can often be solved in polynomial time [11].

  • Duality: Every convex problem has a shadow problem (the dual) that provides bounds and insights [?].


2.4.3 Duality: The Optimisation Mirror


Duality theory reveals that every optimisation problem has a twin, a dual problem, that offers a complementary perspective [11].


Lagrangian Dual:


For the primal problem:

$$\min_x \; f(x) \quad \text{s.t. } g_i(x) \le 0, \;\; h_j(x) = 0,$$

we define the Lagrangian:

$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x),$$

where $\lambda_i \ge 0$ (Lagrange multipliers for inequalities) and $\nu_j \in \mathbb{R}$ (for equalities).




The dual function is:

$$d(\lambda, \nu) = \inf_{x \in X} L(x, \lambda, \nu).$$

The dual problem maximizes $d(\lambda, \nu)$:

$$\max_{\lambda \ge 0, \, \nu} \; d(\lambda, \nu).$$


Interpretation:



  • The dual problem provides a lower bound on the primal's optimal value [?].

  • If the primal is convex and satisfies Slater's condition (strict feasibility), strong duality holds: the primal and dual optimal values coincide [11].


Example: In the diet problem, the dual variables $\lambda_j$ represent the shadow price of each nutrient. If vitamin A's shadow price is $0.10, increasing its requirement by 1 IU would raise the diet's cost by $0.10 [?].


2.4.4 When Geometry Fights Back: Non-Convexity


Non-convex optimisation problems defy the orderly geometry of their convex counterparts. Here, the feasible region is riddled with jagged peaks, disconnected valleys, and saddle points that confound even the most sophisticated algorithms [?]. These problems emerge in high-stakes domains like molecular dynamics, neural network training, and robotics, where the cost of settling for suboptimal solutions can be catastrophic.


Challenges:
















  • Multiple Local Minima: The loss landscape of non-convex functions is a labyrinth of local minima, solutions that appear optimal within a small neighborhood but are globally suboptimal. For example, training a deep neural network involves minimizing a non-convex loss function $L(\theta)$, where $\theta$ represents the network weights. Gradient descent often converges to local minima that yield poor generalization, a phenomenon known as overfitting [31].

  • NP-Hardness: Non-convex problems are frequently NP-hard, meaning their worst-case computational complexity grows exponentially with problem size. Consider the traveling salesman problem (TSP): finding the shortest route visiting $n$ cities is non-convex due to combinatorial constraints. For $n = 50$, there are $49! \approx 6 \times 10^{62}$ possible routes, and no known algorithm can guarantee an exact solution in polynomial time [?]. Similarly, protein folding, a non-convex optimisation task in biophysics, remains unsolved for large molecules despite decades of research [?].



  • No Duality Guarantees: Convex problems enjoy strong duality: the primal and dual solutions coincide, enabling efficient algorithms like dual ascent. In non-convex regimes, duality gaps emerge, rendering these methods unreliable. For instance, in mixed-integer nonlinear programming (MINLP), Lagrangian relaxations often yield weak dual bounds, leaving practitioners blind to the quality of their solutions [?]. This forces reliance on heuristic methods like simulated annealing or genetic algorithms, which lack convergence guarantees [?].


Despite their challenges, non-convex problems dominate cutting-edge applications:



  • Deep Learning: Transformers like GPT-4 train on non-convex losses, achieving human-level performance by exploiting wide valleys in the loss landscape [?].

  • Quantum Computing: Variational quantum eigensolvers (VQEs) opti- mise non-convex Hamiltonians to simulate molecular structures [?].

  • Autonomous Systems: Non-convex trajectory planning enables drones to navigate urban canyons while avoiding collisions [?].


Example: The Rosenbrock Function The Rosenbrock function, a classic non-convex benchmark, illustrates the perils of deceptive minima:

$$f(x, y) = (a - x)^2 + b(y - x^2)^2$$

For $a = 1$, $b = 100$, the global minimum is at $(1, 1)$, but the elongated valley along $y = x^2$ misleads gradient-based methods into slow convergence [?].


Modern Counterstrategies Researchers combat non-convexity with:



  • Global Optimisation: Branch-and-bound and cutting-plane methods for MINLPs [?].

  • Stochastic Methods: SGD with momentum to escape shallow minima [31].

  • Topological Tools: Persistent homology to map loss landscapes [?].



3. Foundations of Neural Networks


3.1 General Introduction


3.1.1 Biological Inspiration vs. Mathematical Abstraction


The term neural network evokes biological neurons, but the connection is superficial. Biological neurons are complex, dynamic cells that communicate via electrochemical signals [55]. Artificial neural networks (ANNs), in contrast, are simplified mathematical models designed for computation [73].



  • Biological Neuron: Receives inputs via dendrites, processes them in the soma, and fires outputs via axons. Communication is sparse, asynchronous, and energy-efficient [?].

  • Artificial Neuron: A mathematical function that computes a weighted sum of inputs, adds a bias, and applies a non-linear activation:

$$y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where $x_i$ are the inputs (e.g., pixel values, sensor readings), $w_i$ the weights (learnable parameters), $b$ the bias (learnable offset), and $\sigma$ the activation function (e.g., ReLU, sigmoid) [36].
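As a minimal sketch, the artificial neuron above is a few lines of NumPy (here with ReLU as $\sigma$; all numbers are placeholders):

```python
import numpy as np

# One artificial neuron: weighted sum, bias, non-linear activation.
def neuron(x, w, b, activation=lambda z: np.maximum(0, z)):
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
print(neuron(x, w, b=0.2))       # a single scalar activation
```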



Key Takeaway: ANNs are not simulations of brains; they are function approximators inspired by neuroscience [31]. Their power lies in composition: stacking layers of neurons to model complex relationships.


3.1.2 The Perceptron: Smallest Building Block


The perceptron, proposed by Frank Rosenblatt in 1958 [72], is the simplest neural network: a single artificial neuron.


Mathematical Formulation Given inputs $x = [x_1, x_2, \dots, x_n]$, weights $w = [w_1, w_2, \dots, w_n]$, and bias $b$, the perceptron computes:

$$z = w^T x + b \;\; \text{(weighted sum)}, \qquad y = \sigma(z) \;\; \text{(activation)}.$$

For binary classification, $\sigma$ is often a step function:

$$\sigma(z) = \begin{cases} 1 & \text{if } z \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$


Limitations



  • Linear Separability: Perceptrons can only solve problems where classes are separable by a hyperplane. For example, they learn AND/OR logic gates but fail at XOR [58].

  • Static Behavior: No capacity to adapt without explicit weight updates (a limitation addressed by later innovations like backpropagation [73]).


3.1.3 Neural Networks as Mathematical Functions


A neural network is a parameterized function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$, where $\theta$ represents learnable parameters (weights and biases). It transforms inputs $x \in \mathbb{R}^n$ into outputs $y \in \mathbb{R}^m$ through a sequence of layered computations [19].


Anatomy of a Neural Network



  • Input Layer: Receives raw data $x = [x_1, x_2, \dots, x_n]$.

  • Hidden Layers: Intermediate layers that apply non-linear transformations [40].

  • Output Layer: Produces predictions $y = [y_1, y_2, \dots, y_m]$.


Mathematical Formulation For a network with $L$ layers, the output of layer $l$ is computed as:

$$h^{(l)} = \sigma^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$$

where $h^{(l)}$ is the output of layer $l$ (with $h^{(0)} = x$), $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix connecting layer $l-1$ to layer $l$, $b^{(l)} \in \mathbb{R}^{d_l}$ is the bias vector for layer $l$, and $\sigma^{(l)}$ is the non-linear activation function for layer $l$.



Example: 2-Layer Network A network with one hidden layer (3 neurons) and one output neuron computes:

$$\text{Hidden layer: } h^{(1)} = \sigma\!\left(W^{(1)} x + b^{(1)}\right), \quad W^{(1)} \in \mathbb{R}^{3 \times n}, \; b^{(1)} \in \mathbb{R}^3$$

$$\text{Output layer: } y = \sigma_{\text{out}}\!\left(W^{(2)} h^{(1)} + b^{(2)}\right), \quad W^{(2)} \in \mathbb{R}^{1 \times 3}, \; b^{(2)} \in \mathbb{R}$$
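A NumPy sketch of this 2-layer network, assuming $n = 4$ inputs, ReLU hidden units, and a sigmoid output; the weights are untrained random placeholders:

```python
import numpy as np

# Forward pass of the 2-layer network above (4 -> 3 -> 1).
rng = np.random.default_rng(0)
n = 4
W1, b1 = rng.normal(size=(3, n)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = rng.normal(size=n)
h1 = relu(W1 @ x + b1)          # hidden layer
y = sigmoid(W2 @ h1 + b2)       # output layer
print(y)
```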



Key Components



  • Weights (W): Determine the strength of connections between neurons [9].

  • Biases (b): Shift the activation threshold of each neuron.

  • Activation Functions (?): Introduce non-linearity (see Table 1) [60].














Table 1: Common activation functions and their properties.

Function | Formula | Range
Sigmoid  | $1 / (1 + e^{-x})$ | $(0, 1)$
ReLU     | $\max(0, x)$ | $[0, \infty)$
Tanh     | $(e^x - e^{-x}) / (e^x + e^{-x})$ | $(-1, 1)$




Why Non-Linearity? Without activation functions, a neural network reduces to a linear model, regardless of depth [31]:

$$y = W^{(L)} W^{(L-1)} \cdots W^{(1)} x + b = W_{\text{eff}} x + b$$

Non-linear activations break this linearity, enabling the network to approximate complex functions (as guaranteed by the Universal Approximation Theorem [40]).





Parameter Count The total number of parameters in a network is:

$$\text{Params} = \sum_{l=1}^{L} \left(d_l \, d_{l-1} + d_l\right)$$

For example, a network with layers [4, 5, 3] (input, hidden, output) has $(4 \cdot 5 + 5) + (5 \cdot 3 + 3) = 25 + 18 = 43$ parameters [?].
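The formula is easy to check in code; a minimal sketch:

```python
# Parameter count for a fully connected network with the layer sizes above.
def param_count(dims):
    # Each layer l contributes d_l * d_{l-1} weights plus d_l biases.
    return sum(dims[l] * dims[l - 1] + dims[l] for l in range(1, len(dims)))

print(param_count([4, 5, 3]))   # 43
```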



3.1.4 Learning Mechanism: Training a Neural Network


Training a neural network involves adjusting $\theta = \{W, b\}$ to minimize a loss function $L(\theta)$, which measures prediction error [9].


Loss Functions



  • Mean Squared Error (MSE):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2$$

Used for regression tasks (e.g., predicting house prices).

  • Cross-Entropy:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log\!\left(f_\theta(x_i)_c\right)$$

Used for classification tasks (e.g., image recognition) [31].





Gradient Descent The weights are updated iteratively using gradients of the loss:

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}$$

where $\eta$ is the learning rate [?].




Backpropagation Backpropagation computes gradients efficiently using the chain rule [73]:

$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial h^{(L)}} \cdot \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdots \frac{\partial h^{(l+1)}}{\partial h^{(l)}} \cdot \frac{\partial h^{(l)}}{\partial W^{(l)}}$$

Example: Training a Perceptron Consider a perceptron learning the AND gate:



  • Training data: $(x_1, x_2) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$, $y \in \{0, 0, 0, 1\}$.

  • Initialize $w_1, w_2, b$ randomly (e.g., $w_1 = 0.5$, $w_2 = 0.5$, $b = -0.7$).

  • Update the weights with gradient descent until predictions match the targets, as in the sketch below [?].
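A runnable sketch of this training loop, using the classic perceptron update rule (a special case of the gradient-based updates discussed later):

```python
import numpy as np

# AND-gate perceptron from the example above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b, lr = np.array([0.5, 0.5]), -0.7, 0.1

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = int(np.dot(w, xi) + b >= 0)   # step activation
        w += lr * (yi - pred) * xi           # perceptron update rule
        b += lr * (yi - pred)

print([int(np.dot(w, xi) + b >= 0) for xi in X])   # [0, 0, 0, 1]
```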



3.2 Neural Network Architectures


3.2.1 Feedforward Neural Networks (FNNs)


Feedforward Neural Networks (FNNs) are the simplest and most universal architecture, where information flows unidirectionally from input to output without cycles [31].


Mathematical Structure For an FNN with $L$ layers:

$$h^{(l)} = \sigma^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad l = 1, 2, \dots, L$$

  • $h^{(0)} = x \in \mathbb{R}^n$: Input vector.

  • $h^{(L)} = y \in \mathbb{R}^m$: Output vector.

  • Hidden layers $h^{(1)}, \dots, h^{(L-1)}$ progressively transform the input [9].


Applications



  • Regression: Predicting house prices from features like square footage and location [?].

  • Classification: Identifying spam emails from word frequency vectors [31].


Limitations



  • Fixed Input Size: FNNs require inputs of fixed dimension (e.g., a 784-pixel MNIST image).

  • No Temporal Awareness: Cannot model sequences or time-dependent data [39].


3.2.2 Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs) exploit spatial hierarchies through localized filters, making them ideal for grid-like data (e.g., images, audio spectrograms) [48].


Mathematical Structure A convolutional layer applies $K$ filters to input $X \in \mathbb{R}^{H \times W \times C}$:

$$Z_k = \sum_{c=1}^{C} W_{k,c} * X_c + b_k, \quad k = 1, \dots, K$$

where $*$ is the 2D convolution operation, $W_{k,c} \in \mathbb{R}^{h \times w}$ is the kernel for filter $k$ and input channel $c$, and $Z_k \in \mathbb{R}^{H' \times W'}$ is the output feature map [?].



Pooling Pooling layers (e.g., max-pooling) downsample feature maps:

$$P_{i,j} = \max_{(p, q) \in \mathcal{N}(i, j)} Z_{p, q}$$

where $\mathcal{N}(i, j)$ is a local neighborhood (e.g., a $2 \times 2$ window) [?].



Applications



  • Image Classification: Identifying objects in photos (e.g., ResNet [?], VGG [?]).

  • Medical Imaging: Detecting tumors in MRI scans [26].


Limitations



  • Translation Invariance: Assumes local patterns are position-agnostic, which may not hold for non-grid data [31].

  • High Memory Cost: Storing 3D feature maps for large inputs demands significant memory [48].


3.2.3 Recurrent Neural Networks (RNNs)


Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that encodes temporal dependencies [73].


Mathematical Structure At time step $t$, an RNN updates its hidden state $h_t$ and output $y_t$:

$$h_t = \sigma\!\left(W_h h_{t-1} + W_x x_t + b_h\right), \qquad y_t = W_y h_t + b_y$$

  • $x_t \in \mathbb{R}^d$: Input at time $t$.

  • $h_t \in \mathbb{R}^h$: Hidden state [39].
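A minimal NumPy sketch of this recurrence, unrolled over a short hypothetical sequence ($d = 3$, $h = 5$; weights are random placeholders):

```python
import numpy as np

# One RNN cell applied step by step to a sequence of 4 time steps.
rng = np.random.default_rng(0)
d, h = 3, 5
Wh, Wx = rng.normal(size=(h, h)) * 0.1, rng.normal(size=(h, d)) * 0.1
bh = np.zeros(h)
Wy, by = rng.normal(size=(1, h)), np.zeros(1)

h_t = np.zeros(h)                       # initial hidden state
for x_t in rng.normal(size=(4, d)):     # iterate over time steps
    h_t = np.tanh(Wh @ h_t + Wx @ x_t + bh)
    y_t = Wy @ h_t + by
print(y_t)                              # output at the final step
```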


Variants



  • LSTM: Introduces gating mechanisms to mitigate vanishing gradients [39].

  • GRU: Simplified gated architecture with fewer parameters [?].


Applications



  • Machine Translation: Converting English to French via sequence-to-sequence models [?].

  • Time Series Forecasting: Predicting stock prices or energy demand [?].



Limitations



  • Vanishing Gradients: Difficulty learning long-range dependencies (partially solved by LSTMs/GRUs) [?].

  • Sequential Computation: Cannot parallelize across time steps, slowing training [83].


3.2.4 Transformers


Transformers, introduced by [83], replace recurrence with self-attention to model global dependencies in parallel.
















Self-Attention Mechanism For input sequence $X \in \mathbb{R}^{n \times d}$, compute queries $Q$, keys $K$, and values $V$:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

  • $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$: Learnable projection matrices [83].
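A NumPy sketch of single-head scaled dot-product self-attention as defined above (the dimensions are arbitrary placeholders):

```python
import numpy as np

# Scaled dot-product self-attention: n = 4 tokens, d = 8 dims, d_k = 8.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
WQ, WK, WV = (rng.normal(size=(d, d_k)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV
scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
out = weights @ V                                  # (n, d_k) attended values
print(out.shape)
```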


Applications



  • Language Modeling: GPT-4 generates human-like text via autoregressive transformers [64].

  • Protein Folding: AlphaFold predicts 3D protein structures using attention mechanisms [44].


Limitations



  • Quadratic Complexity: Self-attention scales as $O(n^2)$, prohibitive for long sequences [?].

  • Lack of Inductive Bias: Requires massive data to learn spatial/temporal relationships from scratch [?].



3.2.5 Summary of Architectures


Each architecture encodes a unique inductive bias, shaping how it learns and generalizes.


Table 2: Key properties of neural network architectures.

Architecture | Input Type | Strengths | Weaknesses
FNN          | Fixed-size vectors | Universality, simplicity | No spatial/temporal reasoning
CNN          | Grids (images, audio) | Spatial hierarchy | Translation invariance assumption
RNN          | Sequences | Temporal dependency | Sequential computation
Transformer  | Sequences | Parallelism, global context | Quadratic complexity



This table summarizes the key properties of the architectures discussed in this section. Feedforward Neural Networks (FNNs) excel at handling fixed-size inputs but lack spatial or temporal reasoning. Convolutional Neural Networks (CNNs) exploit spatial hierarchies in grid-like data but assume translation invariance. Recurrent Neural Networks (RNNs) model temporal dependencies but suffer from sequential computation. Transformers, while powerful, scale quadratically with input length, making them computationally expensive for long sequences [83].



3.3 Training Neural Networks


Training a neural network involves adjusting its parameters (weights and biases) to minimize a loss function, which quantifies the discrepancy between predictions and targets [9]. This section explores the key components of training: loss functions, gradient-based optimisation, and regularization.


3.3.1 Loss Functions
















A loss function $L(\theta)$ measures how well the network's predictions $\hat{y} = f_\theta(x)$ match the true targets $y$.


Mean Squared Error (MSE) For regression tasks, MSE is commonly used [?]:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $N$ is the number of training examples, $y_i$ the true target for example $i$, and $\hat{y}_i$ the predicted output for example $i$.

Cross-Entropy Loss For classification tasks, cross-entropy loss is preferred [31]:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

where $C$ is the number of classes, $y_{i,c}$ a binary indicator (1 if example $i$ belongs to class $c$, else 0), and $\hat{y}_{i,c}$ the predicted probability of class $c$ for example $i$.
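Both losses are one-liners in NumPy; a sketch (the small epsilon guards against $\log(0)$ and is an implementation convenience, not part of the definition):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot targets (N, C); y_pred: predicted probabilities (N, C)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(mse(np.array([3.0, 5.0]), np.array([2.5, 4.0])),
      cross_entropy(y_true, y_pred))
```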



Custom Loss Functions In some cases, domain-specific loss functions are designed to capture unique objectives. For example, in medical diagnosis, false negatives (missed diagnoses) may be penalized more heavily than false positives [26].



3.3.2 Gradient-Based Optimisation


Gradient-based optimisation adjusts the network's parameters $\theta$ to minimize the loss function $L(\theta)$ [?].

Gradient Descent The parameters are updated iteratively using gradients of the loss:

$$\theta_{k+1} = \theta_k - \eta \nabla_\theta L(\theta_k)$$

  • $\eta$: Learning rate (controls step size).

  • $\nabla_\theta L(\theta_k)$: Gradient of the loss with respect to $\theta$.



Stochastic Gradient Descent (SGD) Instead of computing gradients over the entire dataset, SGD uses mini-batches [10]:

$$\theta_{k+1} = \theta_k - \eta \nabla_\theta L_{\text{batch}}(\theta_k)$$

  • $L_{\text{batch}}$: Loss computed over a subset of the data.

  • Benefits: Faster convergence, reduced memory usage.



Adaptive Optimisers Modern optimisers like Adam and RMSProp adapt the learning rate dynamically [46]:

$$\theta_{k+1} = \theta_k - \eta \cdot \text{AdaptiveTerm}\!\left(\nabla_\theta L(\theta_k)\right)$$

  • Adam combines momentum and adaptive learning rates for robust training [46]; a sketch of its update rule follows.
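A bare-bones sketch of the Adam update, following the standard formulation with its commonly cited default hyperparameters; here it minimizes a toy quadratic, not a real network loss.

```python
import numpy as np

# Adam: moment estimates, bias correction, per-coordinate adaptive step.
def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad**2         # second-moment (scale) estimate
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):                     # minimize f(theta) = ||theta||^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)                                # approaches [0, 0]
```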


3.3.3 Backpropagation


Backpropagation computes gradients efficiently using the chain rule of calculus [73].


Forward Pass Compute the network's output $y$ from input $x$:

$$h^{(l)} = \sigma^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad l = 1, \dots, L$$

with $h^{(0)} = x$ (input) and $h^{(L)} = y$ (output).



Backward Pass Compute gradients layer-by-layer, starting from the output [31]:

$$\frac{\partial L}{\partial h^{(l)}} = \frac{\partial L}{\partial h^{(l+1)}} \cdot \frac{\partial h^{(l+1)}}{\partial h^{(l)}}$$

  • Propagate errors backward to update the weights and biases.



Example: Training a Perceptron Consider a perceptron learning the AND gate [?]:



  • Training data: $(x_1, x_2) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$, $y \in \{0, 0, 0, 1\}$.

  • Initialize $w_1, w_2, b$ randomly (e.g., $w_1 = 0.5$, $w_2 = 0.5$, $b = -0.7$).

  • Update the weights with gradient descent until predictions match the targets.


3.3.4 Regularization and Generalization


Regularization techniques prevent overfitting, where the network memorizes training data instead of generalizing to unseen examples [76].


L2 Regularization (Weight Decay) Adds a penalty on large weights to the loss function [31]:

$$L_{\text{reg}}(\theta) = L(\theta) + \lambda \sum_i w_i^2$$

  • $\lambda$: Regularization strength.


Dropout Randomly deactivates neurons during training to encourage robustness [76]:

$$h^{(l)} = \text{Dropout}\!\left(\sigma^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)\right)$$

  • Dropout rate: Fraction of neurons deactivated (e.g., 0.5).
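A sketch of the common inverted-dropout variant, which rescales surviving activations at training time so no correction is needed at test time (a widespread implementation choice, not the only one):

```python
import numpy as np

# Inverted dropout: zero out activations with probability `rate`, rescale rest.
def dropout(h, rate=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return h                      # identity at test time
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)    # rescale so E[output] matches test time

h = np.array([0.3, 1.2, -0.7, 2.0, 0.9])
print(dropout(h))   # roughly half the activations zeroed, the rest doubled
```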



Early Stopping Halts training when validation loss stops improving, preventing overfitting [69].



3.4 Challenges and Limitations of Neural Networks


3.4.1 Overfitting and Generalization


Overfitting occurs when a neural network learns to memorize training data instead of generalizing to unseen examples [31].


Solutions



  • Regularization: Techniques like dropout [76] and weight decay [?] pe- nalize overly complex models.

  • Data Augmentation: Artificially expanding the dataset (e.g., rotating images) improves generalization [?].

  • Early Stopping: Halting training when validation performance plateaus prevents overfitting [69].


3.4.2 Vanishing and Exploding Gradients


In deep networks, gradients can vanish (become too small) or explode (become too large), making training unstable [?].


Solutions



  • ReLU Activation: Mitigates vanishing gradients by avoiding saturation [60].

  • Batch Normalization: Stabilizes training by normalizing layer inputs [41].

  • Gradient Clipping: Prevents exploding gradients by capping their magnitude [65].


3.4.3 Computational Cost


Training deep neural networks requires significant computational resources [77].


Solutions



  • Model Pruning: Removing unnecessary weights reduces computational load [?].

  • Quantization: Using lower precision (e.g., 8-bit instead of 32-bit) speeds up computation [?].




  • Distributed Training: Splitting workloads across multiple GPUs or nodes reduces training time [21].


3.4.4 Interpretability and Explainability


Neural networks are often criticized as "black boxes" due to their lack of interpretability [?].


Solutions



  • Saliency Maps: Highlight input features that influence predictions [?].

  • SHAP Values: Quantify the contribution of each feature to the output [?].

  • Attention Mechanisms: Provide insights into which parts of the input the model focuses on [83].


3.4.5 Data Dependency


Neural networks require large amounts of labeled data [?].


Solutions



  • Transfer Learning: Leveraging pre-trained models reduces the need for large datasets [?].

  • Semi-Supervised Learning: Combining labeled and unlabeled data im- proves performance [?].

  • Synthetic Data Generation: Creating artificial data to supplement real-world datasets [32].



3.4.6 Robustness and Adversarial Attacks


Neural networks are vulnerable to adversarial examples: small perturbations that cause misclassification [32].



Solutions



  • Adversarial Training: Training on adversarial examples improves robustness [?].

  • Robust Optimisation: Modifying the loss function to penalize vulnerable predictions [79].

  • Defensive Distillation: Training a second model to smooth decision boundaries [?].





3.5 Neural Networks in the Wild


3.5.1 Image Recognition


Modern image recognition systems leverage deep learning to achieve superhuman performance [?].



Applications



  • Medical Imaging: CNNs detect diabetic retinopathy in retinal scans with 94% accuracy [34]; Google DeepMind's system has been deployed in India and Thailand [?].

  • Autonomous Vehicles: YOLO networks process real-time video feeds [?]; Tesla's Autopilot uses 48 neural networks [?].

  • Satellite Imagery: - U-Net architectures segment deforestation patterns [?].



Architectural Evolution



  • ResNet: Residual connections enabled training of 1,000+ layer networks [?].

  • Vision Transformers: Patch-based attention models like ViT [?].





3.5.2 Natural Language Processing


Neural networks have redefined how machines understand and generate human language [83].



Applications



  • Machine Translation: Transformer-based models (e.g., Google's M4) [83]; real-time sign language-to-text conversion [?].

  • Content Generation: - GPT-4 writes scientific abstracts [64]. - AI Dungeon generates interactive stories [?].

  • Sentiment Analysis: - BERT models classify customer reviews [?].



Breakthrough Models



  • Attention Is All You Need: Introduced self-attention [83].

  • PaLM 2: 540-billion parameter model [?].





3.5.3 Limitations in Practice


Despite their success, neural networks face critical barriers in real-world deployment [?].



Data Hunger



  • Rare Diseases: Training cancer detection models requires aggregated datasets [26].

  • Cost: Labeling 1 hour of Lidar data costs $3,200 [?].



Interpretability Crisis



  • Healthcare: Radiologists reject black-box AI tools [?].

  • Legal Risks: The EU's AI Act mandates explainability [?].



Brittleness



  • Adversarial Attacks: Fawkes algorithm cloaks faces [?].

  • Environmental Impact: GPT-3 emitted 552 tons of CO2 [77].




4. Bridging Deterministic Optimisation and Deep Learning Theoretically



At its core, training a neural network is an exercise in navigating complex, high-dimensional landscapes, a task that demands both mathematical precision and intuitive adaptations of classical optimisation principles [31]. This subsection re-examines foundational concepts from deterministic optimisation through the lens of deep learning, illuminating how centuries-old theories underpin modern neural network training [11].



4.1 Optimisation Fundamentals for Deep Learning


4.1.1 Convexity Reimagined: The Geometry of Loss Landscapes
















The concept of convexity, long a cornerstone of classical optimisation, takes on new meaning in the context of deep learning [62]. A function f : ℝ^d → ℝ is convex if every line segment between two points on its graph lies above the graph itself, a property guaranteeing that gradient descent will find the global minimum [13]. Yet neural network loss functions, shaped by non-linear activations and layered transformations, are decidedly non-convex [17].



The Paradox of Non-Convex Success Why then do gradient-based methods succeed so spectacularly in practice? Two key insights emerge [87]:



  • Overparameterisation as Sanctuary: Modern networks contain more parameters than training samples [?]. This excess capacity creates a convex-like geometry where local minima are often global [24], effectively sidestepping traditional non-convex pitfalls.

  • Initialisation as Compass: Techniques like Glorot initialisation [30], which scales weights inversely with layer dimensions, position networks in regions where gradient flow remains stable [38]. This careful starting point, akin to choosing favourable coordinates for a mountainous trek, avoids pathological regions where optimisation might stall.




Activation Functions' Advantage Consider the rectified linear unit (ReLU), defined as:


σ(z) = max(0, z).


Unlike saturating functions such as the sigmoid, ReLU preserves gradient information across much of its domain [60]. This property, while introducing non-convexity, paradoxically simplifies optimisation by maintaining informative gradients, a critical factor in training deep architectures.
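
This advantage can be made concrete numerically. The illustrative snippet below (not from the source; the depth and pre-activation value are arbitrary) compares how a gradient attenuates through 20 layers under sigmoid versus ReLU derivatives:

import numpy as np

depth = 20
z = 1.0  # a positive pre-activation, where ReLU is active

sigmoid_grad = lambda z: np.exp(-z) / (1 + np.exp(-z)) ** 2  # at most 0.25
relu_grad = lambda z: float(z > 0)                           # exactly 1 here

print("sigmoid:", sigmoid_grad(z) ** depth)  # ~1e-14, effectively vanished
print("relu:   ", relu_grad(z) ** depth)     # 1.0, fully preserved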





2.1.2 Gradient Descent: From Theory to Practice
















Gradient descent, first conceived by Cauchy in 1847, finds renewed purpose in neural network training [14]. Let L(θ) represent the loss function, where θ denotes the network parameters, and η > 0 the learning rate.



Evolution of an Algorithm



  • Vanilla Gradient Descent:


θ_{t+1} = θ_t − η∇L(θ_t)


While elegant in theory, this formulation proves impractical for large networks due to computational demands [10], a limitation reminiscent of early linear programming challenges.



  • Stochastic Gradient Descent (SGD): By computing gradients on mini-batches B ⊂ D:


θ_{t+1} = θ_t − η∇L_B(θ_t),


SGD introduces controlled noise. This stochasticity, far from a bug, becomes a feature [37], helping escape sharp minima that plague deterministic methods.



  • Momentum: Physics Meets Optimisation: Borrowing from Newtonian mechanics [67], momentum accumulates velocity in persistent directions:


v_{t+1} = βv_t + (1 − β)∇L(θ_t),   θ_{t+1} = θ_t − ηv_{t+1}.
















The momentum coefficient β ∈ [0, 1), typically set around 0.9, determines how quickly past gradients decay, a crucial hyperparameter in training deep networks like ResNets [?].
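
The three update rules can be summarised in a compact NumPy sketch. Here grad_fn and grad_batch_fn are assumed callables returning the full gradient and a mini-batch estimate respectively; the step counts and rates are illustrative.

import numpy as np

def vanilla_gd(theta, grad_fn, eta=0.1, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad_fn(theta)        # full-gradient step
    return theta

def sgd(theta, grad_batch_fn, eta=0.1, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad_batch_fn(theta)  # noisy mini-batch step
    return theta

def momentum(theta, grad_fn, eta=0.1, beta=0.9, steps=100):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(theta)  # velocity accumulation
        theta = theta - eta * v
    return theta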



Convergence in Hostile Territory For non-convex objectives, gradient descent guarantees convergence only to stationary points [50]. Yet in practice, these often prove sufficient for neural networks [17], a phenomenon later sections will explore through modern theoretical lenses like the Neural Tangent Kernel [42].





2.1.3 Lagrange Multipliers: Hidden Constraints in Neural Networks


Though rarely framed explicitly, constrained optimisation permeates neural network training through regularisation techniques [9]. Consider the generic constrained problem:
















min_θ L(θ)   subject to   g_i(θ) ≤ 0,   i = 1, . . . , m.



The Duality of Regularisation Weight decay, commonly implemented as ℓ2-regularisation, can be reinterpreted through Lagrange duality [82]:


min_θ L(θ) + λ‖θ‖².



Here, λ functions as a Lagrange multiplier, implicitly constraining parameter magnitudes. This dual perspective reveals regularisation as a form of soft constraint enforcement [11], a concept familiar from portfolio optimisation (Chapter ??).
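
This duality is visible in standard libraries. The sketch below assumes PyTorch's SGD semantics, where weight_decay adds wd·θ to every gradient, so the explicit penalty λ‖θ‖² corresponds to wd = 2λ; the model and values are illustrative.

import torch

model = torch.nn.Linear(10, 1)
lam = 1e-4  # plays the role of the Lagrange multiplier λ

# Option 1: explicit soft-constraint penalty added to the loss.
def penalised_loss(pred, target):
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return torch.nn.functional.mse_loss(pred, target) + lam * l2

# Option 2: the same constraint folded into the optimiser update.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=2 * lam)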



Projection in Practice Explicit constraints occasionally surface in neural networks. For example, non-negative weight constraints in interpretability-focused models employ projected gradient descent [3]:


θ_{t+1} = Π_Θ(θ_t − η∇L(θ_t)),


where Π_Θ denotes Euclidean projection onto the feasible set Θ. Similar logic underpins gradient clipping [65], preventing parameter updates from destabilising training.
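
A minimal sketch of projected gradient descent onto the non-negative orthant, combined with norm-based gradient clipping, might look as follows; grad_fn is an assumed callable and the hyperparameters are illustrative.

import numpy as np

def projected_gd(theta, grad_fn, eta=0.1, steps=100, clip=1.0):
    for _ in range(steps):
        g = grad_fn(theta)
        norm = np.linalg.norm(g)
        if norm > clip:                           # gradient clipping
            g = g * (clip / norm)
        theta = np.maximum(theta - eta * g, 0.0)  # Π_Θ: clamp onto θ ≥ 0
    return theta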





2.2 How Deep Learning Challenges Classical Optimisation


Deep learning has not only borrowed from classical optimisation but has also reshaped it [49], introducing new challenges and insights. This subsection explores the theoretical interplay between deep learning and optimisation, highlighting how neural networks have redefined traditional paradigms [74].



2.2.1 The Non-Convexity Paradox: Why Deep Learning Works


Classical optimisation theory warns against non-convex landscapes [62], yet deep learning thrives in such environments [87]. This apparent paradox is rooted in the unique geometry of high-dimensional spaces.



High-Dimensional Landscapes In high dimensions, local minima are rare, and saddle points dominate [17]. However, gradient-based methods like SGD rarely converge to strict saddles [50], as their unstable directions occupy negligible volume.



Overparameterisation and Implicit Regularisation Overparameterised networks exhibit a convex-like loss landscape [?], where local minima are often global [24], and SGD converges to solutions that generalise well [87].



The Role of Depth Depth amplifies these effects [25]. Deep networks partition input space into piecewise linear regions, creating a hierarchical structure that gradient descent navigates efficiently [59].






2.2.2 Implicit Regularisation: The Hidden Hand of Optimisation


Optimisation algorithms in deep learning do more than minimise loss; they shape the solution itself [?].



SGD and Flat Minima SGD exhibits a preference for flat minima [45], which generalise better than sharp ones [39]. This bias has been formalised through PAC-Bayes theory [56].



The Role of Noise The noise introduced by SGD's mini-batches acts as a regulariser [37], preventing overfitting.





Adaptive Methods and Their Biases Adaptive optimisers like Adam [46] favour directions of low curvature, promoting flat minima at the cost of interpretability [85].






2.2.3 Gradient Flow and Mean-Field Theory


The training dynamics of deep networks can be modelled as a gradient flow [57].



Mean-Field Limits In the limit of infinite width, networks behave like kernel machines governed by the Neural Tangent Kernel (NTK) [42].



Finite-Width Effects Finite-width networks exhibit stochastic deviations from the mean-field limit [51].



Connections to Statistical Mechanics Gradient flow draws parallels with statistical mechanics [?], inspiring techniques like Langevin dynamics [84].






2.2.4 Open Questions and Future Directions


Despite progress, many questions remain unanswered [?].



The Role of Initialisation Why do certain initialisations (e.g., Glorot [30]) lead to faster convergence?



Optimisation in the Presence of Noise How can algorithms harness noise from gradients and hardware [75]?



Beyond First-Order Methods Can second-order methods be adapted for modern networks [54]?





2.3 Challenges at the Optimisation-Deep Learning Interface


The theoretical overlap of deep learning and classical optimisation has yielded advances, but challenges persist [?].



2.3.1 Generalisation and Optimisation Dynamics


The relationship between optimisation trajectories and generalisation remains enigmatic [87].



The Double Descent Phenomenon Test error decreases as models grow beyond the interpolation threshold [6], challenging classical theory.



Sharpness-Aware Minimisation (SAM) SAM explicitly optimises for flat minima [28]:
















θ_{t+1} = θ_t − η ∇ max_{‖ε‖≤ρ} L(θ_t + ε).
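
A two-step SAM update can be sketched in PyTorch as follows. This is an illustrative implementation of the update rule above, not a reference implementation; model, loss_fn, and the hyperparameters are assumed.

import torch

def sam_step(model, loss_fn, x, y, eta=0.01, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]

    # Step 1: ascend to the worst nearby weights, i.e. find ε with ‖ε‖ ≤ ρ.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
    eps = [g * scale for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                  # move to θ + ε

    # Step 2: the gradient at the perturbed point drives the actual update.
    loss_adv = loss_fn(model(x), y)
    grads_adv = torch.autograd.grad(loss_adv, params)
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads_adv):
            p.sub_(e)                  # return to θ
            p.sub_(eta * g)            # θ ← θ − η∇L(θ + ε)
    return loss_adv.item()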



Grokking: Delayed Generalisation Networks suddenly transition from memorisation to understanding after prolonged training [68].






2.3.2 Stochastic Optimisation Revisited


Stochasticity demands new perspectives [10].



Mini-Batch Noise as Lévy Processes Empirical gradients exhibit heavy tails [75]:


dθ_t = −∇L(θ_t)dt + σ dL_t.



Convergence in Non-Smooth Settings Subgradient methods require extension for ReLU networks [20].



Adaptive Methods Under Scrutiny Adam dominates practice but lacks theoretical grounding [71].






2.3.3 Scaling Laws and Distributed Optimisation


Large models necessitate distributed optimisation [21].





Communication Complexity Federated learning incurs O(KT) communication cost [47].

















Model Parallelism as Tensor Decomposition Splitting networks induces tensor factorisation [63]:


min_{U,V} ‖W − U ⊗ V‖²_F.





The Memory-Speed Trade-Off Compromises include gradient checkpointing [?] and mixed-precision training [?].






2.3.4 Frontiers in Optimisation Theory


Deep learning exposes gaps in classical theory [?].



Quantum-Inspired Optimisation Quantum annealing may escape local minima [2].



Neural Differential Equations Networks as discretised differential equations [15].



Symmetry in Learning Dynamics Extending Noether's theorem for equivariant networks [18].





2.4 Theoretical Roadmap for DNN-Driven Optimisation


This subsection outlines the theoretical framework focusing on bridging DNNs and optimisation [7].



2.4.1 Methodological Foundations


The experimental progression mirrors historical evolution [78].



Problem Hierarchy



  • Linear Programming: Validate basic DNN viability [?].

  • Quadratic Programming: Test smooth non-linearities [3].

  • Mixed-Integer LP: Learn combinatorial logic [29].

  • Non-Linear Programming: Probe non-convex limits [?].

  • PDE-Constrained Problems: Integrate physics [70].



Architectural Evolution



  • LP/QP: ReLU networks [60].

  • MILP: GNNs [29].

  • PDE-Constrained: PINNs [70].



Training Protocol



  • Data Generation: Synthesise via Gurobi/CPLEX [35].

















  • Loss Design: Augment MSE with Lagrangian penalties [3] (a minimal sketch follows this list):

L(θ) = ‖y − f_θ(x)‖² + λ Σ_i max(0, g_i(f_θ(x)))²



  • Benchmarking: Compare speedup and optimality gap [?].
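
As noted in the Loss Design item, the penalty-augmented loss might be implemented as follows. This is a sketch for a single instance with linear inequality constraints Ax ≤ b; the names and the penalty weight are illustrative.

import torch

def optimisation_loss(pred, target, A, b, lam=10.0):
    # MSE against the pre-solved optimum, plus squared constraint violations.
    mse = torch.nn.functional.mse_loss(pred, target)
    violation = torch.relu(A @ pred - b)   # elementwise max(0, g_i(f_θ(x)))
    penalty = (violation ** 2).sum()
    return mse + lam * penalty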






2.4.2 Problem Classes and Evaluation Metrics


Tailored evaluation criteria for each problem class [7].



Linear Programming



  • Baseline: Simplex and interior-point methods [11].

  • Metrics: Feasibility rate and speedup [?].





Mixed-Integer Linear Programming



  • Baseline: Branch-and-Bound [86].

  • Metrics: Integer feasibility and optimality gap [29].



Non-Convex Optimisation



  • Baseline: Multi-start gradient descent [17].

  • Metrics: Convergence probability [50].






2.4.3 Anticipated Challenges and Mitigations


Addressing unique DNN-based optimisation challenges [?].



Constraint Violations



  • Challenge: Infeasible predicted solutions.

  • Mitigation: Differentiable projection layers [3].



Combinatorial Explosion



  • Challenge: Combinatorial growth of MILP search spaces.

  • Mitigation: Curriculum learning [?].



Physics-Informed Learning



  • Challenge: High-dimensional PDE constraints.

  • Mitigation: Hybrid spectral methods [70].






2.4.4 Significance and Expected Contributions


This work seeks to advance the integration of DNNs into optimisation theory and practice through three key contributions:



Theoretical Insights



  • Characterise problems where DNNs outperform classical methods (e.g., real-time optimisation under noisy inputs) [4].

  • Formalise conditions for DNN generalisation across problem instances [?].





Algorithmic Innovations



  • Develop architectures preserving optimisation problem structure [?].

  • Introduce training protocols blending gradient-based learning with classical feasibility heuristics [?].



Practical Impact



  • Demonstrate DNN applicability in high-frequency trading [?] and supply chain logistics [?].

  • Open-source benchmark datasets and model templates [?].




3 Current State of the Art: DNNs in Optimisation



3.1 Methodological Innovations


This subsection surveys architectural and algorithmic breakthroughs that define the current frontier at the intersection of deep learning and optimisation.



3.1.1 Architectural Taxonomies by Problem Class


Modern DNN architectures are increasingly tailored to exploit structural properties of optimisation problems:



  • Linear/Quadratic Programming: - OptNet: Differentiable optimisation layers that embed quadratic programming (QP) solvers into neural networks, enabling end-to-end learning with convex constraints [3]. - DeepLP: Transformer-based models predict active constraints in linear programmes (LPs), reducing solve time by 70% on Netlib benchmarks [53].



  • Mixed-Integer Linear Programming (MILP): - Graph Neural Networks (GNNs): Encode variable-constraint bipartite graphs, achieving 90% accuracy in branch-and-bound variable selection on MIPLIB 2017 [29]. - Neural Diving: Predicts partial integer assignments to warm-start solvers, reducing CPLEX runtime by 50% on scheduling problems [61].

  • Non-Convex Optimisation: - Convex Relaxation Networks: Combine DNNs with convex envelopes to approximate global minima of non-convex functions [16]. - Neural Trust Region: Learns trust region policies for non-linear objectives, outperforming BFGS on Rosenbrock and Ackley functions [8].

  • PDE-Constrained Optimisation: - Physics-Informed Neural Networks (PINNs): Embed partial differential equation (PDE) residuals as soft constraints, solving inverse problems in fluid dynamics 100× faster than finite element methods [70] (a residual-loss sketch follows this list). - Fourier Neural Operators: Learn mappings between function spaces for high-dimensional PDEs, achieving state-of-the-art results on Burgers' equation and Navier-Stokes [52].
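
To make the PINN idea concrete, the sketch below penalises the residual of Burgers' equation, u_t + u·u_x − ν·u_xx = 0, computed via automatic differentiation. The architecture, viscosity, and collocation scheme are illustrative assumptions.

import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
nu = 0.01  # viscosity, chosen for illustration

def pde_residual_loss(xt):
    xt = xt.clone().requires_grad_(True)        # columns: (x, t)
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    residual = u_t + u * u_x - nu * u_xx        # Burgers' equation residual
    return (residual ** 2).mean()               # PDE as a soft constraint

collocation = torch.rand(256, 2)                # random interior points
loss = pde_residual_loss(collocation)           # add data/boundary terms too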





3.1.2 Training Paradigms


Innovative training protocols bridge the gap between classical optimisation and deep learning:



Differentiable Solvers



  • Concept: Unroll iterations of classical algorithms (e.g., proximal gradient, simplex) into neural networks.

  • DC3: Ensures feasibility in constrained optimisation via differentiable projections, achieving 100% constraint satisfaction on portfolio optimisation tasks [?].

  • CvxpyLayer: Embeds convex optimisation problems as differentiable layers, enabling gradient-based tuning of QP parameters [1].
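
The CvxpyLayer interface can be sketched as follows, based on the cvxpylayers project's documented usage; treat the details as assumptions rather than a verified recipe.

import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n = 5
x = cp.Variable(n)
q = cp.Parameter(n)  # the learnable linear cost
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x) + q @ x),
                     [x >= 0, cp.sum(x) == 1])

layer = CvxpyLayer(problem, parameters=[q], variables=[x])

q_t = torch.randn(n, requires_grad=True)
solution, = layer(q_t)        # the forward pass solves the QP
solution.sum().backward()     # gradients flow back into q_t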



Hybrid Classical-DNN Pipelines



  • Learn-to-Cut: Trains DNNs to generate cutting planes for MILP, reducing solver iterations by 60% on combinatorial auctions [66].

  • Warm-Start Generation: DNNs provide initial solutions to Gurobi, cutting runtime by 35% on large-scale supply chain problems [80].
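
In code, the warm-start pattern reduces to writing the DNN's prediction into the solver's MIP start. The sketch below assumes the gurobipy interface and an already-trained predictor; both are illustrative.

import gurobipy as gp

def warm_start_solve(model: gp.Model, predictor, features):
    start = predictor(features)        # DNN-predicted solution vector
    for var, value in zip(model.getVars(), start):
        var.Start = float(value)       # hint for branch-and-bound
    model.optimize()
    return model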



Loss Function Design



  • Lagrangian Learning: Adaptive penalty methods balance constraint violation and objective optimisation, achieving 95% feasibility on non-convex problems [5].





















  • Optimality Gap Loss: Directly minimises ‖f_θ(x) − x*‖ using pre-solved instances, reducing suboptimality by 40% vs. MSE [61].






3.1.3 Key Innovations


Breakthroughs expanding DNNs' role in optimisation:



  • Implicit Layers: Embed optimisation problems (e.g., OSQP for QP) as neural network layers, enabling real-time control in robotics [3].

  • Symmetry Exploitation: Equivariant networks leverage problem symmetries (e.g., rotational invariance in molecular docking), reducing training data needs by 80% [12].

  • Neural Duality: Predicts dual variables for LPs, enabling rapid sensitivity analysis in supply chain optimisation [?].





3.2 Case Studies


This subsection examines practical deployments of DNNs in optimisation.



3.2.1 Industry Applications


DNNs are reshaping optimisation in mission-critical industrial workflows:



  • Logistics & Supply Chain: - Google OR-Tools: Integrated graph neural networks (GNNs) for vehicle routing, reducing delivery costs by 12% across 10,000+ European routes [33]. - UPS Routing: DNNs dynamically adjust delivery routes using real-time traffic data, processing 15,000+ updates per second with 95% feasibility [?].

  • Financial Optimisation: - JPMorgan Portfolio Management: Reinforcement learning (RL)-based DNNs outperformed mean-variance optimisation (MVO) by 8% annualised returns on S&P 500 assets [43]. - High-Frequency Trading: QP-solving DNNs achieve 5 ms latency for real-time arbitrage, 20× faster than interior-point methods (IPMs) [?].

  • Energy Systems: - DeepMind Data Centers: DNNs reduced Google's cooling costs by 40% via non-convex optimisation of HVAC systems [22]. - Tesla Battery Management: Physics-informed neural networks (PINNs) optimise charge/discharge cycles, improving battery longevity by 15% [81].



  • Healthcare Scheduling: - Mayo Clinic: Hybrid MILP-DNN models reduced MRI scheduling delays by 30% while respecting clinician preferences [?].






3.2.2 Benchmark Problems


Standardised benchmarks quantify DNN efficacy against classical solvers:



  • MIPLIB 2017: - Neural Branch & Bound achieves 90% optimality on 30% of instances without branching, vs. 70% for CPLEX heuristics [?]. - Neural Diving solves supply chain instances 5× faster than default Gurobi settings [61].

  • QAPLIB (Quadratic Assignment): - GNNs attain solutions within 2% of optimality in 10 s vs. 100 s for Tabu Search [?]. - DNNs generalise across facility layouts, maintaining 85% accuracy on unseen instances [66].

  • POOLMIP (Supply Chain): - Hybrid DNN-solver pipelines reduced runtime by 50% on 500-vendor procurement problems [?]. - Feasibility rates improved from 75% (classical) to 92% via differentiable projection layers [?].





3.2.3 Lessons Learned


Key takeaways from industrial deployments and benchmarks:



  • Speed-Accuracy Trade-Off: DNNs excel in latency-sensitive applications (e.g., HFT) but lag in certifiable optimality.

  • Data Efficiency: Pre-training on synthetic instances improves real-world generalisation by 25-40% [80].

  • Scalability Limits: Current DNNs struggle with problems exceeding 50k variables due to GPU memory constraints [?].





3.3 Current Comparative Analysis


This subsection evaluates DNNs against classical optimisation methods across speed, accuracy, and scalability, drawing on empirical benchmarks and real-world deployments.



3.3.1 Speed vs. Accuracy Trade-Offs



  • Linear Programming (LP): - DNNs (OptNet): Solve 1,000-variable LPs in 5 ms (vs. 50 ms for Gurobi) but achieve 98% feasibility vs. 100% for simplex [3]. - DeepLP: Predicts active constraints 70% faster than dual simplex on Netlib benchmarks but with 2% suboptimality [53].

  • Mixed-Integer Programming (MILP): - Neural Branch & Bound: Solves MIPLIB instances to 70% optimality in 1 s (vs. 60% for CPLEX in the same time) but struggles beyond 10k variables [29]. - Gurobi Hybrid: Warm-started by DNNs, solves supply chain MILPs 35% faster but requires 10k pre-solved instances for training [80].

  • Non-Convex Optimisation: - Neural Trust Region: Achieves 90% convergence to global minima on Rosenbrock vs. 60% for BFGS but at 2× computational cost [8].






3.3.2 Generalisation Across Problem Scales



  • Synthetic to Real-World: - DNNs trained on synthetic MIPLIB instances generalise to real logistics problems with 80% accuracy, vs. 45% for hand-tuned heuristics [61]. - Fourier Neural Operators: Solve unseen PDEs (e.g., Darcy flow) with 5% error vs. 15% for FEM on coarse grids [52].

  • Noise Robustness: - DNNs tolerate 10% constraint perturbations in portfolio optimisation vs. 2% for IPMs [8]. - PINNs: Maintain 90% accuracy with 20% noisy boundary conditions, vs. 50% for adjoint methods [70].





3.3.3 Feasibility Guarantees



  • Constrained Optimisation: - DC3: Ensures 100% feasibility for convex QPs via differentiable projections, outperforming penalty methods (85% feasibility) [?]. - Lagrangian Learning: Achieves 95% feasibility on non-convex NLPs, vs. 70% for sequential quadratic programming (SQP) [5].

  • Combinatorial Problems: - Neural Diving: Produces 90% integer-feasible solutions for MILP, vs. 75% for random restarts [61]. - Learn-to-Cut: Reduces infeasibility rates by 40% vs. Gomory cuts in CPLEX [66].





3.4 Limitations and Open Gaps in Practice



Despite significant progress, critical challenges persist in deploying DNNs for optimisation. This subsection outlines unresolved practical issues and pathways for future research.



3.4.1 Current Limitations



  • Data Dependency: - DNNs require 10,000+ pre-solved instances for training, whereas classical methods (e.g., simplex, branch-and-bound) need zero [80]. - Example: Neural Branch & Bound achieves 70% accuracy only after training on 15k MIPLIB instances [29].

  • Constraint Guarantees: - No formal feasibility certificates for non-convex or mixed-integer problems [27]. - DC3 ensures 100% feasibility for convex QPs but fails on 30% of non-convex cases [?].

  • Interpretability: - Black-box decisions hinder adoption in regulated sectors (e.g., healthcare, aviation) [?]. - Example: DNNs for ICU scheduling cannot explain why nurse assignments violate labour laws [?].

  • Memory Constraints: - GPU memory limits DNNs to problems with under 50k variables, vs. classical solvers handling millions [?].






3.4.2 Open Research Questions



  • Quantum-DNN Hybrids: - Can quantum annealing escape local minima in DNN loss landscapes? Early experiments show 20% speedup on Ising models [?].

  • Federated Optimisation: - How to train DNNs on distributed optimisation instances (e.g., cross-hospital scheduling) without sharing sensitive data?

  • Certifiable Feasibility: - Can formal verification (e.g., SMT solvers) ensure DNN solutions meet safety-critical constraints (e.g., power grid stability)?

  • Cross-Problem Generalisation: - Can DNNs trained on LPs solve MILPs via few-shot learning? Preliminary results suggest 50% accuracy with 100 examples [?].





3.4.3 Future Directions



  • Algorithmic Innovations: - Differentiable Cutting Planes: Train DNNs to generate facet-defining cuts for MILP [66]. - Stochastic Optimisation: Extend DNNs to handle uncertainty in objective/constraints via robust optimisation [?].

  • Theoretical Advances: - Generalisation Bounds: Develop PAC-Bayes frameworks for optimisation networks [?]. - Convergence Guarantees: Prove rates for DNNs on non-convex landscapes (e.g., Kurdyka-Łojasiewicz conditions) [?].

  • Hardware-Software Co-Design: - Optimise DNN architectures for TPU/GPU constraints (e.g., sparsity-aware training for memory-intensive MILP).
