PointLaSA: Learnable Spatial Anchors for Feature Aggregation in Point Clouds

The paper and code will be released.

Overview

Efficient and robust feature aggregation is essential for point cloud understanding tasks such as classification and segmentation. Most existing methods employ k-nearest neighbor (KNN) search with fixed spatial anchors (FSA), which limits their ability to adapt to complex and diverse geometric patterns. To overcome this limitation, we propose PointLaSA, a novel framework based on Learnable Spatial Anchors for multi-head, multi-scale feature aggregation. Specifically, each attention head predicts an anchor offset to generate geometry-adaptive query points, while different neighborhood sizes across heads capture complementary local contexts. The fused features from all heads yield expressive point representations that are both flexible and discriminative. Extensive experiments on standard benchmarks demonstrate that PointLaSA consistently outperforms fixed-anchor KNN baselines across classification and segmentation tasks.
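As a concrete illustration, the sketch below shows one way the multi-head, multi-scale aggregation with learnable anchor offsets could be realized in PyTorch. All names (LaSAAggregation, index_points) and the per-head neighborhood sizes are our own assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn


def index_points(points, idx):
    # Gather features at the given indices: points (B, N, C), idx (B, ...) -> (B, ..., C).
    B = points.shape[0]
    view = [B] + [1] * (idx.dim() - 1)
    batch = torch.arange(B, device=points.device).view(view).expand_as(idx)
    return points[batch, idx]


class LaSAAggregation(nn.Module):
    """Multi-head KNN aggregation with learnable spatial anchors (illustrative sketch).

    Each head predicts a per-point offset, shifts its query anchor accordingly,
    gathers a head-specific number of neighbors, and max-pools their features.
    """

    def __init__(self, in_dim, out_dim, head_ks=(8, 16, 32)):
        super().__init__()
        self.head_ks = head_ks  # per-head neighborhood sizes (multi-scale)
        self.offset_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 3))
            for _ in head_ks
        )  # one anchor-offset predictor per head
        self.feat_mlp = nn.Sequential(nn.Linear(in_dim + 3, out_dim), nn.ReLU())
        self.fuse = nn.Linear(out_dim * len(head_ks), out_dim)

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) coordinates, feats: (B, N, C) features.
        head_outs = []
        for k, offset_mlp in zip(self.head_ks, self.offset_mlps):
            anchors = xyz + offset_mlp(feats)                                # geometry-adaptive query points
            idx = torch.cdist(anchors, xyz).topk(k, largest=False).indices  # (B, N, k) neighbor indices
            rel_xyz = index_points(xyz, idx) - anchors.unsqueeze(2)          # neighbor coords relative to anchor
            nb_feat = index_points(feats, idx)                               # neighbor features
            local = self.feat_mlp(torch.cat([nb_feat, rel_xyz], dim=-1))
            head_outs.append(local.max(dim=2).values)                        # max-pool over each neighborhood
        return self.fuse(torch.cat(head_outs, dim=-1))                       # fuse complementary heads
```

Under these assumptions, `LaSAAggregation(64, 128)(xyz, feats)` with `xyz` of shape (B, N, 3) and `feats` of shape (B, N, 64) would return fused per-point features of shape (B, N, 128).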

Motivation

Increasingly sophisticated local feature extractors have advanced the field by capturing finer-grained geometric patterns. However, once local geometry is sufficiently characterized, performance gains tend to plateau, and further complicating the extractors yields only marginal improvements. This trend suggests that, rather than relentlessly pursuing more intricate local descriptors, it may be more beneficial to explore how to obtain more appropriate receptive fields, which could have a greater impact on overall network performance.

Method Overview

Overall architecture of PointLaSA. We adopt a U-Net architecture to support both classification and segmentation tasks. In the encoder, a point feature transformation module first projects point features into a higher-dimensional space, followed by downsampling via farthest point sampling. Next, a spatial encoding module augments point descriptors with positional information, enriching local representations. At each Local Aggregation stage, we replace the conventional fixed-anchor KNN with the proposed Learnable Spatial Anchors, enabling geometry-adaptive feature aggregation. The decoder progressively restores spatial resolution through nearest-neighbor interpolation and feature propagation, and task-specific heads are applied for segmentation and classification. Throughout the network, we deliberately avoid complex feature extractors and instead employ lightweight MLPs and max pooling, allocating more computation to neighborhood selection for both efficiency and model compactness.
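To make the stage structure concrete, here is a simplified sketch of one encoder stage under the same assumptions as the previous snippet (it reuses LaSAAggregation and index_points from above). farthest_point_sample is a naive reference implementation; none of these names come from the released code.

```python
import torch
import torch.nn as nn


def farthest_point_sample(xyz, m):
    # Naive farthest point sampling: iteratively keep the point farthest from those already chosen.
    B, N, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(m):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                  # (B, 1, 3) newly chosen point
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))   # distance to nearest chosen point
        farthest = dist.argmax(-1)                                    # next farthest point
    return idx


class EncoderStage(nn.Module):
    """One encoder stage: feature lift -> FPS downsampling -> spatial encoding -> LaSA aggregation."""

    def __init__(self, in_dim, out_dim, n_points):
        super().__init__()
        self.n_points = n_points                                           # points kept after FPS
        self.lift = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())   # lightweight feature transform
        self.pos_enc = nn.Sequential(nn.Linear(3, out_dim), nn.ReLU())     # spatial encoding of coordinates
        self.aggregate = LaSAAggregation(out_dim, out_dim)                 # learnable-anchor aggregation

    def forward(self, xyz, feats):
        feats = self.lift(feats)                                     # project to a higher dimension
        idx = farthest_point_sample(xyz, self.n_points)              # (B, n_points) sampled indices
        xyz_ds = index_points(xyz, idx)                              # downsampled coordinates
        feats_ds = index_points(feats, idx) + self.pos_enc(xyz_ds)   # augment with positional information
        return xyz_ds, self.aggregate(xyz_ds, feats_ds)
```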

Comparison between FSA-KNN and the proposed LaSA-KNN. FSA-KNN aggregates features within a single fixed-scale neighborhood. By contrast, LaSA-KNN introduces learnable anchor shifts to adaptively adjust point positions, and leverages multi-head, multi-scale KNN to construct diverse neighborhoods, enabling richer geometric perception and more discriminative feature extraction.
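For contrast, a fixed-spatial-anchor (FSA) baseline simply queries neighbors around the original, unshifted points at a single scale. A minimal sketch, assuming the same tensor layout as above:

```python
import torch


def fsa_knn(xyz, k):
    # Fixed spatial anchors: the query is the point itself and a single scale k is used.
    # xyz: (B, N, 3) -> neighbor indices (B, N, k).
    return torch.cdist(xyz, xyz).topk(k, largest=False).indices
```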

Experiments

We evaluate our model on four widely-used point cloud benchmarks spanning three representative tasks. For shape classification, we use ModelNet40 and ScanObjectNN; for part segmentation, ShapeNetPart; and for indoor scene semantic segmentation, ScanNet v2.

Network configurations used for different datasets.

Dataset Feature dimensions (per stage) Points (per stage)
ScanObjectNN 96-192-384 64-256-1024
ModelNet40 96-192-384 64-256-1024
ShapeNetPart 96-192-320-512 64-192-512-2048
ScanNet v2 64-96-160-288-512 --
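The same settings could be expressed as per-dataset configuration dictionaries. The structure and key names below are purely illustrative, and the values simply mirror the table (no per-stage point counts are listed for ScanNet v2, so that entry is left as None).

```python
# Illustrative per-dataset configurations mirroring the table above.
CONFIGS = {
    "ScanObjectNN": {"dims": [96, 192, 384], "points": [64, 256, 1024]},
    "ModelNet40":   {"dims": [96, 192, 384], "points": [64, 256, 1024]},
    "ShapeNetPart": {"dims": [96, 192, 320, 512], "points": [64, 192, 512, 2048]},
    "ScanNetV2":    {"dims": [64, 96, 160, 288, 512], "points": None},
}
```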

Classification results on ModelNet40

OA: overall accuracy. mAcc: mean per-class accuracy.

Method OA (%) mAcc (%)
PointNet 89.2 86.0
PointNet++ 93.0 90.7
DGCNN 92.9 90.2
PCT 93.2 -
PTv1 93.7 90.6
CurveNet 93.8 -
PointMLP 94.1 91.3
GBNet 93.8 91.0
PointNeXt 94.0 91.1
PTv2 94.2 91.6
DualMLP 93.7 -
PointLaSA (ours) 94.2 91.8

Classification results on ScanObjectNN

OA: overall accuracy. mAcc: mean per-class accuracy.

Method OA (%) mAcc (%)
PointNet 75.2 71.4
PointNet++ 86.2 84.4
DGCNN 86.1 84.3
PointMLP 87.7 86.4
PointNeXt 88.2 86.8
SPoTr 88.6 86.8
point2vec 87.5 86.0
CurveCloud 89.1 86.5
KPConvX 89.3 88.1
PointBERT* 83.1 -
PointMAE* 85.2 -
PPT* 89.5 -
PointLaSA (ours) 90.2 89.2

Part segmentation results (mean IoU %) on ShapeNetPart.
“Ins. mIoU” denotes instance-average mIoU; “Cls. mIoU” denotes class-average mIoU.

Method Ins. mIoU Cls. mIoU
PointNet 83.7 80.4
PointNet++ 85.1 81.9
DGCNN 85.2 82.3
PTv1 86.6 83.7
AGCN 86.9 85.1
RSCNN 86.2 84.0
CurveNet 86.8 84.2
OTMae3D 86.8 85.1
PointJEPA 83.9 85.8
AdaCrossNet - 85.1
PointLaSA (ours) 87.4 85.1

Semantic segmentation results on ScanNet v2.
Mean IoU (%) on validation and test sets.

Method Val mIoU Test mIoU
O-CNN 74.0 72.8
PointNet++ 53.5 33.9
PointConv 61.0 55.6
KPConv 69.2 68.0
Minkowski 72.2 73.4
SFormer 74.3 73.7
BPNet 73.9 74.9
PTv2 75.4 75.2
KPConvX-L 76.2 75.6
PointLaSA (ours) 76.0 75.4

Ablation study on PointLaSA components

✓ = included, ✗ = excluded. Metrics are classification accuracy (OA) on ModelNet40, instance mIoU on ShapeNetPart, and mIoU on ScanNet v2.

No. Offset Multi-Head Multi-Scale Multi-Dim ModelNet40 (OA) ShapeNet (Ins. mIoU) ScanNet v2 (mIoU)
1 94.1% 87.4% 75.4%
2 94.2% 87.2% 75.1%
3 94.2% 87.1% 75.0%
4 94.0% 87.0% 74.8%
5 93.9% 87.0% 74.6%
6 94.0% 87.1% 74.6%
7 93.8% 87.2% 74.4%
8 93.7% 87.0% 74.4%
9 93.6% 86.9% 74.6%
10 93.4% 86.8% 74.2%