Yes? Maybe, depending on what you mean.
Let's say you're doing a job that involves reading 1M records or something. Pagination means you grab N records at a time, say 1000, across multiple queries instead of pulling everything in one go.
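Roughly, in C# (just a sketch; the `Records` table, `Id`/`Payload` columns, and connection string are made up, and it assumes an indexed, ascending `Id`):

```csharp
using Microsoft.Data.SqlClient;

// Keyset-pagination sketch: fetch 1000 rows per round trip instead of
// pulling all 1M at once. Names/schema here are invented for illustration.
const int BatchSize = 1000;
var connectionString = "..."; // your connection string
long lastId = 0;

using var conn = new SqlConnection(connectionString);
conn.Open();

while (true)
{
    using var cmd = new SqlCommand(
        @"SELECT TOP (@batch) Id, Payload
          FROM Records
          WHERE Id > @lastId
          ORDER BY Id", conn);
    cmd.Parameters.AddWithValue("@batch", BatchSize);
    cmd.Parameters.AddWithValue("@lastId", lastId);

    var rowsInPage = 0;
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            lastId = reader.GetInt64(0); // remember where this page ended
            // ... hand the row off to your processing step here ...
            rowsInPage++;
        }
    }

    if (rowsInPage < BatchSize) break; // short page means we're done
}
```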
Reading your post again to try and get context, it looks like you're identifying duplicates as part of a job.
I don't know what you're using to determine a duplicate, whether it's structural or not, but since you're running on HDDs, it might be faster to pull that information into RAM and then do the job in batches and update in batches. This will also let you do things like writing to the DB while doing CPU processing.
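The write side, batched, might look something like this (again a sketch with invented names, and assuming SQL Server while you're on it; SQL Server caps parameters per command around 2100, so 1000 per chunk stays under that):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.SqlClient;

// Sketch: flush the IDs you've flagged as duplicates in chunks, so each
// round trip to the DB does real work. Table/column names are invented.
static void DeleteInBatches(string connectionString, List<long> duplicateIds, int batchSize = 1000)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    for (var i = 0; i < duplicateIds.Count; i += batchSize)
    {
        var chunk = duplicateIds.Skip(i).Take(batchSize).ToList();

        // Build "@p0, @p1, ..." placeholders for this chunk's IN clause.
        var placeholders = string.Join(", ", chunk.Select((_, j) => $"@p{j}"));
        using var cmd = new SqlCommand(
            $"DELETE FROM Records WHERE Id IN ({placeholders})", conn);
        for (var j = 0; j < chunk.Count; j++)
            cmd.Parameters.AddWithValue($"@p{j}", chunk[j]);

        cmd.ExecuteNonQuery();
    }
}
```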
BTW, your hard disks are going to be your bottleneck unless you're reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That'll blow any other suggestion I have out of the water.
BUT! There are ways to help things out. I don't know what language you're working in; I'm a dotnet dev, so I can answer some things from that perspective.
One thing you may want to do, especially if there's other traffic on this server: if your job can tolerate dirty reads, read with NOLOCK (READ UNCOMMITTED) so your big scans aren't fighting everyone else's locks.
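For illustration, the hint slots straight into a query like the paged read above (SQL Server syntax; names still invented):

```csharp
using Microsoft.Data.SqlClient;

// WITH (NOLOCK): the scan takes no shared locks, so it neither blocks
// nor waits on other writers. The trade-off is that rows can come back
// dirty, doubled, or skipped; only use it when the job tolerates that.
using var conn = new SqlConnection("..."); // your connection string
conn.Open();
using var cmd = new SqlCommand(
    @"SELECT TOP (1000) Id, Payload
      FROM Records WITH (NOLOCK)
      WHERE Id > @lastId
      ORDER BY Id", conn);
cmd.Parameters.AddWithValue("@lastId", 0L);
using var reader = cmd.ExecuteReader();
```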
Use a HashSet (this can work if you have record types) or some other method of equality that's property-based. Many Dictionary/HashSet types can take some kind of equality comparer.
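For example (records get value-based Equals/GetHashCode generated for them, so the set deduplicates by content; the fields are invented):

```csharp
using System;
using System.Collections.Generic;

var rows = new[]
{
    new CustomerRow("Ada", "ada@example.com"),
    new CustomerRow("Ada", "ada@example.com"), // same values => duplicate
    new CustomerRow("Grace", "grace@example.com"),
};

var seen = new HashSet<CustomerRow>();
var duplicates = new List<CustomerRow>();

foreach (var row in rows)
{
    // HashSet.Add returns false when an equal element is already present.
    if (!seen.Add(row))
        duplicates.Add(row);
}

Console.WriteLine($"{duplicates.Count} duplicate(s)"); // prints "1 duplicate(s)"

// Records compare by value, which is what makes the HashSet dedupe by
// content rather than by reference.
public record CustomerRow(string Name, string Email);
```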
So, what you can do is asynchronously read from the disk into memory and start some kind of processing job. If that job doesn't also require the disk, you can do another read while you're processing. Don't do a write and a read at the same time, since you're on HDDs.
This might look something like:
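Roughly this, in C#; `ReadBatchAsync`/`Process`/`WriteResultsAsync` are stand-ins for your own steps, not a real API:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Overlap sketch: kick off the next disk read, do CPU work on the batch
// you already have, and only write once that read has finished, so the
// HDD never services a read and a write at the same time.
var nextRead = ReadBatchAsync(offset: 0);
var offset = 0;

while (true)
{
    var batch = await nextRead;            // batch i is fully in memory
    if (batch.Count == 0) break;
    offset += batch.Count;

    nextRead = ReadBatchAsync(offset);     // disk: start fetching batch i+1
    var results = Process(batch);          // CPU: overlaps with that read

    var upcoming = await nextRead;         // let the read finish first...
    await WriteResultsAsync(results);      // ...then do the write
    nextRead = Task.FromResult(upcoming);  // hand batch i+1 to the next loop
}

// --- stand-ins so the sketch compiles ---
static Task<List<string>> ReadBatchAsync(int offset) =>
    Task.FromResult(new List<string>());     // e.g. a paged SELECT

static List<string> Process(List<string> batch) =>
    batch.Where(r => r.Length > 0).ToList(); // your dedup/CPU step

static Task WriteResultsAsync(List<string> results) =>
    Task.CompletedTask;                      // e.g. a batched UPDATE/DELETE
```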
Yes, we are currently in the process of migrating to PostgreSQL and to new hardware. Nonetheless, the approach we are using is a disaster, so we will refactor our approach as well. Appreciate your input.
All processing and SQL-related transactions are executed via Python, but that should not have any influence, since the SQL server is the bottleneck.
Yes, I have already considered this for the next update, since our setup can accept dirty reads, but I have not tested/quantified any benefits yet.
While I understand the underlying issue here, I do not yet know how to control this, since we have multiple microservices connected to the DB which either fetch (read), write, or delete from different tables. But to my understanding, since I am currently not using NOLOCK, such occurrences should be handled by SQL, no? What I mean is that during a process the object is locked, so no other process can interfere with the SQL object?
Thanks for putting this together, I will review it again tomorrow (Y).
Thanks for giving it a good read-through! If you're getting onto NVMe SSDs, you may find some of your problems just go away. The difference could be insane.
I was reading something recently about databases or disk layouts meant for business applications vs. ones meant for reporting, and one difference was that on disk the data was laid out either by row or by column.
That was a bit of a hasty write-up, so there are probably some issues with it, but that's the gist.
BTW, nice username.
Thanks haha