How would you detect thread safety issues?
- Favor immutable objects as they are inherently thread-safe.
- If you need to use mutable objects, and share them among threads, then a key element of thread-safety is locking access to shared data while it is being operated on by a thread. For example, in Java you can use the synchronized keyword.
- Generally try to keep your locking for as shorter duration as possible to minimize any thread contention issues if you have many threads running. Putting a big, fat lock right at the start of the function and unlocking it at the end of the function is useful on functions that are rarely called, but can adversely impact performance on frequently called functions. Putting one or many larger locks in the function around the data that actually need protection is a finer grained approach that works better than the coarse grained approach, especially when there are only a few places in the function that actually need protection and there are larger areas that are thread-safe and can be carried out concurrently.
- Use proven concurrency libraries (e.g. java.util.concurrency) as opposed to writing your own. Well written concurrency libraries provide concurrent access to reads, while restricting concurrent writes.
Fixing production issues
There could be general runtime production issues that either slow down or make a system to hang. In these situations, the general approach for troubleshooting would be to analyze the thread dumps to isolate the threads which are causing the slow-down or hang. For example, a Java thread dump gives you a snapshot of all threads running inside a Java Virtual Machine. There are graphical tools like Samurai to help you analyze the thread dumps more effectively.
- Application seems to consume 100% CPU and throughput has drastically reduced – Get a series of thread dumps, say 7 to 10 at a particular interval, say 5 -8 seconds and analyze these thread dumps by inspecting closely the “runnable” threads to ensure that if a particular thread is progressing well. If a particular thread is executing the same method through all the thread dumps, then that particular method could be the root cause. You can now continue your investigation by inspecting the code.
- Application consumes very less CPU and response times are very poor due to heavy I/O operations like file or database read/write operations – Get a series of thread dumps and inspect for threads that are in “blocked” status. This analysis can also be used for situations where the application server hangs due to running out of all runnable threads due to a deadlock or a thread is holding a lock on an object and never returns it while other threads are waiting for the same lock.