NavTrans Errors - 2024
Current version as of 10 Jun 2024, 08:36
In spring 2024, we experienced sudden and frequent crashes of a MariaDB server, and equally frequent corruption of databases, on a webserver hosting a number of WooCommerce sites. The symptoms seemed very similar to the Transients Layered Nav Counts problem we had in autumn 2023 - though it could be something entirely different. However, Occam's razor suggests it is something quite similar, especially since we haven't had adequate measurements in place since autumn 2023; only some quick fixes.
Brainstorm
- NavTransients are a problem independent of the number of filters: when any single NavTrans gets too big, it stops working in the interface, and table-locking problems appear
First symptoms
What were the very first things that indicated that there was a problem?
In practice, a bunch of things may happen seemingly at the same time, or in such short succession, that it isn't always possible or useful to isolate what came first. Additionally, this question can quickly become philosophical when considering what is cause and what is effect. Here, by symptoms we mean the first things that the people working with the system noticed.
Date | Event | Notes
---|---|---
2024.04.25 | Site nl_nl had database-connectivity problems → database server was down | Database-connectivity problems and a database server being down are both symptoms rather than causes in this context
2024.05.27 | Database server was down with internal errors according to its log | From symptom to cause: what causes these crashes?
2024.05.02-06 | Individual sites don't work due to corrupted databases | From symptom to cause: what causes these crashes?
2024.05.08 | |
Note that these are actually two different symptoms:
- Corrupted database
- Crashed database servers.
For now, let's assume they are related to the same underlying cause.
Also note that the events happened at several different moments. For now, let's assume it's one big happy problem.
What happened next
At the moment of writing (2024.05), it isn't clear what the chains of cause-and-effect look like, or what the root cause or causes are. But let's try to figure out what's directly behind the aforementioned symptoms.
Database corruption
I don't remember exactly what the situation was, and it's too much work to figure it out now:
- When trying to drop existing databases, I got errors that certain files couldn't be deleted
- Something went wrong when trying to read tables?
Ah, I found something:
[02-May-2024 21:43:22 UTC] WordPress database error Table 'example_autralia_com.wp_posts' doesn't exist in engine for query SHOW FULL COLUMNS FROM `wp_posts` made by wp_install, wp_install_defaults
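To get an overview of which tables are affected, the "doesn't exist in engine" messages can be extracted from the log. A minimal sketch - for self-containment it runs against an embedded sample line rather than the real log file (whose path would be the site's PHP error log, an assumption):

```shell
# Extract the distinct table names that WordPress reports as missing "in engine".
# In production, replace the embedded sample with the actual PHP error log.
sample="[02-May-2024 21:43:22 UTC] WordPress database error Table 'example_autralia_com.wp_posts' doesn't exist in engine for query SHOW FULL COLUMNS FROM ..."

tables=$(printf '%s\n' "$sample" \
  | grep -o "Table '[^']*' doesn't exist in engine" \
  | cut -d"'" -f2 \
  | sort -u)
printf '%s\n' "$tables"
```

Against a real log, `grep -o ... /var/log/php-errors.log | sort -u` (log path an assumption) would list every broken table at once.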
Database crashes
As an example, this is the part of the MySQL error log (/var/lib/mysql/error.log) around the crash on 2024.05.08-12:05:
2024-05-08 9:53:19 0 [Note] /usr/sbin/mariadbd: ready for connections. Version: '10.6.16-MariaDB-0ubuntu0.22.04.1' socket: '/run/mysqld/mysqld.sock' port: 3306 Ubuntu 22.04
2024-05-08 9:53:19 0 [Note] InnoDB: Rolled back recovered transaction 46179644
2024-05-08 9:53:19 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-08 9:53:19 9 [ERROR] InnoDB: Table mysql.innodb_table_stats not found.
2024-05-08 10:31:19 0 [Note] Starting MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision as process 381314
2024-05-08 10:31:19 0 [Note] mariadbd: Aria engine: starting recovery
tables to flush: 3 2 1 0 (0.0 seconds);
2024-05-08 10:31:19 0 [Note] mariadbd: Aria engine: recovery done
2024-05-08 10:31:19 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-08 10:31:19 0 [Note] InnoDB: Number of pools: 1
2024-05-08 10:31:19 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-08 10:31:19 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2024-05-08 10:31:19 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-08 10:31:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=14370627528336,14370674627487
2024-05-08 10:31:20 0 [Note] InnoDB: 3 transaction(s) which must be rolled back or cleaned up in total 3 row operations to undo
2024-05-08 10:31:20 0 [Note] InnoDB: Trx id counter is 46412788
2024-05-08 10:31:20 0 [Note] InnoDB: To recover: 3282 pages
2024-05-08 10:31:20 0 [Note] InnoDB: 128 rollback segments are active.
2024-05-08 10:31:20 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2024-05-08 10:31:20 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2024-05-08 10:31:20 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2024-05-08 10:31:20 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2024-05-08 10:31:20 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2024-05-08 10:31:20 0 [Note] InnoDB: 10.6.16 started; log sequence number 14370682353825; transaction id 46412789
2024-05-08 10:31:20 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-05-08 10:31:20 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-08 10:31:20 0 [Note] InnoDB: Cannot open '/var/lib/mysql/ib_buffer_pool' for reading: No such file or directory
2024-05-08 10:31:20 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-05-08 10:31:20 0 [Note] Server socket created on IP: '::'.
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46406898
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46412786
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46412784
2024-05-08 10:31:20 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-08 10:31:20 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1932: Table 'mysql.gtid_slave_pos' doesn't exist in engine
2024-05-08 10:31:20 0 [Note] /usr/sbin/mariadbd: ready for connections. Version: '10.6.16-MariaDB-0ubuntu0.22.04.1' socket: '/run/mysqld/mysqld.sock' port: 3306 Ubuntu 22.04
2024-05-08 10:31:20 4 [ERROR] InnoDB: Table mysql.innodb_table_stats not found.
2024-05-08 12:05:39 0x7fed1c681640 InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
InnoDB: Failing assertion: !block->page.in_file()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
240508 12:05:39 [ERROR] mysqld got signal 6 ;
Sorry, we probably made a mistake, and this is a bug.
Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.
Server version: 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision:
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=192
max_threads=302
thread_count=192
It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 796155 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7febe8000c68
Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong...
stack_bottom = 0x7fed1c680c38 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x555904162e82]
/usr/sbin/mariadbd(handle_fatal_signal+0x478)[0x555903c97298]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7feda2afe520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7feda2b529fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7feda2afe476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7feda2ae47f3]
/usr/sbin/mariadbd(+0x66b7ee)[0x55590392d7ee]
/usr/sbin/mariadbd(+0x675564)[0x555903937564]
/usr/sbin/mariadbd(+0xdd1705)[0x555904093705]
/usr/sbin/mariadbd(+0xe07cbb)[0x5559040c9cbb]
/usr/sbin/mariadbd(+0xe0bb5c)[0x5559040cdb5c]
/usr/sbin/mariadbd(+0xe0c320)[0x5559040ce320]
/usr/sbin/mariadbd(+0xdb5199)[0x555904077199]
/usr/sbin/mariadbd(+0xd6fe12)[0x555904031e12]
/usr/sbin/mariadbd(+0xd72c4b)[0x555904034c4b]
/usr/sbin/mariadbd(+0xd746b0)[0x5559040366b0]
/usr/sbin/mariadbd(+0xd53b28)[0x555904015b28]
/usr/sbin/mariadbd(+0xcb9b06)[0x555903f7bb06]
/usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x23a)[0x555903ca5f4a]
/usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1adc)[0x555903b31fcc]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x23b3)[0x555903a68b83]
/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x1d8)[0x555903a6b988]
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0xbcd)[0x555903a6d52d]
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x131)[0x555903a6f031]
/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3af)[0x555903b7657f]
/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x555903b768fd]
/usr/sbin/mariadbd(+0xc0fa26)[0x555903ed1a26]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7feda2b50ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7feda2be2850]
Trying to get some variables. Some pointers may be invalid and cause the dump to abort.
Query (0x7fea9c000070): UPDATE `wp_options` SET `option_value` = 'a:44660:{i:0;b:0;s:32:\"602d...
Connection ID (thread ID): 28743
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 514201 514201 processes
Max open files 32768 32768 files
Max locked memory 524288 524288 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 514201 514201 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
Kernel version: Linux version 5.15.0-105-generic (buildd@lcy02-amd64-007) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024
2024-05-08 12:11:21 0 [Note] Starting MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision as process 864516
2024-05-08 12:11:21 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-08 12:11:21 0 [Note] InnoDB: Number of pools: 1
2024-05-08 12:11:21 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-08 12:11:21 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2024-05-08 12:11:21 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-08 12:11:21 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=14626521631583,14626555390524
2024-05-08 12:11:21 0 [ERROR] InnoDB: Corrupted page identifier at 14626576696142; set innodb_force_recovery=1 to ignore the record.
2024-05-08 12:11:21 0 [Note] InnoDB: Dump from the start of the mini-transaction (LSN=14626576696142) to 100 bytes after the record: len 117; hex 388...
:"9948eb07";a:;
2024-05-08 12:11:21 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2024-05-08 12:11:21 0 [Note] InnoDB: Starting shutdown...
2024-05-08 12:11:21 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2024-05-08 12:11:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-08 12:11:21 0 [ERROR] Unknown/unsupported storage engine: InnoDB
2024-05-08 12:11:21 0 [ERROR] Aborting
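Excerpts like this have a low signal-to-noise ratio; filtering for [ERROR] and assertion markers pulls out the fatal lines. A self-contained sketch, run against an embedded three-line excerpt (in practice one would grep /var/lib/mysql/error.log directly):

```shell
# Reduce a MariaDB error-log excerpt to the fatal lines only.
excerpt='2024-05-08 10:31:20 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-08 10:31:20 4 [ERROR] InnoDB: Table mysql.innodb_table_stats not found.
2024-05-08 12:05:39 0x7fed1c681640 InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285'

fatal=$(printf '%s\n' "$excerpt" | grep -E '\[ERROR\]|Assertion failure')
fatal_count=$(printf '%s\n' "$excerpt" | grep -cE '\[ERROR\]|Assertion failure')
printf '%s\n' "$fatal"
```

The same filter on the full log quickly shows whether the assertion failure recurs across restarts.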
Two things stood out from this log:
- The line Query (0x7fea9c000070)...: that was a NavTrans update query
- The line near the bottom, :"9948eb07";a:; : that, too, was part of a NavTrans update query.
Is this a separate problem, a symptom, or just additional data? Maybe.
- When a single transient becomes too big, the corresponding filter stops working on the site
- So it is not only about the total number of NavTransients. Hence, simply reducing the total number of filters on a site isn't a sufficient solution
- This happened on multiple servers, so it doesn't seem to be caused by something that's wrong on one particular server
- This might also indicate that these NavTrans are basically a bad idea, and that not having them at all would probably be an improvement.
- The error depicted in the image was generated again at 2024.05.10-07:59
Additional information
About sizes of various NavTransients:
wp transient list --fields=name,expiration --human-readable --search=wc_layered_nav_counts_pa_breedte
wp transient get wc_layered_nav_counts_pa_breedte > /home/jeroen/navtrans-breedte.txt → only 414.7 kB
wp transient get wc_layered_nav_counts_pa_breedte --format=json > /home/jeroen/navtrans-breedte-json.txt → only 183 kB
wp transient get wc_layered_nav_counts_pa_submodel > /home/jeroen/navtrans-submodel.txt → 1.2 MB - still working
wp transient get wc_layered_nav_counts_productlijn > /home/jeroen/navtrans-productlijn.txt → 420 kB
Sizes according to script db_size_monitor:
2024-05-10-08:00 - db: 652.95MB. wp_options (storage): 357M. wp_options: 92.30MB. Tr: 91.01MB. NavTr: 85.88MB. #NavTr: 16 → 5.36MB/NavTr
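The per-NavTrans average in that line can be recomputed from the two trailing fields. A small awk sketch (the field layout is assumed to be exactly the format above; note it rounds to two decimals, so it prints 5.37 where the monitor line shows 5.36):

```shell
# Recompute MB per NavTrans from a db_size_monitor output line.
line='2024-05-10-08:00 - db: 652.95MB. wp_options (storage): 357M. wp_options: 92.30MB. Tr: 91.01MB. NavTr: 85.88MB. #NavTr: 16'

avg=$(printf '%s\n' "$line" | awk '{
  # total NavTrans size follows the "NavTr:" token, the count follows "#NavTr:"
  for (i = 1; i <= NF; i++) {
    if ($i == "NavTr:")  { gsub(/MB\.?/, "", $(i+1)); size = $(i+1) }
    if ($i == "#NavTr:") { count = $(i+1) }
  }
  printf "%.2f", size / count
}')
echo "${avg} MB/NavTr"
```

Tracking this average over time would show whether individual transients grow, or only their number.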
Cause: Out-of-memory
The PHP log contains endless errors like these:
[10-May-2024 06:17:01 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57989641 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:03 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57989641 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
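To size up this flood quickly, one can count the OOM lines and find the largest single allocation attempt. A self-contained sketch with a two-line embedded sample (in production, feed it the real PHP error log instead):

```shell
# Summarize PHP out-of-memory fatals: how many, and the largest allocation attempt.
oomlog='[10-May-2024 06:17:01 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57989641 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349'

summary=$(printf '%s\n' "$oomlog" | awk '
  /Allowed memory size .* exhausted/ {
    n++
    # the attempted allocation is the field right after "allocate"
    for (i = 1; i <= NF; i++)
      if ($i == "allocate" && $(i+1) > max) max = $(i+1)
  }
  END { printf "%d OOM lines, largest attempt %d bytes", n, max }')
echo "$summary"
```

If the largest attempt stays in the tens of megabytes while the limit is 1 GB, the exhaustion comes from many accumulated allocations, not one oversized read.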
Cause: Object not loaded in time
The log line below appeared only once in the last 100 lines. My liberal interpretation is that some object became too big to load in time for execution - I can well imagine this being part of the NavTrans problem discussed here:
[10-May-2024 06:16:36 UTC] PHP Fatal error: mysqli_real_connect(): Cannot use output buffering in output buffering display handlers in /var/www/example.nl/wp-includes/class-wpdb.php on line 1990
[10-May-2024 06:16:36 UTC] PHP Fatal error: Uncaught Error: mysqli object is not fully initialized in /var/www/example.nl/wp-includes/class-wpdb.php:1283 Stack trace: ...
Temporary solutions: Delete transients + optimize wp_options
All issues and the associated PHP log lines stop upon execution of this:
wp transient list --fields=name --format=csv | grep wc_layered_nav_counts_pa_ | cut -d',' -f1 | xargs -n 1 wp transient delete
Along the way, optimizing table wp_options seems to help avoid other NavTrans-related problems:
optimize table wp_options;
These are both temporary solutions, as they address symptoms, not the underlying problem of NavTrans growing out of whack.
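The two fixes above could be combined into a scheduled job. A hypothetical sketch (the script name, site path, and schedule are all assumptions, and - as noted below - hourly may be more aggressive than is wise):

```shell
#!/bin/sh
# cleanup-navtrans.sh - hypothetical periodic cleanup; site path is an assumption.
# Requires wp-cli, run as the site user.
cd /var/www/example.nl || exit 1

# Delete all NavTrans transients (same pipeline as the manual fix above).
wp transient list --fields=name --format=csv \
  | grep wc_layered_nav_counts_pa_ \
  | cut -d',' -f1 \
  | xargs -n 1 wp transient delete

# Reclaim the space they occupied in wp_options.
wp db query "OPTIMIZE TABLE wp_options;"
```

A crontab entry such as `0 * * * * /usr/local/bin/cleanup-navtrans.sh` would run it hourly; given the doubt expressed below about hourly runs, a longer interval might be the better default.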
Additional information
One mechanism:
- WooCommerce updates more and more data in transients in table wp_options
- These updates are so frequent that table wp_options grows uncontrollably - in November 2023 at a rate of some 2 GB/minute
- Eventually, it takes MySQL too long to lock such a row or field, and locking time-outs occur
- When locking time-outs occur, those SQL updates don't happen anymore, and the complete SQL update statements are copied to the PHP error log
- Instead of the database, the PHP error log now grows at about 2 GB/minute (autumn 2023).
In autumn 2023, a quick fix was to remove these transients every hour and optimize table wp_options. I doubt whether it's a good idea to do that every hour.
Solutions from ChatGPT
On WordPress/WooCommerce sites, I have a bunch of filters on the shop page. Amongst others, WooCommerce produces transients for these filters that start with the name `wc_layered_nav_counts_pa_`. These transients are stored in table `wp_options`. These transients grow out of proportions and make servers crash and filters not work. How to solve this?
Excessive transients can indeed be a performance issue on WordPress/WooCommerce sites, especially if they're not being managed properly. Here are a few steps you can take to address this problem:
Increase Transient Expiry Time: By default, transients in WordPress expire after a certain period. You can increase the expiry time for these transients so they're not regenerated as frequently. You can do this using the set_transient() function with a longer expiration time. Use a Transient Management Plugin: There are plugins available specifically for managing transients on WordPress sites. These plugins can help you monitor, clean up, and optimize transient data. Manually Delete Unused Transients: If you're comfortable with PHP and WordPress, you can manually delete unused transients from the database. You can use the delete_transient() function to remove transients that are no longer needed. Optimize Queries and Code: Sometimes, excessive transients can be a symptom of inefficient code or database queries. Review your theme and plugin code to ensure that transients are being used appropriately and that queries are optimized for performance. Cache Queries Instead of Using Transients: Depending on your specific use case, caching the results of queries directly may be more efficient than using transients. You can use caching plugins like WP Super Cache or W3 Total Cache to implement query caching. Consider Offloading to a Separate Server: If your site is receiving a large amount of traffic and generating a significant number of transients, you may want to consider offloading some of the workload to a separate server or using a caching service like Redis or Memcached. Monitor Server Resources: Keep an eye on your server's resource usage to identify any issues caused by excessive transients. You may need to upgrade your server resources or optimize your server configuration to handle the load more efficiently.
And specifically for these "NavTrans" transients?
To specifically address the issue with the wc_layered_nav_counts_pa_ transients in WooCommerce, you can take the following steps:
- Increase Expiry Time: As with any transient, you can increase the expiry time for these specific transients using the set_transient() function with a longer expiration time. This will reduce the frequency at which these transients are regenerated.
- Use a Transient Management Plugin: Look for WordPress plugins designed to manage transients, some of which might offer specific tools for handling WooCommerce transients. These plugins can help you monitor and clean up unused transients, including those related to WooCommerce layered navigation.
- Manually Delete Unused Transients: If you're comfortable working with PHP and MySQL, you can manually delete specific transients from the database. You can identify and delete wc_layered_nav_counts_pa_ transients that are no longer needed.
- Optimize Layered Navigation: Review your layered navigation setup in WooCommerce. If you have excessive filters or unnecessary variations, consider simplifying or optimizing them to reduce the number of transients being generated.
- Cache Layered Navigation Results: Instead of relying solely on transients, you can implement caching mechanisms specifically for layered navigation results. This can involve caching the results of queries or pre-generating navigation data to reduce the reliance on transient storage.
- Monitor and Test: Continuously monitor your site's performance and test changes to ensure they're effective. Keep an eye on server resources, database queries, and overall site speed to identify any remaining issues related to transients.
Questions
- What causes database crashes?
- What causes database corruptions?
- How to solve the NavTrans problem? → Start with the table at NavTrans Problem or do some fresh research.