[00:00:26] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini.rb, ordered_yaml.rb and php_ini.rb to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:02:23] (PS2) Paladox: wmflib: Migrate ini.rb, ordered_yaml.rb and php_ini.rb to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[00:02:46] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini.rb, ordered_yaml.rb and php_ini.rb to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:02:53] (PS3) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[00:03:39] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:04:07] hmm fatal: Could not read from remote repository.
[00:05:22] (PS4) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[00:05:35] (PS5) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[00:05:42] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:06:31] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:06:50] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:07:25] (PS6) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[00:07:29] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:08:09] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:08:32] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[00:11:25] (CR) Paladox: "This works locally (running live on a prod system i help manage)." [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[02:13:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:23:24] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:33:04] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:35:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:34:14] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:34:34] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:34:46] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:38:08] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[04:00:52] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:04:12] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:05:34] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:54] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:10:11] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[04:10:56] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[04:13:45] (PS7) Paladox: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518
[04:14:57] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[06:29:32] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:14] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:49:50] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (Marostegui) @Cmjohnson db1114 crashed again with the same memory errors on the same slots, so it looks like the mainboard memory slots aren't healthy? ` Record: 1 Date...
[07:10:00] Operations, Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) mysql crashed last night: ` Thread pointer: 0x0x0 Attempting backtrace. You can use the following information to find out where mysql...
[07:14:07] (PS1) Zoranzoki21: Added new protection levels for dewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/492534 (https://phabricator.wikimedia.org/T216885)
[07:52:57] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) I missed the fact that the last patch mer...
[07:53:21] marostegui: dbstore1002 is sensing that something will happen soon :D
[08:14:10] elukey: yes, it is trying to get our attention
[08:16:09] poor dbstore1002
[08:27:30] PROBLEM - HHVM rendering on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:46] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: No such file or directory
[08:28:34] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 75159 bytes in 0.251 second response time
[10:13:12] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:20:24] RECOVERY - Disk space on an-coord1001 is OK: DISK OK
[10:22:27] !log force remount of /mnt/hdfs on an-coord1001 (fuse-hdfs stuck)
[10:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:18] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[11:39:08] (CR) MarcoAurelio: [C: -1] "Cf. commit message comments." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: Sau226)
[11:59:19] (PS3) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[11:59:40] (CR) Sau226: "Review implemented. Thanks for the heads up" [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: Sau226)
[13:09:45] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (jcrespo) It crashed again in less than 30 minutes after generating load: ` 2019-02-24T13:07:00-0600 USR0030 Successfully logged in using root, from 10.64.32.25 and...
[13:16:30] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:17:34] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 75982 bytes in 0.254 second response time
[15:19:58] (CR) Urbanecm: [C: -1] Added new protection levels for dewiktionary (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/492534 (https://phabricator.wikimedia.org/T216885) (owner: Zoranzoki21)
[15:34:04] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:00:12] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[18:15:36] !log clean up 2017/2018 log files in /var/log/jmxtrans - root partition almost filled up
[18:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:46] on kafka1012
[18:15:50] amending sal
[18:20:10] !log clean up 2017/2018 log files in /var/log/jmxtrans on kafka1013-22 - root partitions filling up
[18:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:25] will send a patch tomorrow for the jmxtrans logging
[21:35:15] Operations, ORES, Scoring-platform-team: [Discuss] ORES without celery - https://phabricator.wikimedia.org/T216838 (Ladsgroup) Technically there is no need to drop separation of IO requests from CPU parallelization. EventBus is based on kafka and at the end kafka is the same as celery ([[https://kafk...
[22:42:30] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28688 MB (5% inode=99%)
[22:46:12] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28256 MB (5% inode=99%)
[22:49:50] 0.0
[22:51:04] RECOVERY - Disk space on elastic1025 is OK: DISK OK