[00:33:35] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[00:35:56] (PS1) Jcrespo: mariadb: Depool db1055 because hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/374062
[00:38:58] (CR) Jcrespo: [C: 2] mariadb: Depool db1055 because hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/374062 (owner: Jcrespo)
[00:40:29] (Merged) jenkins-bot: mariadb: Depool db1055 because hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/374062 (owner: Jcrespo)
[00:40:39] (CR) jenkins-bot: mariadb: Depool db1055 because hardware issues [mediawiki-config] - https://gerrit.wikimedia.org/r/374062 (owner: Jcrespo)
[00:42:31] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051, hw issues, may get lag (duration: 00m 44s)
[00:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:52] !log correction last log s/db1051/db1055/
[00:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:02] Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3556378 (jcrespo)
[00:50:39] Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3556391 (jcrespo) db1055 depooled for performance reasons https://gerrit.wikimedia.org/r/374062
[01:03:35] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[01:04:25] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1503795853 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 5172369 keys, up 4 minutes 11 seconds - replication_delay is 1503795853
[01:05:16] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 5168749 keys, up 5 minutes 7 seconds - replication_delay is 0
[01:14:16] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:33:35] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[01:42:35] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:25:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 710.82 seconds
[03:32:55] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:35] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:41:48] (PS1) GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264)
[03:55:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 62.27 seconds
[04:00:56] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[04:01:26] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[05:30:55] !log Force BBU relearn on db1055 - T174265
[05:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:09] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[05:34:04] Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3556421 (Marostegui) I will force a re-learn cycle on this host to see if the BBU comes back to optimal. Anyhow, @Cmjohnson can we use a BBU of the servers that are ready to be dec...
[06:03:36] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[06:04:18] Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3556425 (Marostegui) After the re-learn the BBU is back to Optimal and the RAID back to WB: ``` root@db1055:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: B...
[06:23:35] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[06:27:45] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:45] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.009 second response time
[07:23:36] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[08:23:45] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[09:05:08] Operations, media-storage: Two cases of local-multiwrite storage backend failure - https://phabricator.wikimedia.org/T174269#3556512 (Ladsgroup)
[09:06:36] Operations, media-storage: Two cases of local-multiwrite storage backend failure - https://phabricator.wikimedia.org/T174269#3556526 (Ladsgroup)
[10:33:45] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:53:45] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[11:23:36] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2033270
[11:32:36] (PS1) MarcoAurelio: SVG logo for es.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/374065 (https://phabricator.wikimedia.org/T170604)
[11:48:59] (PS1) Urbanecm: Allow sysops to grant/remove transwiki user group [mediawiki-config] - https://gerrit.wikimedia.org/r/374066 (https://phabricator.wikimedia.org/T174226)
[12:03:45] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[12:04:28] (CR) MarcoAurelio: Allow sysops to grant/remove transwiki user group (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/374066 (https://phabricator.wikimedia.org/T174226) (owner: Urbanecm)
[12:23:45] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[12:43:45] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[13:03:45] PROBLEM - MegaRAID on db1055 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[13:57:03] (PS1) Urbanecm: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/374071 (https://phabricator.wikimedia.org/T150618)
[14:13:45] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 4051
[14:21:27] (PS2) Urbanecm: Allow sysops to grant/remove transwiki user group in dtywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374066 (https://phabricator.wikimedia.org/T174226)
[14:21:34] (CR) Urbanecm: "Fixed, thank you." [mediawiki-config] - https://gerrit.wikimedia.org/r/374066 (https://phabricator.wikimedia.org/T174226) (owner: Urbanecm)
[14:49:32] (CR) Framawiki: [C: 1] throttle.php: Separate the throttling definitions from the exception values itself [mediawiki-config] - https://gerrit.wikimedia.org/r/373695 (https://phabricator.wikimedia.org/T167040) (owner: Urbanecm)
[15:33:45] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[15:46:41] !log upload kubernetes_1.4.6-7 to apt.wikimedia.org/jessie-wikimedia/main T170346
[15:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:57] T170346: Kubernetes man pages missing from WMF packages - https://phabricator.wikimedia.org/T170346
[15:50:45] PROBLEM - DPKG on kubernetes1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:52:45] RECOVERY - DPKG on kubernetes1001 is OK: All packages OK
[17:00:11] (CR) MarcoAurelio: [C: 1] Allow sysops to grant/remove transwiki user group in dtywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374066 (https://phabricator.wikimedia.org/T174226) (owner: Urbanecm)
[17:28:32] (PS1) Samtar: Make both LoginNotify email features default for Hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374082 (https://phabricator.wikimedia.org/T174263)
[18:01:51] (CR) Framawiki: [C: 1] Make both LoginNotify email features default for Hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374082 (https://phabricator.wikimedia.org/T174263) (owner: Samtar)
[18:12:58] (CR) Urbanecm: [C: 1] Make both LoginNotify email features default for Hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374082 (https://phabricator.wikimedia.org/T174263) (owner: Samtar)
[18:18:35] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[18:20:56] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[18:21:26] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[18:23:56] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[18:28:46] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200)
[18:29:35] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[18:29:46] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[18:31:46] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[18:33:55] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received
[18:35:15] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received
[18:35:45] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received
[18:38:56] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[18:39:55] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[18:43:05] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[18:43:55] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[18:44:05] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[18:44:05] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[18:44:15] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[18:47:16] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 502 (expecting: 200)
[18:48:16] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[20:21:31] (CR) Luke081515: Automatically include commons and wikidata in $wmgThrottlingExceptions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) (owner: Urbanecm)
[20:24:19] (CR) Framawiki: [C: 1] Automatically include commons and wikidata in $wmgThrottlingExceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) (owner: Urbanecm)
[20:27:54] (PS7) ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849)
[20:28:15] (CR) jerkins-bot: [V: -1] write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[20:30:32] (PS8) ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849)
[20:32:58] (CR) ArielGlenn: [C: 2] write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[20:34:51] !log ariel@tin Started deploy [dumps/dumps@39f9b52]: write output files to temp location and move into place when complete
[20:34:54] !log ariel@tin Finished deploy [dumps/dumps@39f9b52]: write output files to temp location and move into place when complete (duration: 00m 02s)
[20:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:39] Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3557115 (ArielGlenn) This is now deployed. The Sept 1 dumps will use this code.
[21:22:55] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2014703
[22:02:55] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 84
[22:27:34] (Abandoned) Paladox: Enabled Ogg Opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[22:45:00] (CR) Platonides: [C: -1] "I don't think we should be setting per-wiki defaults for these preferences that should be global. See T174263#3557302" [mediawiki-config] - https://gerrit.wikimedia.org/r/374082 (https://phabricator.wikimedia.org/T174263) (owner: Samtar)
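
Note on the 20:27-20:35 dumps change ("write dump output files to temporary location, move in place when done"): the log does not show the patch itself, so what follows is only a generic sketch of that pattern, not the code in the dumps repository. The function name `write_then_move` and the `.inprogress-` prefix are hypothetical. It assumes the temporary file is created in the same directory as the final file, so the rename never crosses a filesystem boundary.

```python
import os
import tempfile


def write_then_move(final_path: str, data: bytes) -> None:
    """Write data to a temporary file beside final_path, then rename it into place.

    Hypothetical helper for illustration; readers of final_path see either
    the old file or the complete new one, never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".inprogress-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, final_path)  # overwrites any existing final_path
    except BaseException:
        # Do not leave a half-written temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling `write_then_move("/tmp/example-dump.xml", b"<page/>")` would leave only the finished file in the directory. A consumer that lists the directory during the write sees at most a dot-prefixed temporary name, which is the usual reason for choosing this pattern over writing to the final path directly.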