[00:03:29] (03PS2) 10Dzahn: Run l10nupdate monday to thursday [puppet] - 10https://gerrit.wikimedia.org/r/350749 (https://phabricator.wikimedia.org/T164035) (owner: 10Reedy) [00:04:40] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3365971 (10Tbayer) >>! In T165511#3362318, @Volker_E wrote: > That's what I expected. The shortlink didn't seem to be reason for the error. > As I've said, I didn't have... [00:04:55] (03CR) 10Dzahn: [C: 032] Run l10nupdate monday to thursday [puppet] - 10https://gerrit.wikimedia.org/r/350749 (https://phabricator.wikimedia.org/T164035) (owner: 10Reedy) [00:06:25] !log tin (deployment): manually remove l10nupdate cron, let puppet re-create it after gerrit:350749. stops l10nupdate cron from running on weekends. naos didn't need an action. (T164035). [00:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:34] T164035: Set l10nupdate cron to run Mon-Thursday - https://phabricator.wikimedia.org/T164035 [00:11:28] (03CR) 10Dzahn: [C: 031] Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 (owner: 10Muehlenhoff) [00:13:45] (03CR) 10Dzahn: "still desired just like before?" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [00:14:35] (03CR) 10Dzahn: "@traffic does it seem ok?" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [00:18:31] (03PS1) 10Dzahn: install_server: switch planet2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/360584 [00:19:21] (03PS2) 10Dzahn: install_server: switch planet2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/360584 [00:22:39] (03CR) 10Dzahn: [C: 032] install_server: switch planet2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/360584 (owner: 10Dzahn) [00:24:48] !log planet2001 - scheduled downtime, reinstall with stretch [00:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:58] !log planet2001 - revoke old puppet cert, salt-key, re-add new cert/key after reinstall [00:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:30] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 6502 [00:50:23] (03PS1) 10Bearloga: Add info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) [00:51:28] (03PS2) 10Bearloga: Add info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) [01:03:00] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:04:00] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8797890 keys, up 3 minutes 53 seconds - replication_delay is 0 [01:14:48] (03PS1) 10Dzahn: planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 [01:16:11] (03CR) 10jerkins-bot: [V: 04-1] planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 (owner: 10Dzahn) [01:19:41] (03PS2) 10Dzahn: planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 
10https://gerrit.wikimedia.org/r/360596 [01:20:26] (03PS3) 10Dzahn: planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 [01:22:56] 10Operations: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366199 (10Dzahn) [01:23:30] (03PS4) 10Dzahn: planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 (https://phabricator.wikimedia.org/T168490) [01:23:40] 10Operations, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366215 (10Dzahn) https://gerrit.wikimedia.org/r/360584 [01:24:20] 10Operations, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366216 (10Dzahn) [01:25:23] (03CR) 10Dzahn: [C: 032] planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [01:25:37] (03PS5) 10Dzahn: planet: planet-venus not in stretch, try using "rawdog" instead [puppet] - 10https://gerrit.wikimedia.org/r/360596 (https://phabricator.wikimedia.org/T168490) [01:26:32] 10Operations, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366217 (10Dzahn) p:05Triage>03Normal [01:28:20] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [01:32:57] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3366219 (10Racso) Hello there. Today, I found that some e-mails that arrived to the wikimedia-co list may be related to this issue. All of them come from the -owner addresses of ot... [01:40:01] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.59 seconds [01:40:20] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.40 seconds [01:40:20] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.59 seconds [01:40:20] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.59 seconds [01:40:30] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.18 seconds [01:40:50] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.27 seconds [01:42:20] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:44:00] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 13.57 seconds [01:44:20] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [01:44:20] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [01:44:20] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [01:44:30] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [01:44:50] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [01:45:04] !log planet1001 - remove php5 package [01:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:06] (03PS1) 10Dzahn: planet: do not install libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/360597 (https://phabricator.wikimedia.org/T168490) [01:50:48] (03PS2) 10Dzahn: planet: do not install libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/360597 (https://phabricator.wikimedia.org/T168490) [01:51:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 27 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:56:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 13 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:10:20] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [02:12:20] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:12:20] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:10] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [02:13:10] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [02:20:01] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 16 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Package[libapache2-mod-php5],File[/usr/share/planet-venus/wikimedia],File[/usr/share/planet-venus/theme/wikimedia],File[/usr/share/planet-venus/theme/common] [02:26:00] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.5) (duration: 06m 52s) [02:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:11] (03PS3) 10Dzahn: planet: do not install libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/360597 (https://phabricator.wikimedia.org/T168490) [02:42:36] (03CR) 10Dzahn: [C: 032] planet: do not install libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/360597 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [02:50:38] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.6) (duration: 06m 06s) [02:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:45] (03PS1) 10Dzahn: planet: disable codfw backend temporarily [puppet] - 10https://gerrit.wikimedia.org/r/360598 (https://phabricator.wikimedia.org/T168490) [02:56:33] (03PS2) 10Dzahn: planet: disable codfw backend temporarily [puppet] - 10https://gerrit.wikimedia.org/r/360598 (https://phabricator.wikimedia.org/T168490) [02:57:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 21 02:57:19 UTC 2017 (duration 6m 41s) [02:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:30] (03PS3) 10Dzahn: planet: disable codfw backend temporarily [puppet] - 10https://gerrit.wikimedia.org/r/360598 (https://phabricator.wikimedia.org/T168490) [02:59:10] (03CR) 10Dzahn: [C: 032] planet: disable codfw backend temporarily [puppet] - 10https://gerrit.wikimedia.org/r/360598 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [03:03:29] !log planet1001 - remove/purge all php5* packages [03:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:34] 10Operations, 10Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3366242 (10EBjune) @Dzahn I think he just needs access to the analytics-users group, which I approve. [03:13:28] !log planet - copying HTML files from docroot from planet1001 to planet2001 - (don't serve Debian default page) [03:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:30] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 73689 [03:26:43] (03PS1) 10Dzahn: planet: install python-tidylib for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360599 (https://phabricator.wikimedia.org/T168490) [03:30:16] (03CR) 10Dzahn: [C: 032] planet: install python-tidylib for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360599 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [03:33:20] ACKNOWLEDGEMENT - puppet last run on planet2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 second ago with 3 failures. 
Failed resources (up to 3 shown): File[/usr/share/planet-venus/wikimedia],File[/usr/share/planet-venus/theme/wikimedia],File[/usr/share/planet-venus/theme/common] daniel_zahn https://phabricator.wikimedia.org/T168490 [04:06:46] (03PS1) 10Dzahn: planet: add support/compat for stretch and rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360600 (https://phabricator.wikimedia.org/T168490) [04:08:05] (03CR) 10jerkins-bot: [V: 04-1] planet: add support/compat for stretch and rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360600 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [04:16:53] (03PS2) 10Dzahn: planet: add support/compat for stretch and rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360600 (https://phabricator.wikimedia.org/T168490) [04:21:48] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6826/" [puppet] - 10https://gerrit.wikimedia.org/r/360600 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [04:21:57] (03PS3) 10Dzahn: planet: add support/compat for stretch and rawdog [puppet] - 10https://gerrit.wikimedia.org/r/360600 (https://phabricator.wikimedia.org/T168490) [04:28:10] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/planet-venus/wikimedia/theme/common/images/planet-wm2.png] [04:34:51] (03PS1) 10Dzahn: planet: fix logo path and duplicate declare of config dir [puppet] - 10https://gerrit.wikimedia.org/r/360601 (https://phabricator.wikimedia.org/T168490) [04:36:07] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366279 (10Dzahn) [04:42:10] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3366282 (10Nuria) >These metrics are 429s emitted from RESTBase, and not Varnish. Right, that is why we should continue to see throttling on the rest base end. Do take a second... 
[04:46:43] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6827/" [puppet] - 10https://gerrit.wikimedia.org/r/360601 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [04:48:20] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:49:51] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) [04:50:26] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366332 (10Dzahn) [04:50:29] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3366331 (10Dzahn) [04:50:54] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) [04:50:56] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366333 (10Dzahn) [04:51:58] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) [04:52:00] 10Operations, 10HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3366335 (10Dzahn) [04:53:03] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366337 (10Dzahn) [04:54:08] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10Dzahn) [05:00:00] 10Operations: tracking task: get rid of trusty (in prod) - https://phabricator.wikimedia.org/T168495#3366340 (10Dzahn) [05:05:54] 10Operations: tracking task: get rid of trusty (in prod) - https://phabricator.wikimedia.org/T168495#3366340 (10Dzahn) maybe one of these 2 tracking things can also be used as a goal or use some goal-* tag some time? [05:09:24] 10Operations: tracking task: get rid of Ubuntu (trusty) (in prod) - https://phabricator.wikimedia.org/T168495#3366373 (10Dzahn) [05:15:46] (03CR) 10Chad: Gerrit: Makes sure review_site/lib exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/360412 (owner: 10Paladox) [05:20:39] 10Operations, 10Gerrit, 10Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3366375 (10demon) Hmm, all this started after we tried swapping SysV init for systemd. Funny how that correlates 🤔 😏 [05:23:17] 10Operations, 10Gerrit, 10Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3366376 (10Dzahn) Better to raise it (https://gerrit.wikimedia.org/r/#/c//1) than not raise it. I am happy to build the new deb... [05:25:23] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T168360#3364632" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [05:25:55] (03CR) 10Dzahn: [C: 031] "https://phabricator.wikimedia.org/T168360#3364632" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360) (owner: 10Paladox) [05:30:35] (03CR) 10Dzahn: "any comments on this from db people maybe?" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [05:32:12] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:12] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:32:20] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:20] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:20] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:20] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:21] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:40] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:50] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:50] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:51] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:51] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:10] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:11] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:11] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:11] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:11] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:34:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [05:34:11] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:34:30] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:34:50] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:34:51] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:34:51] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:34:51] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:35:00] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:00] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:35:00] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:35:00] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:01] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:35:01] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:10] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:35:10] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:10] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:35:10] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:10] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:35:14] dbstore1001 still alive. yea [05:37:40] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:37:40] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:39:17] mutante: yeah, timeouts from the backups [05:39:29] I will silence it [05:39:41] ok, good. 
thanks [05:41:17] !log Start relearn BBU cycle on db1016 - T166344 [05:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:29] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344 [05:46:14] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360602 [05:46:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360602 [05:47:33] (03PS1) 10Dzahn: planet: fix invalid relationship in cronjob, dirs on stretch [puppet] - 10https://gerrit.wikimedia.org/r/360603 (https://phabricator.wikimedia.org/T168490) [05:48:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360602 (owner: 10Marostegui) [05:48:20] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:48:31] (03CR) 10jerkins-bot: [V: 04-1] planet: fix invalid relationship in cronjob, dirs on stretch [puppet] - 10https://gerrit.wikimedia.org/r/360603 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [05:48:33] (03CR) 10Dzahn: [C: 031] "deployment of this would be appreciated but has absolutely not a high priority, just "normal"." [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [05:49:15] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3366397 (10Marostegui) After issuing the relearn, the raid is back to WB: ``` ˜/icinga-wm 7:48> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` [05:49:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360602 (owner: 10Marostegui) [05:49:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360602 (owner: 10Marostegui) [05:50:23] (03PS2) 10Dzahn: planet: fix invalid relationship in cronjob, dirs on stretch [puppet] - 10https://gerrit.wikimedia.org/r/360603 (https://phabricator.wikimedia.org/T168490) [05:50:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1021 - T166205 (duration: 01m 00s) [05:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:45] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [05:52:18] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360604 [05:52:21] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360604 [05:54:51] !log Deploy alter table s5 - labsdb1011 - T166207 [05:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:00] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [05:55:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360604 (owner: 10Marostegui) [05:56:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360604 (owner: 10Marostegui) [05:56:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360604 
(owner: 10Marostegui) [05:57:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T166207 (duration: 00m 44s) [05:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:50] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:59:59] !log reboot stat100[2,3,4] for kernel upgrades [06:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360605 (https://phabricator.wikimedia.org/T166207) [06:02:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360605 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [06:03:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360605 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [06:03:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360605 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [06:04:41] !log Deploy alter table s5 - dbstore1002 - T166207 [06:04:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T166207 (duration: 00m 44s) [06:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:51] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [06:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:41] !log Deploy alter table s5 - db1082 - T166207 [06:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3654.00 Read Requests/Sec=2343.10 Write Requests/Sec=624.50 KBytes Read/Sec=36170.40 KBytes_Written/Sec=8977.60 [06:06:45] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6828/" [puppet] - 10https://gerrit.wikimedia.org/r/360603 (https://phabricator.wikimedia.org/T168490) (owner: 10Dzahn) [06:08:12] !log reboot thorium for kernel upgrades (outage to all the analytics websites) [06:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:20] PROBLEM - Apache HTTP on mw2136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:10:33] (03PS1) 10Dzahn: planet: fix another duplicate declaration [puppet] - 10https://gerrit.wikimedia.org/r/360606 [06:10:59] mutante: o/ [06:11:10] RECOVERY - Apache HTTP on mw2136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.118 second response time [06:11:51] Jun 21 06:09:35 mw2136 systemd[1]: hhvm.service: main process exited, code=killed, status=9/KILL [06:12:10] elukey: eh, hi:) what did you mean of those things [06:13:01] (03PS1) 10Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360607 (https://phabricator.wikimedia.org/T168354) [06:13:31] (03CR) 10Dzahn: [C: 032] "already declared in dirs.pp, remove in init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/360606 (owner: 10Dzahn) [06:13:54] ah Jun 21 06:08:01 mw2136 CRON[34165]: (root) CMD (/usr/local/bin/hhvm-needs-restart > /dev/null && /usr/local/sbin/run-no-puppet /usr/local/bin/restart-hhvm > /dev/null) [06:14:12] so mw2136 all good, restart 
script taking a bit [06:14:15] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360607 (https://phabricator.wikimedia.org/T168354) (owner: 10Marostegui) [06:14:35] mutante: ah sorry I just wanted to say hello, no other meanings :) [06:15:09] elukey: ok, wasn't sure if you were pointing at a specific line there :) hello too, cheers [06:15:23] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360607 (https://phabricator.wikimedia.org/T168354) (owner: 10Marostegui) [06:15:50] and now i want to see one icinga-wm recovery .. [06:15:57] (03CR) 10jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360607 (https://phabricator.wikimedia.org/T168354) (owner: 10Marostegui) [06:16:14] for puppet run on planet2001 or something, and then i will go afk :) come on icinga-wm [06:16:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.40 Read Requests/Sec=0.40 Write Requests/Sec=9.70 KBytes Read/Sec=1.60 KBytes_Written/Sec=161.60 [06:16:34] and that too. these are always spikes when list mail goes out [06:18:09] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2066 - T168354 (duration: 00m 43s) [06:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:19] T168354: dbstore2001 s5 thread is 6 days delayed - https://phabricator.wikimedia.org/T168354 [06:19:15] !log Stop replication and puppet on db2066 for maintenance - T168354 [06:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:40] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.25 seconds [06:19:57] ^ that is expected - I will silence it [06:23:42] !log planet2001 wget missing unpuppetized logo file from https://en.planet.wikimedia.org/images/planet-wm2.png - should fix puppet run [06:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:00] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:24:53] ok :) and good night, ciao elukey. good luck with upgrades, out :) [06:27:50] RECOVERY - puppet last run on labtestpuppetmaster2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:33:38] 10Operations, 10Traffic, 10netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3366437 (10Marostegui) We need to make sure we downtime the following DBs in EQIAD as they have cross replication with some of the dbs affected here, so we can avoid pages like we had yesterday for cross... [06:45:31] (03PS1) 10Muehlenhoff: Remove access for jgirault [puppet] - 10https://gerrit.wikimedia.org/r/360609 [06:45:40] PROBLEM - Host elastic1023 is DOWN: PING CRITICAL - Packet loss = 100% [06:46:00] RECOVERY - Host elastic1023 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [06:46:41] (03CR) 10Muehlenhoff: [C: 032] Remove access for jgirault [puppet] - 10https://gerrit.wikimedia.org/r/360609 (owner: 10Muehlenhoff) [06:47:23] (03CR) 10Marostegui: "I am checking db1011 (tendril db master) and I am seeing two users for those two hosts already there, with different grants than the ones " [puppet] - 10https://gerrit.wikimedia.org/r/359373 (https://phabricator.wikimedia.org/T149557) (owner: 10Dzahn) [06:50:00] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:50:10] checking --^ [06:52:10] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:01:06] (03PS1) 10Muehlenhoff: Fix group membership list analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/360611 [07:01:40] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:02:45] (03CR) 10Muehlenhoff: [C: 032] Fix group membership list analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/360611 (owner: 10Muehlenhoff) [07:03:36] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3102305 (10Bawolff) So i guess someone is sending spam subject lines to wikimedia-gh, with a forged from address of wikimedia-co@lists.wikimedia.org, in order for the mailing list... [07:08:10] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:08:10] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:09:01] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:10:21] moritzm: I'm forcing a puppet run on those [07:10:30] already doing that [07:10:42] ah ok [07:11:00] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:11:02] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:11:02] should recover soon [07:11:05] can I ask with which command? :-P [07:11:11] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:11:59] salt puppet.run, I'll stop using that next quarter :-) [07:12:10] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:12:10] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:12:37] moritzm: but you had to copy/paste all the hosts I guess... it would have been as simple as: [07:12:40] sudo cumin 'R:Group = analytics-privatedata-users' 'run-puppet-agent' [07:12:40] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:12:44] * volans advertising ;) [07:13:52] 10Operations, 10Traffic, 10netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3366501 (10Joe) [07:14:11] 10Operations, 10Traffic, 10netops, 10User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365105 (10Joe) [07:17:40] nice [07:28:45] moritzm: while for the general case where there isn't a common factor to query, see https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ;) [07:29:24] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3366527 (10Joe) Hey sorry I was on vacation last week. The IPs of our proxies are: webproxy.eqiad.wmnet -... 
[07:29:26] * volans stop advertising [07:30:10] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:00] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 74538 bytes in 0.319 second response time [07:33:06] (03PS3) 10Gehel: Add info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) (owner: 10Bearloga) [07:37:11] !log Stop and reset slave s5 on dbstore2001 - T168354 [07:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:21] T168354: dbstore2001 s5 thread is 6 days delayed - https://phabricator.wikimedia.org/T168354 [07:37:27] 10Operations, 10Incident-20150423-Commons, 10MediaWiki-API, 10Parsoid, and 7 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#3366546 (10Joe) >>! In T97192#3365034, @GWicke wrote: > @Joe, has this been fixed with 3.18... [07:38:04] (03CR) 10Gehel: [C: 031] "LGTM. @bearloga: since this depends on a few other patches, let me know when you are ready to deploy so that we can push all this in a coo" [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) (owner: 10Bearloga) [07:41:20] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag not a slave [07:45:00] PROBLEM - Host elastic1024 is DOWN: PING CRITICAL - Packet loss = 100% [07:48:50] RECOVERY - Host elastic1024 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [08:16:24] (03PS32) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (https://phabricator.wikimedia.org/T167871) [08:18:26] (03CR) 10Gehel: [C: 032] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (https://phabricator.wikimedia.org/T167871) (owner: 10Gehel) [08:19:12] gehel: was it you on elastic1024 above? [08:19:34] volans: yes, cluster restart in progress, and downtime just a few minutes too short :( [08:19:38] sorry for the noise [08:20:33] np, I just didn't saw a SAL for today's restart starting ;) [08:20:58] volans: it actually started yesterday, and is planned on running for the whole week :) [08:21:26] I could log it again each day :) [08:22:04] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360614 [08:22:13] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3366636 (10akosiaris) >>! In T165105#3366527, @Joe wrote: > Hey sorry I was on vacation last week. > > The... 
[08:22:29] naah it's ok :D [08:23:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360614 [08:24:56] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360614 (owner: 10Marostegui) [08:26:24] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360614 (owner: 10Marostegui) [08:26:31] (03PS2) 10Alexandros Kosiaris: Remove ganglia aggregator from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/351260 [08:26:34] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360614 (owner: 10Marostegui) [08:26:36] !log reimage ms-be1014 / 1015 with jessie [08:26:42] (03PS3) 10Alexandros Kosiaris: Remove ganglia aggregator from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/351260 [08:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove ganglia aggregator from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/351260 (owner: 10Alexandros Kosiaris) [08:27:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 - T166207 (duration: 00m 45s) [08:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:00] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [08:29:07] (03PS1) 10Gehel: maps - storage ID for maps-test is v5 [puppet] - 10https://gerrit.wikimedia.org/r/360615 (https://phabricator.wikimedia.org/T167871) [08:29:15] (03PS2) 10Gehel: maps - storage ID for maps-test is v5 [puppet] - 10https://gerrit.wikimedia.org/r/360615 (https://phabricator.wikimedia.org/T167871) [08:29:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360616 (https://phabricator.wikimedia.org/T166207) [08:30:43] (03CR) 10Gehel: [C: 032] maps - storage ID for maps-test is v5 [puppet] - 10https://gerrit.wikimedia.org/r/360615 (https://phabricator.wikimedia.org/T167871) (owner: 10Gehel) [08:30:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360616 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [08:32:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360616 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [08:32:35] (03PS1) 10Giuseppe Lavagetto: profile::pybal: properly distribute reads from etcd [puppet] - 10https://gerrit.wikimedia.org/r/360617 [08:32:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360616 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [08:32:54] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3366645 (10fgiunchedi) [08:33:15] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3209252 (10fgiunchedi) 05Open>03Resolved All done, 1019 BBU was swapped yesterday by @Cmjohnson [08:33:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T166207 (duration: 00m 44s) [08:33:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:52] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [08:33:55] (03CR) 10jerkins-bot: [V: 04-1] profile::pybal: properly distribute reads from etcd [puppet] - 10https://gerrit.wikimedia.org/r/360617 (owner: 10Giuseppe Lavagetto) [08:34:38] !log Deploy alter table db1070 s5 - T166207 [08:34:42] !log reboot kafka1012 for kernel upgrades [08:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:32] (03PS2) 10Giuseppe Lavagetto: profile::pybal: properly distribute reads from etcd [puppet] - 10https://gerrit.wikimedia.org/r/360617 [08:38:34] <_joe_> ema, FYI ^^ [08:38:51] <_joe_> ema: I'll have to do a rolling restart of all LVSs afterwards [08:38:59] <_joe_> sorry, just of pybal, actually [08:39:32] _joe_: https://phabricator.wikimedia.org/P5606 [08:39:58] <_joe_> ema: it would be interesting to see how many hosts have that error [08:40:01] <_joe_> let me do that [08:40:27] lvs2006 also has it for sure, creating a paste with the full error message now [08:41:07] 10Operations, 10ops-eqiad, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw1236 / mw1237 to thumbor1003 / thumbor1004 - https://phabricator.wikimedia.org/T168297#3366667 (10fgiunchedi) a:05fgiunchedi>03Cmjohnson [08:41:51] lvs2006: https://phabricator.wikimedia.org/P5607 [08:42:53] <_joe_> ema: only on those two hosts AFAICS [08:43:04] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::pybal: properly distribute reads from etcd [puppet] - 10https://gerrit.wikimedia.org/r/360617 (owner: 10Giuseppe Lavagetto) [08:43:14] <_joe_> anyways, merging the puppet change [08:43:22] <_joe_> and then I'll rolling restart all of them [08:46:52] (03PS1) 10Gehel: Revert "maps - storage ID for maps-test is v5" [puppet] - 10https://gerrit.wikimedia.org/r/360619 [08:48:00] (03CR) 10Ema: [C: 031] Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 (owner: 10Giuseppe Lavagetto) [08:48:20] (03CR) 10Gehel: [C: 032] Revert "maps - storage ID for maps-test is v5" [puppet] - 10https://gerrit.wikimedia.org/r/360619 (owner: 10Gehel) [08:48:21] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [08:48:48] <_joe_> uh? [08:48:54] Error: Could not find any hostgroup matching 'maps-test_codfw' (config file '/etc/icinga/puppet_hosts.cfg', starting on line 10675) [08:48:54] Error processing object config files! [08:48:55] <_joe_> ^^ can someone check icinga? [08:48:57] gehel: ^ [08:49:00] <_joe_> oh ok [08:49:03] damn, yes, this is me... [08:49:03] <_joe_> that was fast :P [08:49:07] I'm on it! [08:49:12] thanks [08:49:19] <_joe_> !log restarting etcd on lvs2003/2006, connection lost to etcd [08:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:31] <_joe_> !log correction: restarting pybal [08:49:32] 10Operations, 10Tracking: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366675 (10hashar) [08:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:51] (03PS1) 10Gehel: maps / icinga - add the maps-test_codfw monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/360620 (https://phabricator.wikimedia.org/T167871) [08:51:17] akosiaris: could I ask your review on ^? 
[08:51:45] * akosiaris looking [08:51:48] I did the obvious, but I'd like to not be missing something again... [08:52:10] PROBLEM - Host elastic1025 is DOWN: PING CRITICAL - Packet loss = 100% [08:52:33] hi Any news about https://phabricator.wikimedia.org/T168374 ? [08:52:35] elastic1025 is probably me... [08:52:51] (03CR) 10Alexandros Kosiaris: [C: 032] maps / icinga - add the maps-test_codfw monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/360620 (https://phabricator.wikimedia.org/T167871) (owner: 10Gehel) [08:52:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=restbase2001.codfw.wmnet,dc=codfw,service=restbase [08:53:02] akosiaris: thanks! [08:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:10] RECOVERY - Host elastic1025 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [08:53:13] merged [08:53:46] (03PS1) 10Filippo Giunchedi: Move ms-be10[01-12] to spare systems for decom [puppet] - 10https://gerrit.wikimedia.org/r/360621 (https://phabricator.wikimedia.org/T166489) [08:55:08] akosiaris: I'm running puppet on einsteinium and checking config [08:58:22] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [09:02:18] 10Operations, 10Tracking: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (10MoritzMuehlenhoff) I doubt this tracking bug is going to be particularly useful, there'll be lots of clusters/systems migrated without updating this bug. This is mostly useful when you're tracking... [09:06:37] !log rebooting restbase1017 for kernel update [09:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:11] !log reboot kafka2001 for kernel update (eventbus codfw) [09:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:51] Cc: moritzm [09:14:53] argh [09:15:02] Cc: mobrovac (kafka2001 reboot) [09:15:04] sorry :) [09:22:18] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6829/" [puppet] - 10https://gerrit.wikimedia.org/r/360621 (https://phabricator.wikimedia.org/T166489) (owner: 10Filippo Giunchedi) [09:22:24] (03PS2) 10Filippo Giunchedi: Move ms-be10[01-12] to spare systems for decom [puppet] - 10https://gerrit.wikimedia.org/r/360621 (https://phabricator.wikimedia.org/T166489) [09:25:00] (03CR) 10Filippo Giunchedi: [C: 032] Move ms-be10[01-12] to spare systems for decom [puppet] - 10https://gerrit.wikimedia.org/r/360621 (https://phabricator.wikimedia.org/T166489) (owner: 10Filippo Giunchedi) [09:32:50] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:43:29] 10Operations, 10Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3366738 (10Volans) We should also investigate other available tools in the container space, for example one recently released is https://github... [09:47:23] 10Operations, 10Discovery, 10Interactive-Sprint, 10Maps (Maps-data): Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#3366755 (10Gehel) As part of T167871, the standard monitoring of postgres and redis are now applied, which should be... 
[09:48:14] 10Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3366760 (10Marostegui) a:03Marostegui [09:48:15] !log reboot aqs1005 for kernel update [09:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:44] !log reboot lvs[1010-1012] for kernel update [09:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:42] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [09:55:33] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [09:56:44] 10Operations, 10hardware-requests, 10User-fgiunchedi: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489#3366807 (10fgiunchedi) a:05fgiunchedi>03None [09:57:21] 10Operations, 10hardware-requests, 10User-fgiunchedi: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489#3297629 (10fgiunchedi) @Dzahn machines are marked as spares now and good to be decom'd /cc @RobH [10:01:08] !log rebooting auth* servers for kernel update [10:01:16] !log reboot analytics1002 (Hadoop master standby) for kernel update [10:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:02] !log reboot lvs[1004-1006] (eqiad secondaries) for kernel update [10:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:51] (03PS1) 10Alexandros Kosiaris: Switch oresrdb.svc.codfw.wmnet to oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/360628 [10:10:15] (03CR) 10Alexandros Kosiaris: [C: 032] Switch oresrdb.svc.codfw.wmnet to oresrdb2002 [dns] - 10https://gerrit.wikimedia.org/r/360628 (owner: 10Alexandros Kosiaris) [10:17:28] !log running a script in tmux on rdb[12]003 called "check" to dump periodically LLEN enwiki:jobqueue:enqueue:l-unclaimed and stopped the one on rdb2004 [10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:10] !log reboot lvs[1001-1003] (eqiad primaries) for kernel update [10:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:30] 10Operations, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3366865 (10MoritzMuehlenhoff) Yes, that is used by the corp LDAP replica in role::openldap::corp [10:27:30] (03CR) 10Jhernandez: [C: 031] "Related patch that renames the variables https://gerrit.wikimedia.org/r/#/c/360165" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [10:30:08] !log rebooting bast4001 for kernel update [10:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:52] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3366868 (10akosiaris) This is happening now. 
[10:33:39] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3366870 (10Samwalton9) Thanks - I've sent an email to our contacts at Cochrane, who will hopefully be able t... [10:35:28] !log rebooting the entire codfw ganeti cluster for kernel upgrades. Silenced hosts in icinga already. T167643 [10:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:38] T167643: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643 [10:40:12] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [10:41:03] (03PS1) 10Alexandros Kosiaris: Switch codfw puppetdb hosts to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/360629 (https://phabricator.wikimedia.org/T167643) [10:41:15] _joe_: ^ [10:41:18] fyi [10:41:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch codfw puppetdb hosts to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/360629 (https://phabricator.wikimedia.org/T167643) (owner: 10Alexandros Kosiaris) [10:41:35] <_joe_> akosiaris: why? [10:41:49] <_joe_> oh the ticket [10:41:56] _joe_: kernel reboots across the codfw ganeti cluster [10:41:59] I 'll revert later on [10:42:01] <_joe_> right [10:42:03] <_joe_> cool [10:43:41] !log reboot analytics1001 (Hadoop master) for kernel update [10:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:37] 10Operations, 10Dumps-Generation: Reboot snapshot hosts - https://phabricator.wikimedia.org/T168516#3366927 (10MoritzMuehlenhoff) [10:47:06] !log shutdown all VMs on the ganeti01.svc.codfw.wmnet cluster [10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:05] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server acrux.codfw.wmnet because of too many down!: zotero_1969 - Could not depool server sca2004.codfw.wmnet because of too many down! 
[10:49:15] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [10:49:25] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [10:49:25] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:26] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:26] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:26] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:26] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:26] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on cerium is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on xenon is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:36] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:37] PROBLEM - restbase endpoints 
health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:37] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:38] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:40] expected ^ [10:49:55] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:55] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:55] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:55] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:55] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:49:55] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:05] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:06] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:16] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:16] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [10:50:53] !log rebooting all ganeti200X nodes [10:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:49] !log reboot lvs[2004-2006] (codfw secondaries) for kernel update [10:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:34] !log reimage ms-be1018 / 1019 with stretch [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:24] !log starting up all instances on ganeti01.svc.codfw.wmnet [11:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:36] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [11:02:36] RECOVERY - citoid endpoints health on scb2005 is OK: All 
endpoints are healthy [11:02:36] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:02:45] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:02:45] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:02:46] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [11:02:46] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:02:46] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [11:02:55] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [11:02:55] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [11:02:55] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [11:03:05] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [11:03:15] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [11:03:15] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [11:03:15] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:03:16] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [11:03:24] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3366966 (10Amire80) [11:03:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [11:03:26] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [11:03:35] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [11:03:35] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:03:36] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [11:03:36] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [11:03:37] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [11:03:37] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [11:03:38] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [11:04:02] !log rebooting mw1180-mw1188 for kernel update [11:04:05] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:45] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [11:07:06] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [11:11:03] 10Operations, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3366984 (10akosiaris) 
This has been completed successfully. I see 92 users (well bots actually/probably) already in #en.wikipedia and another... [11:12:25] (03PS1) 10Alexandros Kosiaris: Revert "Switch codfw puppetdb hosts to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/360633 [11:13:20] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switch codfw puppetdb hosts to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/360633 (owner: 10Alexandros Kosiaris) [11:13:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Switch codfw puppetdb hosts to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/360633 (owner: 10Alexandros Kosiaris) [11:14:01] !log reboot aqs1006 for kernel update [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:44] hi [11:19:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:28] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:28] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:29] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received [11:19:38] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:48] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:48] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:49] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:49] PROBLEM - mobileapps endpoints 
health on scb1002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:49] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:49] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:49] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:58] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received [11:20:19] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [11:20:28] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:20:28] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:20:28] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:20:28] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:20:48] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:20:48] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:20:48] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:20:48] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [11:20:48] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:20:48] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:20:49] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:20:49] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [11:20:57] what the hell? [11:22:39] I wonder what powers /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd}in mobileapps [11:22:51] aqs maybe ? 
[11:22:53] pretty sure AQS [11:22:54] * akosiaris hopes not [11:23:01] damn [11:23:08] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:23:09] one more interservice dependency to remember [11:23:17] but I just rebooted aqs1006 (and before 1005), not sure why everything exploded [11:23:28] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:23:46] !log reboot ganeti1007 for insertion into ganeti cluster [11:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:55] aqs100[48] http endpoints had issues [11:29:49] so the aqs restbase part on aqs1004/8 returned [11:29:50] There was an error when trying to connect to the host 10.64.48.148 [11:30:05] when I stopped aqs1006 [11:30:18] so the cassandra driver must have not liked it [11:31:41] (03CR) 10Pmiazga: [C: 031] relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [11:38:02] (03PS1) 10Alexandros Kosiaris: Revert "Switch oresrdb.svc.codfw.wmnet to oresrdb2002" [dns] - 10https://gerrit.wikimedia.org/r/360637 [11:38:47] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switch oresrdb.svc.codfw.wmnet to oresrdb2002" [dns] - 10https://gerrit.wikimedia.org/r/360637 (owner: 10Alexandros Kosiaris) [11:40:28] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [11:40:52] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [11:41:12] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100% [11:41:12] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:12] PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:12] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:12] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:22] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:22] PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:22] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:32] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:32] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [11:42:32] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [11:42:32] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:42:33] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:42:33] RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [11:42:44] !log rollback change in asw-a-eqiad for ganeti interface range due to alerts [11:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:02] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [11:45:03] PROBLEM - ganeti-confd running on ganeti1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd [11:45:22] PROBLEM - salt-minion processes on ganeti1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:45:29] !log rebooting mediawiki api servers in codfw for kernel update [11:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:32] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2016897 [11:56:13] (03PS1) 10Hashar: (DO NOT SUBMIT) confftool: remove 
citoid to unbreak beta [puppet] - 10https://gerrit.wikimedia.org/r/360639 (https://phabricator.wikimedia.org/T168519) [11:56:22] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [11:56:22] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:55] (03CR) 10Hashar: [V: 04-1 C: 04-1] "HACK! Applied on beta cluster puppet master to unbreak varnish." [puppet] - 10https://gerrit.wikimedia.org/r/360639 (https://phabricator.wikimedia.org/T168519) (owner: 10Hashar) [11:59:21] !log rebooting mw1209-mw1220 for kernel update [11:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:22] PROBLEM - Check systemd state on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:24] PROBLEM - SSH on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:24] PROBLEM - SSH on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:24] PROBLEM - nutcracker process on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:24] PROBLEM - Disk space on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:24] PROBLEM - configured eth on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:32] PROBLEM - dhclient process on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:32] PROBLEM - salt-minion processes on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:32] PROBLEM - Check size of conntrack table on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:32] PROBLEM - Check whether ferm is active by checking the default input chain on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:32] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:32] PROBLEM - DPKG on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:33] PROBLEM - nutcracker process on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:33] PROBLEM - SSH on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:34] PROBLEM - HHVM processes on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:34] PROBLEM - DPKG on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:35] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:35] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:52] PROBLEM - Disk space on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:52] PROBLEM - dhclient process on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:03:52] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:52] PROBLEM - Nginx local proxy to apache on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:52] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:53] PROBLEM - Check systemd state on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:53] PROBLEM - puppet last run on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:02] PROBLEM - dhclient process on mw1201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:02] PROBLEM - nutcracker process on mw1201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:02] PROBLEM - puppet last run on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:02] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:02] PROBLEM - configured eth on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:02] PROBLEM - Check size of conntrack table on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:03] PROBLEM - Nginx local proxy to apache on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:03] PROBLEM - HHVM processes on mw1201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:04] PROBLEM - puppet last run on mw1201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:04] PROBLEM - puppet last run on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:05] PROBLEM - Nginx local proxy to apache on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:12] PROBLEM - Nginx local proxy to apache on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:12] PROBLEM - nutcracker process on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:12] PROBLEM - Nginx local proxy to apache on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:12] PROBLEM - DPKG on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:12] PROBLEM - SSH on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:13] PROBLEM - Check size of conntrack table on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:13] PROBLEM - nutcracker port on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:14] PROBLEM - dhclient process on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:14] PROBLEM - DPKG on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:15] PROBLEM - nutcracker port on mw1203 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:15] PROBLEM - HHVM processes on mw1202 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:16] PROBLEM - salt-minion processes on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:22] PROBLEM - Disk space on mw1200 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:22] PROBLEM - SSH on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:22] PROBLEM - Disk space on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:22] PROBLEM - Check systemd state on mw1201 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:22] PROBLEM - configured eth on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:04:22] PROBLEM - Check whether ferm is active by checking the default input chain on mw1204 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:23] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:05:02] RECOVERY - nutcracker process on mw1200 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:05:02] RECOVERY - DPKG on mw1204 is OK: All packages OK [12:05:02] RECOVERY - SSH on mw1204 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:05:03] RECOVERY - nutcracker port on mw1204 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:05:03] RECOVERY - salt-minion processes on mw1204 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:05:12] RECOVERY - DPKG on mw1200 is OK: All packages OK [12:05:12] RECOVERY - Disk space on mw1200 is OK: DISK OK [12:05:13] RECOVERY - Disk space on mw1204 is OK: DISK OK [12:05:13] RECOVERY - configured eth on mw1204 is OK: OK - interfaces up [12:05:13] RECOVERY - SSH on mw1202 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:05:13] RECOVERY - Check whether ferm is active by checking the default input chain on mw1204 is OK: OK ferm input default policy is set [12:05:13] RECOVERY - salt-minion processes on mw1200 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:05:14] RECOVERY - nutcracker port on mw1200 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:05:14] RECOVERY - dhclient process on mw1202 is OK: PROCS OK: 0 processes with command name dhclient [12:05:15] RECOVERY - nutcracker process on mw1202 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:05:15] RECOVERY - salt-minion processes on mw1202 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:05:16] RECOVERY - configured eth on mw1201 is OK: OK - interfaces up [12:05:16] RECOVERY - SSH on mw1201 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:05:17] RECOVERY - SSH on mw1203 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:05:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1200 is OK: OK ferm input default policy is set [12:05:32] RECOVERY - Check size of conntrack table on mw1201 is OK: OK: nf_conntrack is 0 % full [12:05:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1203 is OK: OK ferm input default policy is set [12:05:32] RECOVERY - Check size of conntrack table on mw1203 is OK: OK: nf_conntrack is 0 % full [12:05:32] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures [12:05:33] RECOVERY - DPKG on mw1201 is OK: All packages OK [12:05:33] RECOVERY - Check systemd state on mw1200 is OK: OK - running: The system is fully operational [12:05:34] RECOVERY - Check systemd state on mw1203 is OK: OK - running: The system is fully operational [12:05:34] RECOVERY - configured eth on mw1202 is OK: OK - interfaces up [12:05:35] RECOVERY - nutcracker port on mw1201 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:05:35] RECOVERY - HHVM processes on mw1204 is OK: PROCS OK: 6 processes with command name hhvm [12:05:36] RECOVERY - nutcracker port on mw1202 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:05:36] RECOVERY - salt-minion processes on mw1201 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/salt-minion [12:05:37] RECOVERY - HHVM processes on mw1200 is OK: PROCS OK: 6 processes with command name hhvm [12:05:52] RECOVERY - Check systemd state on mw1204 is OK: OK - running: The system is fully operational [12:05:52] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [12:05:52] RECOVERY - configured eth on mw1200 is OK: OK - interfaces up [12:05:52] RECOVERY - Check size of conntrack table on mw1204 is OK: OK: nf_conntrack is 0 % full [12:05:52] RECOVERY - dhclient process on mw1201 is OK: PROCS OK: 0 processes with command name dhclient [12:05:53] RECOVERY - nutcracker process on mw1201 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:05:53] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 27 minutes ago with 0 failures [12:05:54] RECOVERY - HHVM processes on mw1201 is OK: PROCS OK: 6 processes with command name hhvm [12:05:54] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.675 second response time [12:05:55] RECOVERY - Nginx local proxy to apache on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.638 second response time [12:05:55] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [12:06:02] RECOVERY - Nginx local proxy to apache on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 4.552 second response time [12:06:02] RECOVERY - Check size of conntrack table on mw1202 is OK: OK: nf_conntrack is 0 % full [12:06:02] RECOVERY - nutcracker port on mw1203 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:06:03] RECOVERY - dhclient process on mw1203 is OK: PROCS OK: 0 processes with command name dhclient [12:06:03] RECOVERY - HHVM processes on mw1202 is OK: PROCS OK: 6 processes with command name hhvm [12:06:12] RECOVERY - Check systemd state on mw1201 is OK: OK - running: The system is fully operational [12:06:12] RECOVERY - Check systemd state on mw1202 is OK: OK - running: The system is fully operational [12:06:12] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.659 second response time [12:06:22] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.783 second response time [12:06:22] RECOVERY - salt-minion processes on ganeti1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:06:22] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.043 second response time [12:06:22] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 74577 bytes in 1.048 second response time [12:06:32] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 74578 bytes in 6.723 second response time [12:07:42] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74578 bytes in 4.188 second response time [12:07:42] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 74576 bytes in 0.120 second response time [12:07:52] RECOVERY - Nginx local proxy to apache on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.030 second response time [12:07:52] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74576 bytes in 0.143 second response time [12:07:52] RECOVERY - Host etcd1003 is UP: PING WARNING - Packet loss = 80%, 
RTA = 0.61 ms [12:08:02] RECOVERY - Nginx local proxy to apache on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.038 second response time [12:08:02] RECOVERY - Nginx local proxy to apache on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.047 second response time [12:08:03] RECOVERY - Host etcd1002 is UP: PING WARNING - Packet loss = 73%, RTA = 0.37 ms [12:08:12] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.026 second response time [12:08:22] RECOVERY - Host etcd1005 is UP: PING WARNING - Packet loss = 58%, RTA = 0.46 ms [12:08:23] RECOVERY - Host mwdebug1002 is UP: PING WARNING - Packet loss = 44%, RTA = 0.44 ms [12:09:50] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mwdebug1002.eqiad.wmnet [12:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=sca1004.eqiad.wmnet [12:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:42] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:12:32] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.068 second response time [12:16:29] (03PS3) 10Nschaaf: Enable Reader Survey using QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359936 (https://phabricator.wikimedia.org/T131949) [12:23:32] RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:29:22] RECOVERY - Host ganeti1006 is UP: PING WARNING - Packet loss = 73%, RTA = 0.19 ms [12:31:42] PROBLEM - Check whether ferm is active by checking the default input chain on ganeti1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [12:31:42] PROBLEM - Check systemd state on ganeti1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:32:22] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:31] !log rebooting mw1189-mw1199 for kernel update [12:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:44] !log deploying T167871 and restarting kartotherian / tilerator on maps eqiad [12:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:53] T167871: Refactor maps puppet code to the role / profile paradigm - https://phabricator.wikimedia.org/T167871 [12:36:02] PROBLEM - ganeti-noded running on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [12:36:12] PROBLEM - ganeti-mond running on ganeti1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:36:12] PROBLEM - ganeti-mond running on ganeti1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:36:12] PROBLEM - ganeti-confd running on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd [12:36:19] jouncebot: refresh [12:36:21] I refreshed my knowledge about deployments. 
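(The conftool actions logged at [12:09:50] and [12:10:06] are how hosts get repooled once they come back from a reboot. A minimal sketch of the equivalent confctl invocation, assuming the standard confctl CLI on a conftool-enabled host; the selector and attribute come from the SAL entries above, everything else is illustrative.)

    # Repool a host after it is back up and healthy; selector syntax mirrors the SAL entries.
    confctl select 'name=mwdebug1002.eqiad.wmnet' set/pooled=yes
    confctl select 'name=sca1004.eqiad.wmnet' set/pooled=yes
    # Verify the state before moving on to the next host.
    confctl select 'name=mwdebug1002.eqiad.wmnet' get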
[12:36:22] PROBLEM - ganeti-confd running on ganeti1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd [12:36:22] PROBLEM - ganeti-mond running on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond [12:37:02] RECOVERY - ganeti-noded running on ganeti1004 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded [12:37:12] RECOVERY - ganeti-mond running on ganeti1003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:37:12] RECOVERY - ganeti-mond running on ganeti1001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:37:12] RECOVERY - ganeti-confd running on ganeti1004 is OK: PROCS OK: 1 process with UID = 110 (gnt-confd), command name ganeti-confd [12:37:22] RECOVERY - ganeti-confd running on ganeti1001 is OK: PROCS OK: 1 process with UID = 110 (gnt-confd), command name ganeti-confd [12:37:22] RECOVERY - ganeti-mond running on ganeti1004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond [12:39:13] RECOVERY - mediawiki-installation DSH group on mwdebug1002 is OK: OK [12:45:47] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2100806 [12:46:07] RECOVERY - ganeti-confd running on ganeti1007 is OK: PROCS OK: 1 process with UID = 111 (gnt-confd), command name ganeti-confd [12:46:27] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:46:37] RECOVERY - Check whether ferm is active by checking the default input chain on ganeti1006 is OK: OK ferm input default policy is set [12:46:37] RECOVERY - Check systemd state on ganeti1006 is OK: OK - running: The system is fully operational [12:48:33] jouncebot: next [12:48:33] In 0 hour(s) and 11 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170621T1300) [12:48:56] aude: looks like you are the only one for eu swat today, deploying your changes yourself? cc hashar [12:57:42] hi [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170621T1300). Please do the needful. [13:00:04] aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:03:51] Does anyone want to deploy? 
[13:04:10] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360644 [13:04:57] I am around to check test wikidata etc but not in suitable place to deploy myself [13:05:06] Otherwise maybe I can do it later [13:05:56] I am going to quickly deploy that change to db-eqiad.php [13:05:59] !log reboot analytics1003 (Hue, Camus, Oozie, Hive master) for kernel upgrade [13:06:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360644 (owner: 10Marostegui) [13:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:34] 10Operations, 10Gerrit, 10Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3367211 (10Paladox) According to bin/gerrit.sh status this is what the init script has GERRIT_FDS = 12000 (th... [13:07:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360644 (owner: 10Marostegui) [13:07:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360644 (owner: 10Marostegui) [13:08:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 - T166207 (duration: 00m 46s) [13:08:19] Probalby later is good for my thing [13:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:22] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:09:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360645 (https://phabricator.wikimedia.org/T166207) [13:12:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360645 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:13:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360645 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:13:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360645 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:14:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 - T166207 (duration: 00m 44s) [13:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:35] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:15:01] !log Deploy alter table s5 - db1045 - T166207 [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:30] !log Deploy alter table s5 - labsdb1001 - T166207 [13:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:55] !log reboot kafka1013 for kernel updates [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:21] !log Deploy alter table on s7 - directly on codfw master (db2029) - this will generate lag on codfw - T166208 [13:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:29] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [13:28:13] audephone, marostegui: mid-day swat done? 
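(For context on the db-eqiad.php changes above: depooling a replica such as db1045 amounts to removing or commenting out its entry in the relevant section's load array of wmf-config/db-eqiad.php and syncing that file from the deployment host. A rough sketch, assuming scap's sync-file subcommand and the usual staging path on tin; the weights and the exact array layout are placeholders, the real change is gerrit:360645.)

    # On the deployment host (tin), edit the staged config; e.g. the s5 load array
    # might change from
    #     'db1045' => 100,
    # to
    #     # 'db1045' => 100,   # depooled for schema change (T166207)
    # then push the file out:
    cd /srv/mediawiki-staging
    scap sync-file wmf-config/db-eqiad.php 'Depool db1045 - T166207'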
[13:28:58] moritzm: I believe audephone will deploy his change later [13:29:03] !log reboot aqs1007 for kernel update [13:29:07] moritzm: From my side, I am not changing the config anymore today :-) [13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:22] Unless elukey wants me to do 181538957398573 alter tables :-) [13:29:56] marostegui: for what??? :D :D :D [13:30:23] if you mean the EventLogging one I'd need to schedule them with you :P [13:30:51] elukey: yes those!!! :p [13:31:17] marostegui: ahahhahaah whenever you have time [13:32:00] Never? :) [13:32:27] Is anyone deploying in this swat window? [13:32:55] addshore: check the backlog, audephone said he wasn't in a good position to deploy when the time arrived [13:33:16] marostegui: https://www.youtube.com/watch?v=_Z5-P9v3F8w [13:33:31] hahahahahahahaha [13:33:45] just the music style I like! [13:33:50] I knew it! [13:34:05] In that case I am going to add my 2 patches to swat and deploy them in the remaining part of the window [13:39:05] !log reboot lvs[2001-2003] (codfw primaries) for kernel update [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:34] Addshore no one was deploying [13:40:42] ack [13:40:49] I have a change but can't deploy myself right now [13:41:10] Maybe I can do it later or otherwise not a big deal [13:44:56] !log reboot aqs100[89] for kernel updates [13:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:42] audephone: dont think I'll have the time to confrotably squeeze it in before my meeting [13:46:47] syncing mine now [13:47:07] !log addshore@tin Synchronized php-1.30.0-wmf.5/extensions/RevisionSlider/modules/ext.RevisionSlider.SliderView.js: SWAT: [[gerrit:360650|Fix errors leading to wrong slider scroll postions]] T168299 (duration: 00m 46s) [13:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:16] T168299: Problems clicking slider lines and browsing with arrows on Chrome RTL - https://phabricator.wikimedia.org/T168299 [13:48:27] !log addshore@tin Synchronized php-1.30.0-wmf.6/extensions/RevisionSlider/modules/ext.RevisionSlider.SliderView.js: SWAT: [[gerrit:360648|Fix errors leading to wrong slider scroll postions]] T168299 (duration: 00m 44s) [13:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:14] thats my 2 done [13:50:41] (03CR) 10Krinkle: "fixme: This broke most usage of wmf.png. 
The symlinks for project-logos are fine, since those are used by CSS where the change from square" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [13:50:42] !log pruning old kernels on prometheus* [13:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:15] PROBLEM - DPKG on prometheus1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:51:15] PROBLEM - Disk space on prometheus2004 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [13:51:15] PROBLEM - Disk space on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [13:51:21] !log rebooting eventlog1001 for kernel update (eventlogging host) [13:51:25] PROBLEM - DPKG on prometheus2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:38] addshore not a problem [13:51:55] PROBLEM - DPKG on prometheus1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:51:55] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [13:51:55] PROBLEM - DPKG on prometheus2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:52:12] !log install exim security updates on fermium (lists) [13:52:14] !log install analysis-kuromoji plugin on relforge [13:52:15] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [13:52:15] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [13:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:55] RECOVERY - DPKG on prometheus1003 is OK: All packages OK [13:53:06] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [13:53:06] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:53:15] RECOVERY - DPKG on prometheus1004 is OK: All packages OK [13:53:15] RECOVERY - Disk space on prometheus2004 is OK: DISK OK [13:53:15] RECOVERY - Disk space on prometheus2003 is OK: DISK OK [13:53:19] the mobile apps alarms are due to aqs reboots, still didn't get why [13:53:40] Cc: urandom --^ [13:53:55] RECOVERY - Disk space on prometheus1004 is OK: DISK OK [13:54:56] RECOVERY - DPKG on prometheus2003 is OK: All packages OK [13:55:25] RECOVERY - DPKG on prometheus2004 is OK: All packages OK [13:56:28] 10Operations, 10Performance-Team, 10monitoring, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#3367352 (10Gilles) 05Open>03Resolved a:03Gilles Our new Grafana-ba... 
[13:58:05] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 116 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 110, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, task_max_waiting_in_queue_millis: 8564, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [13:58:05] ve_shards: 41, initializing_shards: 6, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [13:58:51] damn relforge above is me, already back to green, but not fast enough [13:59:05] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 145, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 157, initial [13:59:05] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [14:01:15] !log reimage ms-be1020 / ms-be1021 with stretch [14:01:17] !log restarting wdqs1001 for kernel upgrade [14:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:16] !log rebooting labvirt1001 [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:35] elukey: akosiaris: the mobileapps most-read endpoint makes several calls to the AQS Pageview API. What I don't understand is why when an aqs machine is rebooted that it doesn't get depooled from the cluster [14:03:37] !log reboot eventlog2001 for kernel update [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:43] bearND: I think that this may have happened - every aqs host runs two cassandra instances and one http/restbase/hypherswitch one [14:04:53] (too many ones but you got it) [14:05:25] PROBLEM - Host eventlog2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:32] bearND: but there is no "data locality", so it frequently happens that the http service on say aqs1004 needs data from cassandra on aqs1007 [14:05:48] that depends on the cassandra driver [14:06:21] what I am seeing now is that when I reboot a node (draining cassandra first and depooling AQS from the load balancer) [14:06:28] elukey: ah, i see. It has its own RB and cassandra instances [14:06:39] yeah [14:06:57] now this mess should not happen of course [14:07:21] data is replicated as well, in a rack-aware manner [14:07:33] hello :) [14:08:00] gwicke: my suspicion is about in flight requests from RB on aqsX to the cassandra instances that are stopped for the reboot [14:08:14] causing timeouts [14:08:24] !log reboot lvs[3003-3004] (esams secondaries) for kernel update [14:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:39] in theory, the reboot of a single machine shouldn't be noticeable; in practice, the driver sometimes doesn't mark a node as down immediately [14:08:49] elukey: Ok, having the own RB instances kinda makes sense since it's accessed through only wikimedia.org. [14:09:20] bearND: what is the impact to users for this kind of outages? I mean, how much mobile apps depends on AQS ? 
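(To make the dependency bearND describes concrete: the mobileapps most-read endpoint fans out to the AQS Pageview API, so a blip on an aqs host surfaces in the mobileapps and RESTBase checks as well. A quick way to probe both layers by hand; the public Pageview API URL is real, the internal hostnames/ports are assumptions apart from mobileapps.svc.eqiad.wmnet, which appears in the checks above.)

    # Public Pageview API, served by AQS behind RESTBase:
    curl -s 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/01/01' | head
    # Internal AQS endpoint the Icinga check exercises (service name and port assumed):
    curl -s 'http://aqs.svc.eqiad.wmnet:7232/analytics.wikimedia.org/v1/pageviews/top/en.wikipedia/all-access/2016/01/01' | head
    # Mobileapps endpoint that aggregates the above (port assumed):
    curl -s 'http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/most-read/2016/01/01' | head

If the public call succeeds while the internal ones time out, the problem sits below RESTBase, which matches the pattern seen here during the aqs reboots.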
[14:09:35] elukey: no impact really [14:09:47] the very latest driver version has some improvements in that regard [14:10:19] elukey: since the WP RB cluster stored the responses in its own cassandra and gets cached in Varnish [14:10:56] it's slower to start up with many nodes & keyspaces, though, which is a problem for RB (currently looking into that) [14:11:31] bearND: but say we are unlucky and we have a miss everywhere, what is the impact? (just want to have a clear idea) [14:11:46] elukey: If it were to persist for a very long time then the list of top read articles in the apps would be skipped for a day, that's the worst case scenario. But it wouldn't cause any app crashes. [14:11:47] what is going on? [14:11:48] lvs? [14:12:00] bearND: okok! [14:12:13] mobrovac: ciao Marko, nothing is exploding [14:12:34] we are rebooting hosts like crazy and sometimes errors creeps up :) [14:12:49] ah that business [14:13:21] mobrovac: I rebooted kafka2001 this morning, if you give me the green light I'd complete the codfw cluster [14:13:27] and tomorrow the eqiad one [14:14:00] mobrovac: sounds like when one aqs host gets rebooted another one with the RB instance still wants to talk to it [14:15:02] 10Operations, 10Performance-Team, 10Traffic: Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3367415 (10Gilles) [14:16:37] elukey: checking [14:17:18] 10Operations, 10Graphite, 10Patch-For-Review: Something puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3367437 (10fgiunchedi) [14:18:11] !log rebooting mx1001 for kernel upgrade [14:18:17] Reedy: https://gerrit.wikimedia.org/r/#/c/360652/ would like to roll out asap [14:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:34] elukey: looking good, go ahead with the rest [14:18:53] RECOVERY - Host eventlog2001 is UP: PING OK - Packet loss = 0%, RTA = 36.03 ms [14:18:55] bearND: which other RB? [14:20:49] mobrovac: well, that's from my limited understanding of the setup for the PageViews API. elukey mentioned here earlier: "every aqs host runs two cassandra instances and one http/restbase/hypherswitch one". and "there is no "data locality", so it frequently happens that the http service on say aqs1004 needs data from cassandra on aqs1007" [14:21:59] !log rebooting labvirt1002 [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:51] bearND: yes, that's AQS talking to the underlying cassandra instance, but when a request to one node fails, it retries with another one. the problem with reboots is that you need wait for the time out to expire before that happens [14:25:33] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 255 bytes in 5.431 second response time [14:25:44] !log rebooting labvirt1003 [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:33] PROBLEM - Host www.toolserver.org is DOWN: CRITICAL - Host Unreachable (www.toolserver.org) [14:29:48] hmm i cannot connect to labs [14:29:59] channel 0: open failed: connect failed: No route to host [14:29:59] stdio forwarding failed [14:29:59] ssh_exchange_identification: Connection closed by remote host [14:30:40] chasemp andrewbogott ^^ [14:30:52] paladox: 'labs'? [14:30:56] yep [14:31:21] are you subscribed to labs-l or labs-announce? 
If not, I recommend it :) [14:31:30] oh [14:31:32] yeh i am [14:31:40] but i get too many emails :) [14:31:41] ah mobrovac the reboot doesn't make the request to fail immediately but triggers a timeout [14:31:51] and hence the alarms [14:31:53] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/redis - 256 bytes in 57.497 second response time [14:32:24] !log reboot lvs[3001-3002] (esams primaries) for kernel update [14:32:33] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.105 second response time [14:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:45] in any case, I am going to do aqs1009 now [14:35:03] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.012 second response time [14:38:13] !log rebooting labvirt1004 [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:44] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:40:53] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 255 bytes in 0.836 second response time [14:40:58] !log reboot remaining scb* hosts for kernel update [14:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:02] moritzm: lvs3001 is not rebooting properly [14:44:02] There are offline or missing virtual drives with preserved cache. [14:44:03] Please check the cables and ensure that all drives are present. [14:44:03] Press any key to enter the configuration utility. [14:44:07] ever seen that before? [14:44:28] hmm, no haven't seen that before [14:45:34] ema: could it be because mdadm misbehaving again? https://phabricator.wikimedia.org/T166965 [14:46:38] marostegui: it might be because of the faulty disk yeah [14:47:06] It is supposed to ignore it, but we all know mdadm... [14:47:20] marostegui: well this is the bios saying that, not mdadm [14:47:27] ema: T166965 [14:47:28] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [14:47:49] ops, sorry I'm late, marostegui was first [14:47:59] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time [14:48:53] ema: that is not only mdadm, also megaraid [14:49:43] and the disks are each a virtual drive with raid0 between one disk (silly I know) [14:49:58] so ti makes sense that says that a virtual drive is offline/missing/broken [14:50:02] s/ti/it/ [14:50:20] volans: yeah I'm just trying to understand how to get past that [14:50:27] then why is configured that way I dunno ;) [14:50:48] can you mark the Virtual Drive as failed/offline and let it continue without it? [14:51:10] volans: I'm trying but the ui sucks [14:52:38] yeah! 
maybe chris can help you to find the right button ;) [14:52:42] :-P [14:52:59] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.210 second response time [14:53:31] ema: IIRC you can discard the preserved cache by selecting the controller then the F-key for the properties popup and one is about preserved cache [14:53:53] !log rebooting labvirt1005 [14:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:02] I'm not proud of knowning this "from memory" but swift dell hw has helped [14:54:09] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2076480 [14:56:39] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 255 bytes in 3.007 second response time [14:56:59] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time [15:01:00] !log reboot kafka200[23] for kernel updates (eventbus codfw) [15:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:27] !log reboot job runners in codfw for kernel update [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:59] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK [15:03:29] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:06] !log rebooting ruthenium for kernel update [15:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:48] lvs3001 is not coming back up because of the raid error mentioned above, the console is available if anyone wants to take a look ^ [15:09:39] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.392 second response time [15:13:05] !log rebooting labvirt1006 [15:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:05] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.137 second response time [15:17:36] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 255 bytes in 1.586 second response time [15:18:24] ACKNOWLEDGEMENT - Host scb1003 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T168534 [15:24:36] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.693 second response time [15:26:05] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.124 second response time [15:29:15] PROBLEM - Host tools.wmflabs.org is DOWN: CRITICAL - Time to live exceeded (tools.wmflabs.org) [15:30:45] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
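(For the record, the "preserved cache" prompt that kept lvs3001 from booting can usually be cleared with the LSI command-line tooling as well as from the controller's boot-time UI. A sketch with MegaCli, assuming the tool is installed and using placeholder adapter/LD numbers; on this box the missing drive is the single-disk RAID0 virtual drive mentioned above.)

    # List logical drives with pinned ("preserved") cache left over from the missing drive:
    megacli -GetPreservedCacheList -a0
    # Discard it so the controller stops waiting for the failed virtual drive at boot:
    megacli -DiscardPreservedCache -L2 -a0
    # Confirm which physical disk actually failed:
    megacli -PDList -a0 | grep -i 'firmware state'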