[00:00:06] (a ctrl + shift + r was needed to reset my CSS cache) [00:02:00] I am still not seeing it. >.> [00:02:05] Even with that. [00:02:08] Eeeegh. [00:02:09] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372087 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:02:21] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable WikidataPageBanner on test wikis (T173388) (duration: 00m 51s) [00:02:26] Isarra: caching is fun [00:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:32] T173388: Install WikidataPageBanner on test.wikipedia.org - https://phabricator.wikimedia.org/T173388 [00:03:04] For Amir's, should we do those at some point when he's around, or what? [00:03:14] I don't know [00:03:18] You could just do it without telling anyone and see how long they take to notice. [00:03:32] (03Merged) 10jenkins-bot: Enable Timeless on four French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372087 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:03:35] 372087 do the fr. one [00:03:45] (03CR) 10jenkins-bot: Enable Timeless on four French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372087 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:03:54] (live on mwdebug1002) [00:04:31] Given that it's the same number, might as well lump the english and hebrew together. [00:04:31] Also I need to go get food. And better internet. This is patently awful. [00:05:47] (03PS2) 10Dereckson: Remove wbq_evaluation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367913 (owner: 10Lucas Werkmeister (WMDE)) [00:05:53] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367913 (owner: 10Lucas Werkmeister (WMDE)) [00:05:57] Dereckson: Header looks broken on fr.wikisource. [00:07:11] ...legend is also not fixed. 
>.> [00:07:55] Dereckson: while waiting i also added 2 small config changes [00:08:04] ok if we don't want to do them today though [00:08:06] https://gerrit.wikimedia.org/r/#/c/367913/ [00:08:08] aude: seen them [00:08:11] https://gerrit.wikimedia.org/r/#/c/370846/ [00:08:18] (03Merged) 10jenkins-bot: Remove wbq_evaluation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367913 (owner: 10Lucas Werkmeister (WMDE)) [00:08:24] thanks [00:08:27] (03CR) 10jenkins-bot: Remove wbq_evaluation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367913 (owner: 10Lucas Werkmeister (WMDE)) [00:08:31] Isarra: https://phabricator.wikimedia.org/T154371#3526824 <- what I've [00:09:56] GODSDAMN ECHO. [00:09:56] Man, IRC is like tossing words into a void. [00:10:56] https://phab.wmfusercontent.org/file/data/joc3skf67xogosxmgkfm/PHID-FILE-yhyludnm7mzdethqoxv6/image.png [00:11:09] Dereckson: So what I'm seeing is like wiktionary, except way worse because one of the echo badges is kind of off the bottom entirely. [00:11:28] And I'm using both chrome and firefox and I'm not even logged in in chrome. [00:11:37] https://fr.wikisource.org/wiki/Spécial:Liste_de_suivi?useskin=timeless <- legend fixed [00:11:47] It's not fixed for me. >.> [00:12:11] even if you do a https://fr.wikisource.org/wiki/Spécial:Liste_de_suivi?useskin=timeless&debug=true ? [00:13:19] Yup. [00:13:24] aude: Remove wbq_evaluation logging live on mwdebug1002 [00:13:34] (I don't think this one is testable) [00:14:02] Isarra: perhaps caching works differently according to the datacenter :/ [00:14:12] can check that it doesn't entirely break everything though [00:14:16] * aude checks [00:14:18] I browse through esams [00:15:01] looks good [00:16:11] Dereckson: Okay, the legend is fixed on watchlist, not recentchanges. [00:16:30] arg [00:16:34] always tested watchlist [00:16:38] that explains that [00:16:52] so yes wikt.
: header broken [00:17:12] wikinews : ok [00:17:13] The echo thing may be due to a user script. [00:17:21] Different users don't get it. [00:17:30] On the other hand, I may also just have different notifications. [00:17:33] ... [00:17:39] Mleargh. [00:17:41] wikiversity: ok [00:17:48] Let's just call this a success. [00:18:17] I revert for wiktionary or we'll fix it later? [00:19:46] (03CR) 10Dereckson: [C: 032] Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 (owner: 10Matěj Suchánek) [00:19:53] (03PS2) 10Dereckson: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 (owner: 10Matěj Suchánek) [00:20:05] (03CR) 10Dereckson: [C: 032] "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 (owner: 10Matěj Suchánek) [00:21:17] Yeah, might as well revert it until a proper fix is available. [00:21:24] It'll take on-wiki configuration, but whatever. [00:21:31] (03Merged) 10jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 (owner: 10Matěj Suchánek) [00:21:41] (03CR) 10jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 (owner: 10Matěj Suchánek) [00:21:57] Isarra: and green light for wikisource/wikiversity/wikinews? [00:23:09] aude: Explicitly load badges extension live on mwdebug1002 (note: www.wikidata.org is currently on *wmf11*) [00:23:24] checking [00:23:39] wmf.13 and 14? [00:23:44] you did 12 and 14 [00:23:48] Wikinews' custom formatting looks a mite... strange, but yeah. [00:24:11] wmf.13 core should be using wmf12 wikidata stuff [00:24:20] * aude checks test wikis (wmf14) [00:24:50] looks ok on wmf14 [00:25:05] Okay, I need to go get food now bye.
[00:25:12] Bye Isarra, bon appétit [00:25:17] and thanks for your assistance during the tests [00:25:35] (03PS1) 10Dereckson: Don't deploy Timeless on fr.wiktionary for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372092 [00:25:41] oh [00:25:53] i think we made a wmf.13 branch of wikidata [00:26:25] https://gerrit.wikimedia.org/r/#/c/372093/ [00:26:39] if you don't want to wait around for jenkins, suppose i could take care of it [00:26:43] <3 [00:27:10] (03PS2) 10Dereckson: Don't deploy Timeless on fr.wiktionary for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372092 (https://phabricator.wikimedia.org/T154371) [00:27:14] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [00:28:35] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [00:29:09] aude: I still have a scap to do + write 4 messages in village pumps to notify Timeless is there (or not for wikt) [00:29:31] hmmm https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=now-3h&to=now [00:29:55] nothing on fatalmonitor [00:30:00] yeah [00:30:24] The top three are: [00:30:25] 315 proc line: 2959: warning: points must have either 4 or 2 values per line [00:30:29] 181 LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.13/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 312 [00:30:32] 19 LuaSandboxFunction::call(): unable to convert argument 1 to a lua value in /srv/mediawiki/php-1.30.0-wmf.13/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 312 [00:30:40] lua is nothing new [00:30:55] and neither is the ploticus [00:31:32] they might be upload errors [00:32:25] kibana offers Error connecting to {db_server}: {error} / Wikimedia\Rdbms\LoadMonitor::getServerStates: host {db_server} is unreachable at mediawiki-errors dashboard [00:32:50] think that is also
somewhat normal [00:33:56] looks like it went down again [00:34:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372092 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:35:37] aude: so Update Wikidata property blacklist live on mwdebug1002 [00:35:46] checking [00:36:22] (03Merged) 10jenkins-bot: Don't deploy Timeless on fr.wiktionary for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372092 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:36:32] (03CR) 10jenkins-bot: Don't deploy Timeless on fr.wiktionary for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372092 (https://phabricator.wikimedia.org/T154371) (owner: 10Dereckson) [00:36:45] looks ok [00:37:55] so config done normally [00:38:17] i see that jenkins fails for wikidata on wmf13 [00:38:31] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Timeless on three French wikis (T154371) + Fixes for Wikidata: Remove wbq_evaluation logging, Update Wikidata property blacklist ([[Gerrit:367913]] and [[Gerrit:370846]]) (duration: 00m 53s) [00:38:34] 00:32:16 1) Wikibase\Repo\Tests\Api\CreateRedirectTest::testSetRedirect_failure with data set "bad source id" ('xyz', 'Q12', 'invalid-entity-id') [00:38:37] 00:32:16 Use of ApiUsageException::getCodeString was deprecated in MediaWiki 1.29. 
[Called from s [00:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:44] same as for the submodule change [00:38:44] yeah, i'm not sure about trying to fix it [00:38:44] T154371: Review and deploy Timeless skin - https://phabricator.wikimedia.org/T154371 [00:38:55] if it is fixed in wmf14 [00:38:58] I guess any php-1.30.0-wmf.13 fix will trigger this one [00:39:11] we can just wait until tomorrow/thursday for the badges thing [00:39:16] to go out with the train [00:39:48] revert on wmf/1.30.0-wmf.14 and wmf/1.30.0-wmf.12 in such a case [00:39:55] wmf14 is ok [00:39:58] to keep [00:40:14] i don't think wmf12 is used so i wouldn't be concerned also [00:40:24] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:40:35] okay I sync [00:40:42] ok [00:41:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:42:18] !log dereckson@tin Synchronized php-1.30.0-wmf.14/extensions/Wikidata/Wikidata.php: Explicitly load badges extension ([[Gerrit:372051]]) (duration: 00m 51s) [00:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:38] !log dereckson@tin Synchronized php-1.30.0-wmf.12/extensions/Wikidata/Wikidata.php: Explicitly load badges extension ([[Gerrit:372088]]) (duration: 00m 51s) [00:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:58] aude: for wmf13 we abandon on manually submit?
[00:44:20] (or) [00:45:11] i abandoned [00:45:15] either way ok w/ me [00:45:19] ok [00:45:26] so SWAT done \o/ [00:45:30] i noted in phabricator that this goes out with the train [00:45:39] thanks [00:45:41] :) [01:06:09] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3526883 (10DarTar) Just checking in to see if there's any update on this as the 3-day period is over as of yesterday. Thanks, folks! [01:25:14] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active, AS1257/IPv6: Active [02:07:44] PROBLEM - MegaRAID on labsdb1003 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [02:23:05] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3526886 (10ayounsi) some answers from Juniper about the other issues noticed: - Presence of core dumps ``` /var/crash/corefiles: total blocks: 70484 -rw-r--r-- 1 r... 
[02:28:17] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 08m 51s) [02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:04] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0 [02:51:05] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.13) (duration: 08m 08s) [02:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 741.31 seconds [03:28:22] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 16m 32s) [03:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:43] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 16 03:35:43 UTC 2017 (duration 7m 21s) [03:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:25] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 1 [04:33:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 132.36 seconds [05:17:14] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:15] PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [05:19:34] PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:20:14] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:30:04] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3526974 (10Abraham) @ayounsi I confirm this request and ask to grant the requested access. Thanks. 
[05:32:44] PROBLEM - HHVM rendering on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [05:32:54] PROBLEM - Apache HTTP on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [05:33:04] PROBLEM - Nginx local proxy to apache on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [05:33:44] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 74200 bytes in 1.946 second response time [05:33:54] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [05:34:04] RECOVERY - Nginx local proxy to apache on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.061 second response time [05:39:24] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [05:39:44] RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active [05:41:14] RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2018-07-19 10:52:21 +0000 (expires in 337 days) [05:41:15] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on 10.192.32.137 port 9042 [06:01:18] !log Stop replication on db2076 to fix duplicate entries - T151029 [06:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:32] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029 [06:17:00] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:00] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:00] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:17:01] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:01] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:01] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:01] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:10] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:10] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:10] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:21] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:21] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:21] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:30] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:30] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:30] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:30] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:31] backups ^ [06:17:31] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:18:50] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:18:50] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:18:51] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:00] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:00] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:19:01] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:01] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:19:01] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:01] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:01] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:19:02] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:19:11] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:19:11] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:19:11] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:19:20] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:21] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:21] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:19:21] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: 
Yes [06:19:21] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:19:30] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [06:32:54] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3526998 (10elukey) {F9085708} Seems definitely solved, but some follow ups would need to be done: 1) Make sure to insta... [06:33:02] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3526999 (10elukey) p:05High>03Normal [06:36:00] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1768 MB (3% inode=97%) [06:36:20] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:10] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74108 bytes in 0.282 second response time [06:40:37] !log Run pt-table-checksum on s3 for revision table - T164488 [06:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:49] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [06:41:00] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1739 MB (3% inode=97%) [06:46:01] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1762 MB (3% inode=97%) [06:51:01] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1755 MB (3% inode=97%) [06:56:10] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: / 1715 MB (3% inode=97%) [07:00:18] checking --^ [07:02:23] seems the same invalid line (librenms. 
spam issue in the carbon logs [07:08:08] !log executed sudo find -type f -mtime +30 -exec rm {} \; in /var/log/carbon to free some space [07:08:10] RECOVERY - Disk space on graphite2001 is OK: DISK OK [07:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:41] (opening a task [07:13:53] ahhhh /var/log/carbon/*.log in logrotate does not get applied to dirs like /var/log/carbon/carbon-cache* [07:18:21] 10Operations, 10monitoring, 10Graphite: graphite2001 disk space alarms for big log files in /var/log/carbon - https://phabricator.wikimedia.org/T173401#3527040 (10elukey) [07:25:01] !log Stop MySQL on db2076 to copy its content to dbstore2001 - T168409 [07:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:14] T168409: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409 [07:39:48] elukey: thanks for the task! [07:41:34] yw! Hope that makes sense :) [07:45:11] (03CR) 10Filippo Giunchedi: [C: 031] "Is there a task associated with this change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326151 (owner: 10EBernhardson) [07:47:58] (03PS1) 10Marostegui: mariadb: Add db2077 as s7 slave [puppet] - 10https://gerrit.wikimedia.org/r/372104 (https://phabricator.wikimedia.org/T170662) [07:48:59] (03CR) 10Filippo Giunchedi: [C: 032] Reduce the per-IP concurrency limit in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/372054 (https://phabricator.wikimedia.org/T172930) (owner: 10Gilles) [07:51:34] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7441/" [puppet] - 10https://gerrit.wikimedia.org/r/372104 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [07:54:16] (03PS2) 10Marostegui: mariadb: Add db2077 as s7 slave [puppet] - 10https://gerrit.wikimedia.org/r/372104 (https://phabricator.wikimedia.org/T170662) [07:55:54] (03CR) 10Marostegui: [C: 032] mariadb: Add db2077 as s7 slave [puppet] - 10https://gerrit.wikimedia.org/r/372104 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:01:14] (03PS1) 10Marostegui: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372106 (https://phabricator.wikimedia.org/T170662) [08:03:56] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372106 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:05:27] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372106 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:05:44] (03CR) 10Filippo Giunchedi: [C: 031] Add 90s command_timeout override to nrpe_local.cfg [puppet] - 10https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) (owner: 10Herron) [08:06:17] (03CR) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372106 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:07:14] !log marostegui@tin Synchronized 
wmf-config/db-codfw.php: Depool db2047 - T170662 (duration: 01m 06s) [08:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:25] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [08:10:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [08:10:24] !log bounced pdfrender on scb1004 (T159922) [08:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:36] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [08:11:24] (03PS15) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [08:12:00] !log Stop MySQL on db2047 to copy its content to db2077 - T170662 [08:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:56] PROBLEM - mysqld processes on db2047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [08:17:07] That is me :( [08:17:12] Looks like my browser timed out [08:17:16] when I downtimed it [08:17:50] sorry [08:19:36] 10Operations, 10Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3527091 (10fgiunchedi) >>! In T173056#3524578, @Multichill wrote: > @fgiunchedi Shame we missed at Wikimania! We'll keep in touch about this. Indeed! I'm subscrib...
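The /var/log/carbon cleanup and the logrotate glob gap noted around 07:08–07:13 above can be sketched as follows. This is a hedged illustration against a throwaway directory: the file and directory names are invented, and only the `find -type f -mtime +30 -exec rm {} \;` pattern comes from the log itself.

```shell
#!/bin/sh
# Sketch of the cleanup run on graphite2001 (paths here are made up).
# `find` recurses, so it also reaches subdirectories such as carbon-cache*/
# that a non-recursive logrotate glob like /var/log/carbon/*.log never matches.
set -e

LOGDIR=$(mktemp -d)
mkdir -p "$LOGDIR/carbon-cache-a"

touch "$LOGDIR/creates.log"                                   # recent: survives
touch -d '40 days ago' "$LOGDIR/carbon-cache-a/console.log"   # stale: deleted

# The command from the log, scoped to our sandbox directory:
find "$LOGDIR" -type f -mtime +30 -exec rm {} \;
```

A logrotate-side fix would be an additional stanza whose pattern matches one level of subdirectories, e.g. `/var/log/carbon/*/*.log` — logrotate expands shell-style globs but does not recurse. The exact stanza used in production is not shown in the log.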
[08:20:16] (03PS16) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [08:30:04] (03PS17) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [08:33:12] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527119 (10fgiunchedi) >>! In T170817#3526036, @Gilles wrote: > Right off the bat, the first one with major differences, Century Schoolbook L, comes from th... [08:34:18] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527120 (10fgiunchedi) >>! In T170817#3525964, @Gilles wrote: > It's probably a minor difference in rsvg rendering. 98.8% is very good similarity. Let's dou... 
[08:37:24] !log Drop wikigrok tables from s1, s3 and s5 - T172020 [08:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:37] T172020: Drop WikiGrok tables from production - https://phabricator.wikimedia.org/T172020 [08:40:32] (03CR) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:43:18] (03PS1) 10Filippo Giunchedi: Add gsfonts build and runtime dependency [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/372117 [08:44:32] (03PS2) 10Filippo Giunchedi: Add gsfonts build and runtime dependency [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/372117 [08:45:26] (03CR) 10Filippo Giunchedi: [C: 032] Add gsfonts build and runtime dependency [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/372117 (owner: 10Filippo Giunchedi) [08:54:11] disabled puppet across most of analytics nodes for https://gerrit.wikimedia.org/r/#/c/370798 (precaution) [08:54:40] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7442/" [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:56:44] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372118 (https://phabricator.wikimedia.org/T128546) [09:02:45] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: do not hardcode jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/371034 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [09:12:32] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7443/" [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [09:13:31] (03PS4) 10Filippo Giunchedi: mediawiki: clean up deprecated fonts packages [puppet] - 
10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) [09:14:33] (03PS2) 10Filippo Giunchedi: thumbor: do not hardcode jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/371034 (https://phabricator.wikimedia.org/T170817) [09:14:50] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki: clean up deprecated fonts packages [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [09:15:33] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:05] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527176 (10fgiunchedi) >>! In T170817#3527120, @fgiunchedi wrote: >>>! In T170817#3525964, @Gilles wrote: >> It's a text rendering difference. Not that it's... [09:28:05] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: graphite2001 disk space alarms for big log files in /var/log/carbon - https://phabricator.wikimedia.org/T173401#3527188 (10fgiunchedi) [09:31:35] (03CR) 10Muehlenhoff: "The handling of ttf-ubuntu-font-family isn't correct; it's not part of Debian in general, but we built it for jessie-wikimedia: Otherwise " [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [09:50:37] (03PS1) 10Elukey: profile::druid::common: fix merge druid::properties [puppet] - 10https://gerrit.wikimedia.org/r/372122 (https://phabricator.wikimedia.org/T167790) [09:53:40] (03CR) 10Elukey: [C: 032] profile::druid::common: fix merge druid::properties [puppet] - 10https://gerrit.wikimedia.org/r/372122 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [09:58:20] (03CR) 10Lucas Werkmeister (WMDE): "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/367913 (owner: 10Lucas Werkmeister (WMDE)) [10:03:56] !log copy ubuntu-font-family-sources to stretch-wikimedia - T170817 [10:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:08] T170817: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 [10:07:10] (03PS2) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [10:07:38] (03CR) 10jerkins-bot: [V: 04-1] [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:11:36] (03PS3) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [10:15:23] (03PS1) 10Filippo Giunchedi: mediawiki: fix ttf-ubuntu-font-family handling [puppet] - 10https://gerrit.wikimedia.org/r/372125 (https://phabricator.wikimedia.org/T170817) [10:15:37] (03CR) 10Filippo Giunchedi: "Thanks Moritz! 
Fixed in https://gerrit.wikimedia.org/r/#/c/372125/" [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [10:26:44] (03PS4) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [10:27:13] (03CR) 10jerkins-bot: [V: 04-1] [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:30:14] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3527270 (10elukey) Created report https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872327 [10:30:44] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3527271 (10fgiunchedi) Indeed it looks like librenms sends both metrics with whitespace in the name and metrics without values: ``` librenms.asw-b-... [10:31:10] (03CR) 10Muehlenhoff: [C: 031] "Thanks, looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/372125 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [10:31:45] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki: fix ttf-ubuntu-font-family handling [puppet] - 10https://gerrit.wikimedia.org/r/372125 (https://phabricator.wikimedia.org/T170817) (owner: 10Filippo Giunchedi) [10:37:41] (03PS1) 10Marostegui: db-codfw.php: Repool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372128 (https://phabricator.wikimedia.org/T170662) [10:39:23] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372128 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [10:40:01] (03PS1) 10Elukey: role::analytics_cluster::monitoring::disk: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/372129 (https://phabricator.wikimedia.org/T167790) [10:41:02] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372128 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [10:41:02] (03PS2) 10Elukey: role::analytics_cluster::monitoring::disk: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/372129 (https://phabricator.wikimedia.org/T167790) [10:41:06] (03CR) 10jenkins-bot: db-codfw.php: Repool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372128 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [10:41:59] (03CR) 10Elukey: [C: 032] role::analytics_cluster::monitoring::disk: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/372129 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:42:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2076 - T170662 (duration: 00m 51s) [10:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:45] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [10:46:44] !log Stop 
replication in sync on db1015 and db1078 - T164488 [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:58] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [10:54:01] (03PS1) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [10:54:28] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:56:34] (03PS2) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [11:02:47] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3527329 (10fgiunchedi) 05Resolved>03Open Reported upstream at https://github.com/librenms/librenms/issues/7167 and https://github.com/librenms/l... [11:25:22] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:47:52] RECOVERY - MegaRAID on labsdb1003 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [12:01:56] RECOVERY - mysqld processes on db2047 is OK: PROCS OK: 1 process with command name mysqld [12:02:38] the lone page I get is for the recovery? [12:02:43] (db2047) [12:02:53] apergos: the down arrived some hours ago [12:02:56] you never got it? [12:03:31] if it was a while back I may have noted it, peeked in, and moved on [12:03:33] * apergos checks [12:03:54] ah there it is, you are right [12:04:13] Ah good, I was going to ask which phone you have, as it looks like it filters the bad news for you automatically! 
:) [12:06:52] (03PS1) 10Marostegui: s7.hosts: db2077 is now replicating s7 [software] - 10https://gerrit.wikimedia.org/r/372134 (https://phabricator.wikimedia.org/T170662) [12:07:57] (03CR) 10Marostegui: [C: 032] s7.hosts: db2077 is now replicating s7 [software] - 10https://gerrit.wikimedia.org/r/372134 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [12:08:43] (03Merged) 10jenkins-bot: s7.hosts: db2077 is now replicating s7 [software] - 10https://gerrit.wikimedia.org/r/372134 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [12:11:44] (03PS1) 10Marostegui: db2047.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/372136 (https://phabricator.wikimedia.org/T148507) [12:12:12] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:15] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7451/" [puppet] - 10https://gerrit.wikimedia.org/r/372136 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [12:17:03] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:12] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:13] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:13] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:23] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:23] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:32] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:32] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:32] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, 
cp3036_v6 [12:17:33] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:33] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:39] (03PS2) 10ArielGlenn: Use gzip -9 for compressing the Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/371946 (owner: 10Hoo man) [12:17:42] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:42] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:42] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:43] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:52] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:52] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:52] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:17:52] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:53] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3036_v4, cp3036_v6 [12:17:53] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3036_v4, cp3036_v6 [12:19:02] IIRC it was already in maintenance/acked no? 
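The patch under review above switches the Wikidata entity dump compression to gzip -9, i.e. maximum compression at extra CPU cost. A minimal Python sketch of that trade-off; the payload here is an illustrative stand-in, not real dump data, and the helper name is made up:

```python
import gzip

def compress_dump(data: bytes, level: int = 9) -> bytes:
    """Compress a dump payload at the given gzip level (9 = smallest output, most CPU)."""
    return gzip.compress(data, compresslevel=level)

payload = b'{"id": "Q42"}\n' * 5000  # stand-in for entity dump lines
best = compress_dump(payload, level=9)
fast = compress_dump(payload, level=1)
assert gzip.decompress(best) == payload  # lossless at any level; -9 only trades CPU for size
```

Either level round-trips losslessly; the only question for the dump job is whether the extra CPU time of level 9 is worth the smaller files.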
[12:20:11] ah no cp3009 [12:20:37] can't find anything in SAL or Phabricator [12:21:04] nothing on the serial console [12:21:14] no output I mean [12:21:39] I was about to check [12:21:57] I'd say to explicitly depool it and then powercycle [12:21:59] (03CR) 10ArielGlenn: [C: 032] Use gzip -9 for compressing the Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/371946 (owner: 10Hoo man) [12:22:38] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: cp3036.esams.wmnet [12:22:46] ack, just depooled it [12:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:17] super thanks [12:24:41] !log Compressing InnoDB on db2077 - T168409 [12:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:52] T168409: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409 [12:25:25] !log powercycling cp3036 [12:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:53] !log ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki testwikidatawiki --entity-type=property (T172776, T171460) [12:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:05] T172776: Property labels missing on some items - https://phabricator.wikimedia.org/T172776 [12:26:05] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460 [12:27:42] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [12:27:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [12:27:42] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [12:27:42] RECOVERY - Host cp3036 is UP: PING OK - Packet loss = 0%, RTA = 83.76 ms [12:27:43] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [12:27:43] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [12:27:52] RECOVERY - IPsec on 
cp2005 is OK: Strongswan OK - 72 ESP OK [12:27:52] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [12:27:53] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [12:28:02] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK [12:28:02] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [12:28:02] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK [12:28:02] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [12:28:02] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [12:28:12] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK [12:28:13] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK [12:28:22] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [12:28:23] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [12:28:32] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK [12:28:33] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [12:28:33] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 72 ESP OK [12:28:33] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [12:31:27] Isarra: ping? [12:33:50] (03PS1) 10Phuedx: pagePreviews: Deploy to next 100 stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372137 (https://phabricator.wikimedia.org/T162672) [12:39:50] jouncebot: next [12:39:50] In 0 hour(s) and 20 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T1300) [12:42:11] (03PS1) 10Marostegui: mysql-dbstore_codfw: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/372138 (https://phabricator.wikimedia.org/T168409) [12:46:33] (03PS2) 10Marostegui: mysql-dbstore_codfw: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/372138 (https://phabricator.wikimedia.org/T168409) [12:46:48] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3527467 (10Cmjohnson) A case for a new disk has been created. 
Your case was successfully submitted. Please note your Case ID: 5322179480 for future reference. Regarding the DB crash because of 1 disk failur... [12:48:15] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3527471 (10Marostegui) >>! In T173365#3527467, @Cmjohnson wrote: > A case for a new disk has been created. Your case was successfully submitted. Please note your Case ID: 5322179480 for future reference. >... [12:49:45] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3527487 (10Cmjohnson) Sure, sometime in the next few hours is fine. [12:50:09] (03CR) 10Marostegui: [C: 032] mysql-dbstore_codfw: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/372138 (https://phabricator.wikimedia.org/T168409) (owner: 10Marostegui) [12:58:31] (03PS1) 10Filippo Giunchedi: graphite: cleanup carbon-cache log files [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) [12:58:58] (03CR) 10jerkins-bot: [V: 04-1] graphite: cleanup carbon-cache log files [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) (owner: 10Filippo Giunchedi) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T1300). [13:00:04] jan_drewniak and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:37] o/ [13:00:55] o/ [13:00:57] o/ [13:01:21] phuedx: want to deploy your commit, or should I do it? [13:01:31] I can SWAT today! 
[13:01:44] zeljkof: if you could, then that would be good, because there's a lot to test [13:01:55] phuedx: sure, will do [13:02:34] jan_drewniak: your commit is first, this time I even know what to do! ;) [13:02:49] (03PS2) 10Filippo Giunchedi: graphite: cleanup carbon-cache log files [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) [13:04:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372118 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:06:14] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372118 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:06:25] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372118 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:08:56] !log zfilipin@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 52s) [13:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:48] !log zfilipin@tin Synchronized portals: (no justification provided) (duration: 00m 52s) [13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:05] jan_drewniak: 372118 is deployed, please check if it looks ok [13:10:23] phuedx: reviewing 372137 [13:11:16] zeljkof: yup! looks good, thanks! [13:11:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372137 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [13:12:42] phuedx: nothing special about deploying 372137? I just deploy the file to mwdebug1002? 
(I rarely deploy dblist files, so checking) [13:12:58] zeljkof: afaik, no [13:13:02] mwdebug1002 is great [13:13:09] (03Merged) 10jenkins-bot: pagePreviews: Deploy to next 100 stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372137 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [13:13:17] ok, will ping you when it's there, in a minute or so [13:13:23] (03CR) 10jenkins-bot: pagePreviews: Deploy to next 100 stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372137 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [13:13:52] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.79 seconds [13:14:22] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 351.95 seconds [13:15:12] phuedx: 372137 is at mwdebug1002, please test and let me know when I can proceed [13:15:17] excellent [13:15:24] zeljkof: there's a fair amount of testing to do [13:15:27] expect a little delay [13:15:30] but i'll try to keep you updated [13:15:45] phuedx: no problem, take your time, I'll wait [13:15:52] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [13:16:23] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [13:20:20] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3527550 (10jcrespo) > Regarding the DB crash because of 1 disk failure is odd. I will need to take the server offline. Let me know when it's safe to do so. Joe's thesis is that maybe smartpath setup doesn't... [13:23:33] ok, the existing wikis look good [13:23:52] phuedx: ok, deploying [13:23:59] zeljkof: wait [13:24:03] oh, ok [13:24:16] more testing to do? [13:24:20] gotta check a few of the new wikis [13:24:22] sec [13:24:24] ok [13:29:03] zeljkof: i've tested flows on a couple of wikis and it lgtm [13:29:14] phuedx: ok to deploy? 
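For context on the dblist deploy being discussed: wmf-config gates staged rollouts like Page Previews on dblist files, one wiki database name per line, and syncing the file is what flips the feature on for the listed wikis. A rough sketch of such a membership check, assuming a simplified dblist syntax (the helper names and sample wikis are illustrative, not the actual wmf-config code):

```python
def load_dblist(text: str) -> set[str]:
    """Parse dblist content: one db name per line; '#' comments and blanks ignored."""
    wikis = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            wikis.add(line)
    return wikis

def feature_enabled(wiki: str, dblist: set[str]) -> bool:
    """A feature gated on a dblist is on exactly when the wiki appears in it."""
    return wiki in dblist

stage1 = load_dblist("aawiki\nabwiki\n# big wikis excluded\n")
assert feature_enabled("abwiki", stage1)
assert not feature_enabled("enwiki", stage1)
```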
[13:29:40] zeljkof: one more wiki (rtl this time ;) ) [13:32:47] zeljkof: ok, go [13:33:39] phuedx: ok, deploying [13:34:40] (03CR) 10Alexandros Kosiaris: [C: 031] "That's an old trick (circa 2005-2007). IIRC from back then, many spam botnets had adapted already were respecting the delay so I don't rea" [puppet] - 10https://gerrit.wikimedia.org/r/371958 (https://phabricator.wikimedia.org/T173143) (owner: 10Herron) [13:35:06] !log zfilipin@tin Synchronized dblists/pp_stage1.dblist: SWAT: [[gerrit:372137|pagePreviews: Deploy to next 100 stage 1 wikis (T162672)]] (duration: 00m 50s) [13:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:17] T162672: Deploy page previews to 90% of users on all wikis but English and German - https://phabricator.wikimedia.org/T162672 [13:35:18] (03PS2) 10Filippo Giunchedi: hieradata: create pagecompilation account [puppet] - 10https://gerrit.wikimedia.org/r/371579 (https://phabricator.wikimedia.org/T172123) [13:35:27] phuedx: deployed, please check [13:36:38] (03PS3) 10Filippo Giunchedi: hieradata: create pagecompilation account [puppet] - 10https://gerrit.wikimedia.org/r/371579 (https://phabricator.wikimedia.org/T172123) [13:38:21] zeljkof: on it [13:40:57] (03PS4) 10Filippo Giunchedi: hieradata: create pagecompilation account [puppet] - 10https://gerrit.wikimedia.org/r/371579 (https://phabricator.wikimedia.org/T172123) [13:43:28] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: create pagecompilation account [puppet] - 10https://gerrit.wikimedia.org/r/371579 (https://phabricator.wikimedia.org/T172123) (owner: 10Filippo Giunchedi) [13:50:27] (03CR) 10Elukey: graphite: cleanup carbon-cache log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) (owner: 10Filippo Giunchedi) [13:53:07] (03CR) 10Filippo Giunchedi: graphite: cleanup carbon-cache log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372141 
(https://phabricator.wikimedia.org/T173401) (owner: 10Filippo Giunchedi) [13:53:39] phuedx: there is a slight increase in the number of reqs on the RB side (~+25reqs/sec) \o/ [13:54:02] * elukey repeats to himself: Filippo knows what he is doing, don't make silly comments in code reviews [13:54:14] :D [13:54:18] mobrovac: awesome! that's +100 wikis (in alphabetical order) - the big ones [13:54:31] sorry "minus the big ones" [13:55:28] elukey: lolz, it is a fair question though since we use logrotate everywhere [13:55:46] phuedx: yup yup, that's great! [13:55:55] in a perfect world we shouldn't and everything is funneled through syslog but meh [13:56:50] phuedx: swat window is almost done, do you need more time? [13:57:11] zeljkof: mibad, everything's looking fine [13:57:24] phuedx: ok, then closing the swat window [13:57:24] thank you! :) [13:57:30] !log EU SWAT finished [13:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:00] phuedx: thanks for flying with #releng! [13:58:13] ^ that always makes me chuckle [13:58:13] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/7457/graphite2001.codfw.wmnet/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) (owner: 10Filippo Giunchedi) [13:58:49] (03PS3) 10Filippo Giunchedi: graphite: cleanup carbon-cache log files [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) [13:59:28] phuedx: :) [14:00:55] (03CR) 10Filippo Giunchedi: [C: 032] graphite: cleanup carbon-cache log files [puppet] - 10https://gerrit.wikimedia.org/r/372141 (https://phabricator.wikimedia.org/T173401) (owner: 10Filippo Giunchedi) [14:05:33] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:05:42] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2070233 [14:05:45] 10Operations, 10monitoring, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi: graphite2001 disk space alarms for big log files in /var/log/carbon - https://phabricator.wikimedia.org/T173401#3527632 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We're now deleting carbon-cache logs older than 15d... [14:06:12] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:22] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:23] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:23] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:23] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:23] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:32] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:33] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:33] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:33] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:42] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:42] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:42] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
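The cleanup change resolved above (T173401) deletes carbon-cache logs older than 15 days rather than rotating them. A rough Python equivalent of that policy; the directory layout and helper name are assumptions, not the actual puppet code:

```python
import time
from pathlib import Path

def prune_old_logs(log_dir: str, max_age_days: int = 15) -> list[str]:
    """Delete *.log files under log_dir whose mtime is older than max_age_days.

    Returns the removed paths; roughly what a
    `find <dir> -name '*.log' -mtime +15 -delete` cron job would do.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(log_dir).rglob("*.log"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return removed
```

Deleting by age instead of rotating fits append-heavy daemon logs that nothing reads after a couple of weeks, which is the argument made in the review comments.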
[14:07:42] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:43] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:08:12] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:13] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:13] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:13] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:22] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:22] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:23] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:23] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:23] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:32] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:32] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:33] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:08:33] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [14:08:42] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [14:08:51] (03PS5) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [14:09:09] 10Operations, 10media-storage: Deleting file on Commons "Error deleting file: An unknown error occurred in 
storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3525950 (10fgiunchedi) Is there an exception id or anything like that attached to the error? I can't find anything related... [14:09:25] (03CR) 10jerkins-bot: [V: 04-1] [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [14:13:57] is it possible that page previews also caused increased misses for cache upload? I see an increase of requests to swift starting around 13:40 [14:14:32] also, what's an easy way for me to see page preview in action? [14:14:34] phuedx: ^ [14:16:20] godog: sure, it's available as a beta feature on a number of wikis, e.g. you can navigate to https://en.wikipedia.org/wiki/Special:BetaFeatures [14:16:26] (03PS6) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [14:16:26] and enable the Page Previews beta feature [14:16:44] then hover over a link to an article on the main page [14:16:51] and you'll see a preview [14:17:56] thanks phuedx ! [14:18:03] godog: sure [14:18:22] so yeah definitely will hit upload more as well when the page has an image [14:18:30] godog: yes, that [14:21:22] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [14:21:30] that's me ^ [14:23:06] godog: is the increase in cache misses worrying? [14:24:22] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [14:24:26] Hi... we've got a global rename stuck [14:24:37] well, more like 13 global renames stuck [14:24:38] <_< [14:25:10] Dereckson: Hi. 
[14:25:27] phuedx: I'm not sure yet, I see about +30% requests to swift which might be temporary as varnish fills up [14:27:59] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: 13 global renames stuck - https://phabricator.wikimedia.org/T173419#3527677 (10MarcoAurelio) [14:29:49] !log Stress testing Thumbor from single IP [14:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:02] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio: 13 global renames stuck - https://phabricator.wikimedia.org/T173419#3527692 (10Marostegui) I am removing the DBA tag, as there is not much for us to do here :-) Once you have the tasks for each rename, if they need to happen, ping us as you normal... [14:30:07] phuedx: is the traffic team aware of what will be coming up in terms of load to upload? I'm asking because varnish upload already struggles a bit since it has a ton of objects [14:30:35] (03PS1) 10EBernhardson: Apply token count limits to phrase queries on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372154 (https://phabricator.wikimedia.org/T172653) [14:31:01] marostegui: it is not that I want to rename 13 users, it's that the jobs for those have failed for some reason and need to be reenqueued [14:32:50] (03PS3) 10EBernhardson: [cirrus] Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: 10DCausse) [14:33:14] phuedx: just to make sure I understood, today's deploy enabled page previews for all users on the wikis where it was enabled, regardless of whether users had opted into the beta feature?
[14:33:19] (03PS7) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [14:34:44] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Unblock 13 global renames stuck at Meta-Wiki and elsewhere - https://phabricator.wikimedia.org/T173419#3527700 (10MarcoAurelio) [14:34:59] godog: sorry -- just finishing up a meeting [14:35:04] i have seen your messages! [14:35:12] phuedx: no worries, it can wait [14:35:32] Dereckson: can you run fatalmonitor for me and see why centralauth renames are all failing at Meta? [14:36:33] I'll end requesting access to logstash :( [14:36:42] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2088000 [14:38:13] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3527708 (10Gilles) I still triggered 502s, that wasn't sufficient. The script used (file is purged before proceedin... [14:38:33] !log Thumbor stress test finished [14:38:36] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3527709 (10mobrovac) I tried setting a time-out during the initialisation process th... 
[14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] (03PS1) 10Elukey: role::cache::kafka::webrequest: tune graphite alarms [puppet] - 10https://gerrit.wikimedia.org/r/372155 (https://phabricator.wikimedia.org/T172681) [14:39:42] (03CR) 10jerkins-bot: [V: 04-1] role::cache::kafka::webrequest: tune graphite alarms [puppet] - 10https://gerrit.wikimedia.org/r/372155 (https://phabricator.wikimedia.org/T172681) (owner: 10Elukey) [14:41:42] godog: to clarify, the change that was just deployed enabled page previews for all anons on those wikis, the feature is disabled by default for logged in users [14:42:03] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:42:50] we had a meeting with services and traffic way back to talk through deploying page previews but then the deployment was stalled for a couple of months on a critical instrumentation bug [14:42:53] (03PS2) 10Elukey: role::cache::kafka::webrequest: tune graphite alarms [puppet] - 10https://gerrit.wikimedia.org/r/372155 (https://phabricator.wikimedia.org/T172681) [14:43:11] i'll admit that i dropped the ball in notifying traffic of today's deployment (because i forgot the team's name) [14:43:21] i'll do that now [14:43:45] phuedx: ack, thanks! 
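phuedx's clarification above implies a simple decision rule: on a stage-1 wiki, anonymous users always get Page Previews, while logged-in users follow their preference, which defaults to off. A sketch of that rule under those assumptions (the function name is hypothetical, not the actual extension code):

```python
def page_previews_active(is_anon: bool, wiki_in_stage1: bool, user_pref_enabled: bool) -> bool:
    """Anons get the feature wherever it is rolled out; logged-in users keep their opt-in."""
    if not wiki_in_stage1:
        return False
    return True if is_anon else user_pref_enabled

# Anon on a stage-1 wiki: on, regardless of any stored preference.
assert page_previews_active(is_anon=True, wiki_in_stage1=True, user_pref_enabled=False)
# Logged-in user who never opted in: still off.
assert not page_previews_active(is_anon=False, wiki_in_stage1=True, user_pref_enabled=False)
```

This also explains the traffic concern in the surrounding discussion: anonymous readers dominate pageviews, so the anon-only switch is what drives the extra thumbnail load on cache upload and swift.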
[14:43:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[14:44:05] there's a couple of varnish machines with mailbox problems, I'll restart those as it might be the cause of 500s
[14:44:45] !log restart varnish on cp1049 to clear mailbox lag
[14:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[14:45:45] godog: i'm curious about the spike in load of the varnish upload
[14:45:53] (03PS1) 10Mobrovac: PDF Render: Lower the concurrency to 4 [puppet] - 10https://gerrit.wikimedia.org/r/372156 (https://phabricator.wikimedia.org/T159922)
[14:46:07] (03PS8) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[14:46:10] should the team hold off on rolling out until that can be checked out?
[14:46:42] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0
[14:47:06] phuedx: when are the rest of the rollouts planned for? further objects into varnish upload might exacerbate the problems we've seen yeah
[14:47:30] we were going to do more today -- but if there's a reason to stop or to roll back then of course we'll do that
[14:47:40] are you on the traffic list?
[14:48:02] !log restart varnish on cp1074 to clear mailbox lag
[14:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:49] phuedx: I'm not, what's the address?
[14:49:02] phuedx: but yeah please hold off for today until the traffic team is aware
[14:49:12] godog: roger
[14:52:43] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2110425
[14:53:22] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 340.19 seconds
[14:53:43] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.10 seconds
[14:54:30] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3527742 (10Gilles) What I need to verify is whether a thumbor process is truly blocking while waiting on a poolcount...
[14:54:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:55:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:55:25] these 500s alerts are all delayed btw, I'm not seeing the same in logstash
[14:55:43] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[14:56:32] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:56:38] !log restart varnish on cp1099 to clear mailbox lag
[14:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:53] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:57:11] (03PS9) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[14:57:32] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
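The cp1049/cp1074/cp1099 alerts above follow a simple threshold check on the Varnish expiry mailbox lag counter. A minimal sketch of that alert logic (the threshold value and function name here are illustrative assumptions, not the actual Icinga plugin):

```python
# Sketch of the mailbox-lag alert seen in the log above: compare the
# expiry mailbox lag against a critical threshold and emit an
# Icinga-style status line. CRITICAL_MAILBOX_LAG is an assumed value;
# the real check's threshold may differ.
CRITICAL_MAILBOX_LAG = 2_000_000

def classify_mailbox_lag(lag: int) -> str:
    """Return an Icinga-style status string for a given mailbox lag."""
    if lag >= CRITICAL_MAILBOX_LAG:
        return f"CRITICAL: expiry mailbox lag is {lag}"
    return f"OK: expiry mailbox lag is {lag}"
```

With the lag values from the log, `classify_mailbox_lag(2088000)` comes back CRITICAL, and after the restart `classify_mailbox_lag(0)` recovers to OK.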
[14:57:52] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.77 seconds
[14:58:02] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:03] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:03] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:03] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:12] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:22] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.16 seconds
[14:58:32] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:32] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:33] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:33] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:33] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:38] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Unblock 13 global renames stuck at Meta-Wiki and elsewhere - https://phabricator.wikimedia.org/T173419#3527677 (10RuyP) Just out of curiosity. Would it be possible to prevent this from occurring again by blocking further renames whenever there was...
[14:58:43] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:43] (03CR) 10Mobrovac: JobQueueEventBus: Enable group1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko)
[14:58:52] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:52] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:52] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:53] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:02] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:02] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:02] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:02] Going to silence that annoying dbstore1001 and its backups
[15:00:22] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[15:00:22] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:00:23] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:23] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[15:00:23] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:23] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:00:32] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:00:42] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:42] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:42] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:00:42] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:43] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:00:49] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3527754 (10elukey) a:03elukey
[15:00:52] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:00:52] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[15:00:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:00:52] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:01:02] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:01:02] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:01:02] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[15:01:02] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[15:01:03] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[15:01:12] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:02:17] PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 387.56 seconds
[15:02:22] godog: what's yer email?
[15:02:39] phuedx: filippo@wikimedia.org
[15:02:52] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0
[15:02:53] sorry -- couldn't look it up for some reason :/
[15:02:58] checking that lag, probably our old friend the jobqueue
[15:03:06] (03PS10) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:03:45] (03CR) 10Alexandros Kosiaris: [C: 032] PDF Render: Lower the concurrency to 4 [puppet] - 10https://gerrit.wikimedia.org/r/372156 (https://phabricator.wikimedia.org/T159922) (owner: 10Mobrovac)
[15:03:58] purge lag is high, buffer pool efficiency very low
[15:03:59] Oh, it is a vslow..why did it page..
[15:04:48] There is a select that has been running for more than a minute now
[15:05:19] TabbyCat: Krinkle yesterday bumped against this: 4 fatal error: Argument 1 passed to SpoofUser::batchRecord() must be an instance of Wikimedia\Rdbms\Database, Wikimedia\Rdbms\DBConnRef given in /srv/mediawiki/php-1.30.0-wmf.14/extensions/AntiSpoof/Spo
[15:05:22] it will scan this number of rows: 1568883070
[15:05:23] ofUser.php on line 107
[15:05:24] \o/
[15:05:35] TabbyCat: trying to login to a new local account
[15:06:00] check if a task already exists, and if not, create one against wikimedia log errors please
[15:06:09] Isarra: what's the setting to change for fr.wikt?
[15:06:18] Dereckson: do you think that is related?
[15:06:23] phuedx: np, to be clear I'm not 100% sure PP was the cause of increased rate of requests to swift but it definitely could, the traffic team heads up is a good idea regardless
[15:06:28] that is normal, but lag keeps going up
[15:06:33] I already created a task for the 14 (yes, one more) stuck renames
[15:06:49] godog: absolutely!
[15:07:17] the email's a bit of a mea culpa, a bit of a timeline, and a "what's next?"
[15:07:35] Raid looks good, and disks have no errors
[15:07:55] TabbyCat: it's a 500 triggering when you have a SUL account and want a local account on a wiki with the AntiSpoof extension.
[15:08:40] lots of UPDATE `commonswiki`.`page`
[15:09:40] our friend: https://phabricator.wikimedia.org/T164173#3515010 ?
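The AntiSpoof fatal quoted above is a type-hint mismatch: `SpoofUser::batchRecord()` demanded the concrete `Database` class, but callers hand it a `DBConnRef`, a lazy wrapper that forwards calls to a real connection. A minimal Python analogue of the bug and the duck-typed fix (class and method names are illustrative, not the actual MediaWiki code):

```python
# Analogue of the fatal above: a function whose signature insists on the
# concrete class fails when handed a wrapper that behaves identically.
class Database:
    def query(self, sql: str) -> str:
        return f"ran: {sql}"

class DBConnRef:
    """Lazy reference that proxies every call to a real Database
    (loosely modelled on Wikimedia\\Rdbms\\DBConnRef)."""
    def __init__(self, conn: Database) -> None:
        self._conn = conn
    def query(self, sql: str) -> str:
        return self._conn.query(sql)

def batch_record_strict(db) -> str:
    # The over-strict check: rejects the wrapper even though it works.
    if not isinstance(db, Database):
        raise TypeError("must be an instance of Database")
    return db.query("INSERT INTO spoofuser ...")

def batch_record_fixed(db) -> str:
    # Duck-typed: accepts either the connection or the reference.
    return db.query("INSERT INTO spoofuser ...")
```

The sync of `SpoofUser.php` later in this log ([17:07:36], T173394) is the deployment of the corresponding fix.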
[15:09:49] !log restart varnish on cp1072 to clear mailbox lag
[15:09:51] that is millions of page_touched updates
[15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:08] yes, it is that
[15:10:30] made worse by row based replication and weight 0
[15:10:54] jynus: I talked to daniel and hoo during wikimania and apparently the fix will be released this week or in the next few days
[15:11:07] as it was already merged, but the release got reverted
[15:11:19] I will ping them to see if they know a more accurate date
[15:11:26] but it affects all shards: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=4&fullscreen&orgId=1
[15:11:36] I mean, all databases
[15:12:03] just db1068 takes the worst part, but it activates read only on most dbs
[15:12:21] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3527782 (10Marostegui) @hoo this happened again just now. As we talked during wikimania here is the...
[15:12:23] 64
[15:12:38] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=4&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=1502809950675&to=1502896350675
[15:13:35] (03CR) 10Herron: [C: 032] Add 90s command_timeout override to nrpe_local.cfg [puppet] - 10https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) (owner: 10Herron)
[15:13:45] (03PS4) 10Herron: Add 90s command_timeout override to nrpe_local.cfg [puppet] - 10https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921)
[15:14:23] Dereckson: a new rename that was just performed by Litlok has become stuck at Meta-Wiki but not on the other wikis... and all have the AntiSpoof extension enabled...
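The lag discussed above (T164173) comes from job-queue cache invalidations updating millions of `page_touched` rows in effectively one go, amplified by row-based replication. A common mitigation for bulk updates like this is to split the work into batches and wait for replicas to catch up between them; a sketch of that pattern under assumed names (this is not MediaWiki's actual implementation):

```python
# Illustrative batching pattern for a huge UPDATE: process page IDs in
# chunks, and between chunks call a wait-for-replication hook so replicas
# never fall far behind. Function and parameter names are assumptions.
def batched_touch(page_ids, batch_size=1000, wait_for_replication=None):
    """Return the number of batches processed."""
    batches = [page_ids[i:i + batch_size]
               for i in range(0, len(page_ids), batch_size)]
    for batch in batches:
        # Here a real job would run:
        #   UPDATE page SET page_touched = ... WHERE page_id IN (<batch>)
        if wait_for_replication is not None:
            wait_for_replication()  # block until replica lag is acceptable
    return len(batches)
```

The point is that each replicated transaction stays small, so replicas apply it quickly instead of stalling on one multi-million-row statement.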
[15:14:36] all are failing at metawiki
[15:15:52] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527791 (10Gilles) I've compared deployment-imagescaler02 again and I see rendering differences for kochi fonts. Isn't it the same iss...
[15:21:42] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
[15:22:12] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 25.48 seconds
[15:27:54] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527816 (10MoritzMuehlenhoff) fonts-mgopen was removed from Debian after the jessie release: Turns out the removal request came from @...
[15:28:01] 10Operations, 10Page-Previews, 10Traffic: [Spike] Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527818 (10phuedx)
[15:28:26] 10Operations, 10Page-Previews, 10Traffic: [Spike] Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527831 (10phuedx)
[15:28:35] marostegui: I am prepping to drop a huge table on dbstore1002,db1047 and db1046
[15:29:06] \o\ |o| /o/
[15:29:31] RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 0.35 seconds
[15:30:19] elukey: you planning to rename it first to see if something breaks?
[15:31:18] marostegui: the eventlogging event is not used anymore and we backed up data on hdfs, I'd say we could simply drop
[15:34:22] Sure! Whatever you think is fine
[15:34:50] (03PS11) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:35:06] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Unblock 13 global renames stuck at Meta-Wiki and elsewhere - https://phabricator.wikimedia.org/T173419#3527874 (10MarcoAurelio) A new global rename has become stuck just now: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Roquetero...
[15:35:16] 10Operations, 10Page-Previews, 10Traffic: Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527879 (10phuedx)
[15:35:27] (03CR) 10jerkins-bot: [V: 04-1] [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel)
[15:35:35] godog: any detail you could add to https://phabricator.wikimedia.org/T173422 would be greatly appreciated
[15:35:57] !log Rename cx_drafts table on db1029 - T172364
[15:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:10] T172364: Remove cx_drafts table from production - https://phabricator.wikimedia.org/T172364
[15:40:32] (03PS12) 10Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:41:24] (03PS1) 10Gilles: Enable Thumbor webp original support [puppet] - 10https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939)
[15:41:33] !log delete outdated CFs cassandra metrics from graphite2002 and graphite1003
[15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:06] phuedx: yup will do! still looking into it
[15:43:07] (03CR) 10Ppchelko: JobQueueEventBus: Enable group1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko)
[15:43:09] !log drop PageContentSaveComplete_5588433_15423246 from db1047 and dbstore1002 (analytics-slaves)
[15:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:33] 10Operations, 10Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#3527935 (10Gilles) As a note, just looking at yesterday's data, nginx 502s once per minute on average. Much larger old error log files suggest that this might peak a...
[15:43:41] marostegui: any action to take to see space freed?
[15:43:47] 10Operations, 10Performance-Team, 10Thumbor: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#3527938 (10Gilles) p:05Normal>03High a:03Gilles
[15:45:12] Thanks marostegui
[15:47:33] (03CR) 10Alexandros Kosiaris: "All pdfrender services have been restarted successfully. This may have just worked." [puppet] - 10https://gerrit.wikimedia.org/r/372156 (https://phabricator.wikimedia.org/T159922) (owner: 10Mobrovac)
[15:47:54] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio, 10Wikimedia-log-errors: Unblock 13 global renames stuck at Meta-Wiki and elsewhere - https://phabricator.wikimedia.org/T173419#3527945 (10MarcoAurelio) @Platonides said to add this, so it can be investigated.
[15:49:51] Dereckson: [[MediaWiki:timeless-sitetitle]], I think.
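The "rename it first to see if something breaks" suggestion above (and the `Rename cx_drafts table` log entry) is a standard cautious-drop pattern: park the table under a new name so anything still reading it fails loudly, then drop it only after a grace period. A sketch of the two-step statements (the naming convention here is illustrative, not the DBAs' actual one):

```python
# Generate the two statements for a cautious table drop: rename first so
# lingering readers surface as errors, drop only once nothing broke.
# The "_dropped_<ticket>" suffix is an assumed convention.
def safe_drop_statements(table: str, ticket: str) -> tuple[str, str]:
    parked = f"{table}_dropped_{ticket}"
    rename_sql = f"RENAME TABLE {table} TO {parked};"
    drop_sql = f"DROP TABLE {parked};"
    return rename_sql, drop_sql
```

For the table in the log, `safe_drop_statements("cx_drafts", "T172364")` yields the rename to run now and the drop to run after the grace period.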
[15:50:04] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3527955 (10Gilles)
[15:53:23] (03PS1) 10Jdlrobson: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672)
[15:54:26] (03PS2) 10Jdlrobson: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672)
[15:54:52] !log Stop MySQL and shutdown db1078 for HW checks - T173365
[15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:05] T173365: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365
[15:55:09] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527965 (10Jdlrobson)
[15:56:33] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3527971 (10Gilles) OK, so if I'm following that means people are now advised to use other fonts than these ones, right? Meaning it's o...
[16:03:53] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3528004 (10Papaul) @elukey Good morning Papaul, I would suggest booting the system to our Support Live ISO and running the stressapptest for an extended period of time. The ISO can be downloaded from the fo...
[16:04:36] elukey: what do you mean?
[16:08:23] marostegui: I can see only 100G dropped in space used..
[16:09:08] in all the hosts?
[16:14:46] 10Operations, 10DBA, 10media-storage, 10monitoring: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3528035 (10herron)
[16:14:49] 10Operations, 10monitoring, 10Patch-For-Review: Nrpe command_timeout and "Service Check Timed Out" errors - https://phabricator.wikimedia.org/T172921#3528032 (10herron) 05Open>03Resolved This looks good so far. The 4 ms-be10NN HP RAID checks that were in service timed out state before deploying are now...
[16:16:57] (03PS3) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710)
[16:19:10] 10Operations, 10DBA, 10media-storage, 10monitoring: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3528044 (10jcrespo) Should we close this too, or too early to say? @herron
[16:23:04] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3528059 (10Eevans)
[16:23:24] 10Operations, 10monitoring, 10Patch-For-Review: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#3528063 (10herron)
[16:23:26] 10Operations, 10DBA, 10media-storage, 10monitoring: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3528060 (10herron) 05Open>03Resolved a:03herron Sure, sounds good to me. We could always reopen and evaluate if the issue occurs again in the fut...
[16:24:36] 10Operations, 10monitoring: Nrpe command_timeout and "Service Check Timed Out" errors - https://phabricator.wikimedia.org/T172921#3528065 (10herron)
[16:26:08] (03PS1) 10Ayounsi: Add goransm to the mw-log-readers group [puppet] - 10https://gerrit.wikimedia.org/r/372165 (https://phabricator.wikimedia.org/T171958)
[16:30:16] (03CR) 10Mobrovac: [C: 031] JobQueueEventBus: Enable group1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko)
[16:41:49] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3528105 (10Cmjohnson) I verified the settings, everything appears normal.
[16:43:22] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3481349 (10elukey) @Addshore if the data is not sensitive we could set up a rsync job in `statistics::rsync::mediawiki` and get the lo...
[16:45:46] marostegui: yep in all the hosts..
[16:50:16] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3528134 (10elukey) @Papaul sounds good to me. The host is now in maintenance for Icinga (until Sept 9th), and depooled from any service. You are free to do the test whenever you prefer :)
[16:51:25] 10Operations, 10Electron-PDFs, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3528137 (10mobrovac)
[16:52:19] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#3528138 (10fgiunchedi)
[17:07:36] !log demon@tin Synchronized php-1.30.0-wmf.14/extensions/AntiSpoof/SpoofUser.php: T173394 (duration: 00m 51s)
[17:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:48] T173394: Fatal error (blank page) served after logging in - https://phabricator.wikimedia.org/T173394
[17:12:46] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3528243 (10fgiunchedi) So the increase in swift requests seem to be cyclic (daily) and correspon...
[17:15:25] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:15] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 74096 bytes in 0.113 second response time
[17:17:23] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3528256 (10Cmjohnson) @robh I must've confused this with one of the other lab servers..no controller card present on labmon1002.....only 4 disk couldn't do a Raid10 if...
[17:26:15] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3224974 (10thcipriani) >>! In T164173#3527782, @Marostegui wrote: > @hoo this happened again just n...
[17:26:41] 10Operations, 10ORES, 10Scoring-platform-team: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3528314 (10Halfak)
[17:30:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3528336 (10Cmjohnson) a:05Cmjohnson>03RobH
[17:31:15] !log T169939: Rolling restart of Cassandra instances, codfw, rack b
[17:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:29] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[17:33:55] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[17:34:15] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused
[17:34:55] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-07-19 10:52:04 +0000 (expires in 336 days)
[17:35:16] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042
[17:37:21] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: Delete graphite metrics for old CFs - https://phabricator.wikimedia.org/T173436#3528352 (10fgiunchedi)
[17:38:16] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused
[17:38:51] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3528372 (10Cmjohnson) this has been slow progress...During the initial racking, all the screws were tightened too tight and now have to be drilled off.
[17:39:16] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042
[17:45:44] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3528397 (10greg)
[17:46:10] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3224976 (10greg)
[17:46:35] PROBLEM - cassandra-a SSL 10.192.16.165:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[17:46:36] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.165 and port 9042: Connection refused
[17:47:36] RECOVERY - cassandra-a SSL 10.192.16.165:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-a valid until 2018-07-19 10:52:10 +0000 (expires in 336 days)
[17:47:36] RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.165 port 9042
[17:50:45] PROBLEM - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[17:50:45] PROBLEM - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.166 and port 9042: Connection refused
[17:51:45] RECOVERY - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-b valid until 2018-07-19 10:52:11 +0000 (expires in 336 days)
[17:51:46] RECOVERY - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.166 port 9042
[17:54:28] (03CR) 10Ayounsi: (WIP): Add SNMP classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis)
[17:54:45] PROBLEM - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.167 and port 9042: Connection refused
[17:55:45] RECOVERY - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.167 port 9042
[18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T1800). Please do the needful.
[18:00:05] Pchelolo and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:24] Here
[18:01:16] I can SWAT
[18:02:15] \o
[18:02:23] Pchelolo: so group1 is still a mixed bag of wmf.11 and wmf.13. I plan to roll everything forward to wmf.14 for train today. Is that fine for your patch?
[18:02:39] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799650 (10greg) > Stage 4, August 2017 > * retire OCG service Just a note from {T129142}: We (RelEng, Ops, a... [18:02:50] thcipriani: I've backported all the needed patches to .11 [18:02:52] cool :) [18:03:11] (03PS6) 10Thcipriani: JobQueueEventBus: Enable group1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:03:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:04:51] (03Merged) 10jenkins-bot: JobQueueEventBus: Enable group1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:05:04] ebernhardson: could you +1 the backports: https://gerrit.wikimedia.org/r/#/c/372169/ https://gerrit.wikimedia.org/r/#/c/372170/ [18:05:32] thcipriani: done [18:05:38] thanks :) [18:05:55] Pchelolo: thcipriani: yay for JQ \o/ [18:06:12] (03CR) 10jenkins-bot: JobQueueEventBus: Enable group1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:06:17] mobrovac: it's not deployed yet, so too early to yay [18:06:33] Pchelolo: does this remove redis from the pipeline for jobs? Should i be closely monitoring my deduplicated jobs this week? :) [18:06:33] :) [18:06:44] Pchelolo: change is live on mwdebug1002, check please [18:07:00] ebernhardson: nooooo, that is still far ahead [18:07:03] ok [18:07:05] it's just some preparation work [18:07:33] thcipriani: testing. 
I might need a bit more time to test this, it's a fairly big thing [18:08:39] ok [18:08:56] ebernhardson: but do expect a meeting soonish about converting CirrusSearch jobs :P [18:09:05] mobrovac: :) [18:16:17] ok thcipriani the change works and I can't see any blockers to proceed [18:16:58] Pchelolo: ok, going live [18:18:45] !log T169939: Rolling restart of Cassandra instances, codfw, rack c [18:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:58] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [18:20:13] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:370975|JobQueueEventBus: Enable group1]] T163380 (duration: 00m 54s) [18:20:19] ^ Pchelolo live now [18:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:26] T163380: Support posting Jobs to EventBus simultaneously with normal job processing - https://phabricator.wikimedia.org/T163380 [18:20:31] Thank you, I'll continue monitoring [18:20:59] thanks [18:21:05] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:21:35] aude: Dereckson when I pulled down changes for wmf.13/14 I saw changes for https://gerrit.wikimedia.org/r/#/q/Ie7f4db1f357ebc73832989d1fbc21e8dc16b05dc come down, too. Does this need a deployment? [18:21:56] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3528601 (10Marostegui) >>! In T164173#3528312, @thcipriani wrote: >>>! In T164173#3527782, @Maroste... 
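(Editor's note: the "live on mwdebug1002, check please" step above refers to exercising a staged change by routing a request through the debug appserver before it is synced everywhere. A minimal sketch of how such a request can be pinned to a debug backend, assuming the conventional `X-Wikimedia-Debug` header and the `mwdebug1002.eqiad.wmnet` backend name documented on Wikitech; the exact header value in use may differ.)

```python
# Sketch: build (but do not send) a request pinned to a specific debug
# backend via the X-Wikimedia-Debug header. Hostname and header format
# are assumptions following Wikitech convention, not taken from this log.
import urllib.request


def debug_request(url, backend="mwdebug1002.eqiad.wmnet"):
    """Return a urllib Request carrying an X-Wikimedia-Debug routing header."""
    req = urllib.request.Request(url)
    req.add_header("X-Wikimedia-Debug", f"backend={backend}")
    return req


req = debug_request("https://test.wikipedia.org/wiki/Main_Page")
# urllib stores header keys capitalized, hence the lookup key below.
print(req.get_header("X-wikimedia-debug"))
```

(Sending the request for real would hit production; testers typically set the same header through a browser extension instead.)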
[18:22:06] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2018-07-19 10:52:15 +0000 (expires in 336 days) [18:23:11] ebernhardson: Disable cirrus MLR ab test for WikimediaEvents is live on mwdebug1002, check please [18:23:31] for both 13 and 14 [18:23:36] thcipriani: looking [18:24:46] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:26:02] thcipriani: looks good [18:26:21] ebernhardson: okie doke, going live wmf.14 then wmf.13 [18:28:46] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3528623 (10Gilles) Some relieving news: I've tested a specific lock (per-original) and the event-based async-like behavior of thumbor work... [18:29:01] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:372170|Disable cirrus MLR ab test]] T171214 (duration: 00m 51s) [18:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:15] T171214: Interleaved results A/B test: turn off test - https://phabricator.wikimedia.org/T171214 [18:29:16] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.136 and port 9042: Connection refused [18:30:16] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.136 port 9042 [18:30:32] !log thcipriani@tin Synchronized php-1.30.0-wmf.13/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:372169|Disable cirrus MLR ab test]] T171214 (duration: 00m 50s) [18:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:16] (03PS2) 10Thcipriani: Apply token count limits to phrase queries 
on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372154 (https://phabricator.wikimedia.org/T172653) (owner: 10EBernhardson) [18:31:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372154 (https://phabricator.wikimedia.org/T172653) (owner: 10EBernhardson) [18:32:55] (03Merged) 10jenkins-bot: Apply token count limits to phrase queries on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372154 (https://phabricator.wikimedia.org/T172653) (owner: 10EBernhardson) [18:33:04] (03CR) 10jenkins-bot: Apply token count limits to phrase queries on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372154 (https://phabricator.wikimedia.org/T172653) (owner: 10EBernhardson) [18:33:25] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [18:34:27] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.137 port 9042 [18:34:27] ebernhardson: ^ is live on mwdebug1002, check if possible please [18:34:35] well...the last jenkins-bot thing :) [18:37:25] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:38:25] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2018-07-19 10:52:22 +0000 (expires in 336 days) [18:38:42] thcipriani: seems reasonable too [18:39:27] * thcipriani syncs [18:41:25] PROBLEM - cassandra-c SSL 10.192.32.139:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:41:45] PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.139 and port 9042: Connection refused [18:41:48] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:372154|Apply token count limits to 
phrase queries on all wikis]] T172653 (duration: 00m 53s) [18:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:17] (03PS4) 10Thcipriani: [cirrus] Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: 10DCausse) [18:42:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: 10DCausse) [18:42:26] RECOVERY - cassandra-c SSL 10.192.32.139:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-c valid until 2018-07-19 10:52:23 +0000 (expires in 336 days) [18:42:46] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.139 port 9042 [18:43:18] (03PS2) 10Ayounsi: (WIP): Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [18:43:48] (03Merged) 10jenkins-bot: [cirrus] Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: 10DCausse) [18:45:45] PROBLEM - cassandra-a SSL 10.192.32.143:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:46:08] (03CR) 10jenkins-bot: [cirrus] Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) (owner: 10DCausse) [18:46:15] PROBLEM - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.143 and port 9042: Connection refused [18:46:45] RECOVERY - cassandra-a SSL 10.192.32.143:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-a valid until 2018-07-19 10:52:38 +0000 (expires in 336 days) [18:47:15] RECOVERY - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is OK: 
TCP OK - 0.036 second response time on 10.192.32.143 port 9042 [18:49:11] ACKNOWLEDGEMENT - HTTPS on netmon2001 is CRITICAL: SSL CRITICAL - Certificate librenms.wikimedia.org expired daniel_zahn https://phabricator.wikimedia.org/T172712 [18:49:21] (03CR) 10Ayounsi: "Not sure if it's how I'm supposed to proceed (piggyback on that CR). But the two things needed to have it work was:" [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [18:49:28] ebernhardson: tune ordering of crossproject results is live on mwdebug1002, check please [18:50:16] PROBLEM - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:50:25] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [18:51:25] RECOVERY - cassandra-b SSL 10.192.32.144:7001 on restbase2008 is OK: SSL OK - Certificate restbase2008-b valid until 2018-07-19 10:52:39 +0000 (expires in 336 days) [18:51:26] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [18:52:55] thcipriani: looks sane [18:53:01] ebernhardson: ok, going live [18:54:05] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:56:03] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368776|cirrus Tune ordering of crossproject search results on enwiki]] T171803 PART I (duration: 00m 51s) [18:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:13] T171803: Update to ordering of the sister project snippet display - https://phabricator.wikimedia.org/T171803 [18:57:29] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:368776|cirrus Tune ordering of crossproject search results on enwiki]] T171803 PART II (duration: 00m 50s) 
[18:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:47] ^ ebernhardson all done [18:58:17] thcipriani: still looks good. Thanks! [18:58:30] thanks for doublechecking :) [19:00:05] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T1900). Please do the needful. [19:00:15] * thcipriani does the needful [19:03:57] 10Operations, 10JobRunner-Service, 10Performance-Team, 10monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3528746 (10Gilles) p:05Triage>03Low [19:04:52] 10Operations, 10JobRunner-Service, 10Performance-Team, 10monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3528748 (10Krinkle) >>! In T172479#3502395, @greg wrote: > Adding ~~`#mediawiki-platform-team`~~`#performance-team`... [19:06:55] !log T169939: Rolling restart of Cassandra instances, codfw, rack d [19:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:06] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [19:11:04] (03PS1) 10Herron: WIP: Add acl to warn on forged HELO messages on lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/372174 (https://phabricator.wikimedia.org/T173338) [19:26:10] (03PS1) 10Thcipriani: group1 wikis to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372177 [19:26:13] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372177 (owner: 10Thcipriani) [19:27:42] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372177 (owner: 10Thcipriani) [19:27:51] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.13 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/372177 (owner: 10Thcipriani) [19:30:47] (03PS1) 10Thcipriani: Group1 wikis to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372178 [19:31:21] (03CR) 10Thcipriani: [C: 032] Group1 wikis to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372178 (owner: 10Thcipriani) [19:32:47] (03PS1) 10Ppchelko: WIP: Increase max kafka message size [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:32:52] (03Merged) 10jenkins-bot: Group1 wikis to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372178 (owner: 10Thcipriani) [19:33:05] (03CR) 10jenkins-bot: Group1 wikis to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372178 (owner: 10Thcipriani) [19:33:12] (03CR) 10jerkins-bot: [V: 04-1] WIP: Increase max kafka message size [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [19:35:13] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.14 [19:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:13] !log thcipriani@tin Synchronized php: group1 wikis to 1.30.0-wmf.14 (duration: 00m 46s) [19:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:46] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio, 10Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3528856 (10MarcoAurelio) [19:43:20] (03PS1) 10Brian Wolff: Only retain private securepoll data for 60 days after election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372180 (https://phabricator.wikimedia.org/T173393) [19:44:05] PROBLEM - Host cp1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:41] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio, 10Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3528862 (10MarcoAurelio) Just to clarify: it is not that 
those renames are stuck, which they are, but that every global rename is becom... [19:47:05] RECOVERY - Host cp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:48:19] (03CR) 10Mobrovac: [C: 04-1] WIP: Increase max kafka message size (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [19:49:05] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3120: Connection refused [19:49:06] PROBLEM - Disk space on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:06] PROBLEM - Confd template for /var/lib/gdnsd/discovery-restbase-async.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:06] PROBLEM - Confd template for /var/lib/gdnsd/discovery-api-rw.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:06] PROBLEM - salt-minion processes on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:09] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528888 (10DStrine) @Pcoombe can you verify if this is working? 
[19:49:15] PROBLEM - Confd template for /var/lib/gdnsd/discovery-citoid.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:15] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3122: Connection refused [19:49:15] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-ro.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:15] PROBLEM - Check size of conntrack table on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:15] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3126: Connection refused [19:49:16] PROBLEM - Varnish traffic logger - varnishstatsd on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:16] PROBLEM - Varnish traffic logger - varnishrls on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:16] PROBLEM - eventlogging Varnishkafka log producer on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:25] PROBLEM - Freshness of OCSP Stapling files on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:25] PROBLEM - HTTPS Unified RSA on cp1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:49:25] PROBLEM - Webrequests Varnishkafka log producer on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:27] PROBLEM - Freshness of zerofetch successful run file on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:27] PROBLEM - Varnish HTCP daemon on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:27] PROBLEM - Check whether ferm is active by checking the default input chain on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:27] PROBLEM - Varnish traffic logger - varnishxcps on cp1008 is CRITICAL: 
Return code of 255 is out of bounds [19:49:27] PROBLEM - puppet last run on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:28] PROBLEM - Varnish traffic logger - varnishxcache on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:35] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-ro.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:35] PROBLEM - Confd vcl based reload on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:35] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:35] PROBLEM - HTTPS Unified ECDSA on cp1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:49:36] PROBLEM - Confd template for /var/lib/gdnsd/discovery-graphoid.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:36] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3124: Connection refused [19:49:36] PROBLEM - Confd template for /var/lib/gdnsd/discovery-api-ro.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:36] PROBLEM - Confd template for /var/lib/gdnsd/discovery-imagescaler-ro.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:38] okay who unplugged the server [19:49:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-trendingedits.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-mathoid.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-mobileapps.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-pdfrender.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:45] PROBLEM - configured eth on cp1008 is CRITICAL: 
Return code of 255 is out of bounds [19:49:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:46] PROBLEM - Check systemd state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:46] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3123: Connection refused [19:49:51] (03PS2) 10Ppchelko: Increase max kafka message size [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:49:55] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3125: Connection refused [19:49:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventstreams.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:55] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3128: Connection refused [19:49:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-imagescaler-rw.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:55] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 80: Connection refused [19:49:55] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1008 is CRITICAL: connect to address 208.80.154.42 and port 3127: Connection refused [19:49:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-parsoid.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:56] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kartotherian.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:56] PROBLEM - Confd template for /var/lib/gdnsd/discovery-restbase.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:57] PROBLEM - Confd template for /var/lib/gdnsd/discovery-recommendation-api.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:57] 
PROBLEM - Confd template for /var/lib/gdnsd/discovery-cxserver.state on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:49:58] PROBLEM - DPKG on cp1008 is CRITICAL: Return code of 255 is out of bounds [19:52:53] !log Manually cleaning up PI on enwiki (T173393) [19:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:28] PROBLEM - Host cp1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:25] RECOVERY - Host cp1008 is UP: PING WARNING - Packet loss = 54%, RTA = 0.88 ms [19:56:30] cp1008 is me! sorry for the noise [19:56:45] PROBLEM - HTTPS Unified RSA on cp1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:56:55] PROBLEM - HTTPS Unified ECDSA on cp1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:57:24] !log T169939: Rolling restart of Cassandra instances, eqiad, rack a [19:57:30] cmjohnson1: ok! [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:40] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [19:58:47] (03CR) 10Ppchelko: "@mobrovac heh, you were way too fast to make the review.." [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T2000). [20:00:25] PROBLEM - Check the NTP synchronisation status of timesyncd on cp1008 is CRITICAL: Return code of 255 is out of bounds [20:06:14] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528988 (10Pcoombe) 05Open>03Resolved Yes, and life is much easier. Thanks! 
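(Editor's note: the "Increase max kafka message size" puppet patch under review in this log (gerrit 372179) adjusts a broker-side size limit. As a rough, illustrative fragment only — the values and which exact settings the patch touches are assumptions, not taken from the change itself — raising the limit coherently involves the broker, replication, and producer settings together:)

```properties
# Illustrative Kafka sizing fragment (values are examples, not from the patch).
# Broker: largest record batch the broker will accept.
message.max.bytes=4194304
# Broker: must be >= message.max.bytes or replicas cannot fetch large batches.
replica.fetch.max.bytes=4194304
# Producer side: largest request a producer may send.
max.request.size=4194304
```

(If any one of these lags behind the others, oversized messages are rejected or silently fail to replicate, which is why such patches tend to attract careful review comments as seen above.)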
[20:10:25] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [20:10:26] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [20:10:35] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [20:10:36] RECOVERY - Freshness of OCSP Stapling files on cp1008 is OK: OK [20:10:45] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on cp1008 is OK: No errors detected [20:10:45] RECOVERY - Webrequests Varnishkafka log producer on cp1008 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [20:10:47] RECOVERY - HTTPS Unified RSA on cp1008 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345497 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2017-11-22 07:59:59 +0000 (expires in 97 days) [20:10:47] RECOVERY - Check whether ferm is active by checking the default input chain on cp1008 is OK: OK ferm input default policy is set [20:10:47] RECOVERY - Varnish HTCP daemon on cp1008 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd [20:10:48] RECOVERY - Varnish traffic logger - varnishxcache on cp1008 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcache, UID = 0 (root) [20:10:48] RECOVERY - Varnish traffic logger - varnishxcps on cp1008 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [20:10:48] RECOVERY - Freshness of zerofetch successful run file on cp1008 is OK: OK [20:10:48] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:10:55] RECOVERY - Confd template for /var/lib/gdnsd/discovery-swift-ro.state on cp1008 is OK: No errors detected [20:10:55] RECOVERY - Confd vcl based 
reload on cp1008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. [20:10:55] RECOVERY - HTTPS Unified ECDSA on cp1008 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345485 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2017-11-22 07:59:59 +0000 (expires in 97 days) [20:10:55] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-imagescaler-ro.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-api-ro.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-graphoid.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-trendingedits.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-mathoid.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-pdfrender.state on cp1008 is OK: No errors detected [20:10:57] RECOVERY - Confd template for /var/lib/gdnsd/discovery-mobileapps.state on cp1008 is OK: No errors detected [20:11:05] RECOVERY - Confd template for /var/lib/gdnsd/discovery-wdqs.state on cp1008 is OK: No errors detected [20:11:05] RECOVERY - configured eth on cp1008 is OK: OK - interfaces up [20:11:05] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [20:11:06] RECOVERY - MD RAID on cp1008 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [20:11:15] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [20:11:15] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1008 is OK: HTTP OK: HTTP/1.1 200 
OK - 176 bytes in 0.001 second response time [20:11:15] RECOVERY - Confd template for /var/lib/gdnsd/discovery-eventstreams.state on cp1008 is OK: No errors detected [20:11:15] RECOVERY - Confd template for /var/lib/gdnsd/discovery-imagescaler-rw.state on cp1008 is OK: No errors detected [20:11:15] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [20:11:15] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.001 second response time [20:11:15] RECOVERY - Confd template for /var/lib/gdnsd/discovery-kartotherian.state on cp1008 is OK: No errors detected [20:11:16] RECOVERY - Confd template for /var/lib/gdnsd/discovery-parsoid.state on cp1008 is OK: No errors detected [20:11:16] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time [20:11:17] RECOVERY - Confd template for /var/lib/gdnsd/discovery-restbase.state on cp1008 is OK: No errors detected [20:11:17] RECOVERY - Confd template for /var/lib/gdnsd/discovery-cxserver.state on cp1008 is OK: No errors detected [20:11:18] RECOVERY - Confd template for /var/lib/gdnsd/discovery-recommendation-api.state on cp1008 is OK: No errors detected [20:11:35] RECOVERY - Confd template for /var/lib/gdnsd/discovery-appservers-ro.state on cp1008 is OK: No errors detected [20:11:35] RECOVERY - Confd template for /var/lib/gdnsd/discovery-citoid.state on cp1008 is OK: No errors detected [20:11:35] RECOVERY - Check size of conntrack table on cp1008 is OK: OK: nf_conntrack is 1 % full [20:11:46] RECOVERY - Varnish traffic logger - varnishstatsd on cp1008 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [20:11:46] RECOVERY - eventlogging Varnishkafka log producer on cp1008 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf [20:11:46] RECOVERY - 
Varnish traffic logger - varnishrls on cp1008 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishrls, UID = 0 (root)
[20:11:47] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 2110846
[20:18:25] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:19:13] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[20:21:29] Operations, Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3529010 (Cmjohnson) a: ema @ema I replaced the ssd and reinstalled. All yours! Resolve once you've confirmed everything is okay.
[20:30:24] RECOVERY - Check the NTP synchronisation status of timesyncd on cp1008 is OK: OK: synced at Wed 2017-08-16 20:30:21 UTC.
[20:32:27] (CR) Mobrovac: Increase max kafka message size (2 comments) [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[20:32:50] (PS1) Urbanecm: Add one throttling exception [mediawiki-config] - https://gerrit.wikimedia.org/r/372189 (https://phabricator.wikimedia.org/T173444)
[20:36:27] (PS14) Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939)
[20:37:01] !log arlolra@tin Started deploy [parsoid/deploy@a9dc803]: Updating Parsoid to 1832a78e
[20:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:50] (PS2) Urbanecm: Add one throttling exception [mediawiki-config] - https://gerrit.wikimedia.org/r/372189 (https://phabricator.wikimedia.org/T173444)
[20:38:48] (PS16) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[20:40:07] Urbanecm: that dblist computed patch of ours is failing again on jenkins. Just rebased it; let's see what happens.
[20:40:20] (CR) jerkins-bot: [V: -1] [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[20:40:37] get lost jerkins-bot
[20:40:46] TabbyCat, will try to fix.
[20:40:54] In the meantime, can you have a look at T160491 for me?
[20:40:54] T160491: Update Wikiversity logos - https://phabricator.wikimedia.org/T160491
[20:41:05] 1) DbListTests::testComputedListsFreshness
[20:41:07] 20:40:19 Contents of 'securepollglobal' must match expansion of 'securepollglobal-computed'
[20:41:14] but they do match
[20:41:20] * TabbyCat no understand
[20:43:43] TabbyCat, they must be sorted alphabetically. Move loginwiki after lnwiktionary and it'll be okay :)
[20:44:12] Okay. Will do. I'll also exclude chapter wikis.
[20:44:12] In fact, jenkins is right, they do not match line by line.
[20:44:38] but I'm hungry, let me have some dinner
[20:44:54] can I haz dinnah
[20:44:57] ;)
[20:45:03] Sure :D
[20:45:44] (CR) Eevans: "Changes between 13 and 14:" [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[20:45:53] !log arlolra@tin Finished deploy [parsoid/deploy@a9dc803]: Updating Parsoid to 1832a78e (duration: 08m 52s)
[20:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:09] !log T169939: Rolling restart of Cassandra instances, eqiad, rack b
[20:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:21] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[20:48:32] (PS3) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[20:57:29] (PS4) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[20:59:59] (PS17) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[21:00:32] (CR) Eevans: "[PC](http://puppet-compiler.wmflabs.org/7468)" [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[21:07:21] (CR) Mobrovac: Increase max kafka message size (3 comments) [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[21:12:46] (PS5) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[21:17:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[21:18:25] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused
[21:19:25] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042
[21:21:28] !log thcipriani@tin Synchronized php: revert group1 wikis to 1.30.0-wmf.14 for T173462 (duration: 00m 47s)
[21:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:45] T173462: LinksUpdate::acquirePageLock: Cannot flush pre-lock snapshot because writes are pending - https://phabricator.wikimedia.org/T173462
[21:22:09] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: revert group1 wikis to 1.30.0-wmf.14 for T173462
[21:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:02] (PS1) Thcipriani: Revert "Group1 wikis to wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/372193
[21:24:05] (CR) Thcipriani: [C: 2] Revert "Group1 wikis to wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/372193 (owner: Thcipriani)
[21:25:24] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[21:25:35] (Merged) jenkins-bot: Revert "Group1 wikis to wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/372193 (owner: Thcipriani)
[21:26:21] (CR) jenkins-bot: Revert "Group1 wikis to wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/372193 (owner: Thcipriani)
[21:30:10] !log T169939: Rolling restart of Cassandra instances, eqiad, rack d
[21:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:20] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[21:33:05] !log deleting private info in enwiki arbcom1_vote table (T173393)
[21:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:10] (CR) Mobrovac: [C: -1] Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster (4 comments) [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[21:36:43] (CR) MarcoAurelio: "I'm adding Bawolff here since he's deleting old private data stored from old board elections. Here we limit the wikis where voting could o" [mediawiki-config] - https://gerrit.wikimedia.org/r/371926 (owner: MarcoAurelio)
[21:37:28] !log train is on hold pending resolution of T173462
[21:37:38] email Soon™
[21:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:40] T173462: LinksUpdate::acquirePageLock: Cannot flush pre-lock snapshot because writes are pending - https://phabricator.wikimedia.org/T173462
[21:38:05] !log deleting private info from securepoll_votes that the script missed due to ref-integ issues for old elections (T173393)
[21:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:45] (PS1) Urbanecm: Change $wgArticleCountMethod to any for srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974)
[21:41:31] (CR) Urbanecm: "mwscript updateArticleCount.php --wiki=srwikiquote --update" [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974) (owner: Urbanecm)
[21:43:04] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[21:43:08] (PS6) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[21:43:54] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[21:46:35] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[21:46:55] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[21:47:05] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[21:47:14] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[21:47:14] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[21:47:34] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[21:47:55] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74051 bytes in 0.403 second response time
[21:48:14] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[21:48:14] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74051 bytes in 0.863 second response time
[21:48:14] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.165 second response time
[21:48:34] (PS7) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[21:48:35] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.050 second response time
[21:48:35] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.124 second response time
[21:48:44] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[21:49:35] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[21:58:36] (CR) Mobrovac: [C: -1] Increase max kafka message size (1 comment) [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[22:04:11] (PS8) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[22:08:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[22:13:19] !log T169939: Rolling restart of Cassandra complete
[22:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:32] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[22:14:18] !log T169939: Cleaning up wikipedia parsoid snapshots
[22:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:01] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3529359 (Eevans)
[22:26:26] (PS9) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[22:28:54] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[22:30:13] !log T169939: Decommissioning Cassandra/restbase2001-a.codfw.wmnet
[22:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:25] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[22:33:57] (PS10) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[22:34:03] (PS3) Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - https://gerrit.wikimedia.org/r/370103
[22:39:29] (PS1) Gilles: Revert "thumbor: fix connections-per-backend in nginx" [puppet] - https://gerrit.wikimedia.org/r/372199
[22:40:16] (PS11) Ppchelko: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179
[22:45:08] Operations, Performance-Team, Thumbor, User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3529449 (Gilles) I've filed a revert for the 1-connection-per-backend: https://gerrit.wikimedia.org/r/#/c/372199/ On Vagrant, while repr...
[22:59:04] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170816T2300).
[23:08:03] Operations, Wikimedia-Site-requests, User-MarcoAurelio, Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3527677 (Legoktm) Hmm, this is weird. I don't see any relevant exceptions in exception.log for the past two days. And when I run: ```...
[23:11:03] (PS12) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[23:11:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[23:12:41] Operations, Electron-PDFs, OfflineContentGenerator, Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3529495 (GWicke) The latest in headless Chrome wrapping technologies seems to be https://github.com/G...
[23:14:12] (PS13) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[23:14:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[23:19:32] (PS1) Dzahn: librenms: no https/cert monitoring on inactive server [puppet] - https://gerrit.wikimedia.org/r/372205 (https://phabricator.wikimedia.org/T172712)
[23:20:14] (PS14) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[23:21:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[23:23:24] (CR) Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster (4 comments) [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[23:23:51] (PS15) Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939)
[23:27:14] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[23:28:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[23:30:43] (PS15) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[23:34:52] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/7484/" [puppet] - https://gerrit.wikimedia.org/r/372205 (https://phabricator.wikimedia.org/T172712) (owner: Dzahn)
[23:36:49] (PS16) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[23:53:28] (PS1) Dzahn: admins: add new ssh key for dzahn [puppet] - https://gerrit.wikimedia.org/r/372210
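
[Editor's note] The DbListTests::testComputedListsFreshness failure discussed at 20:41 comes down to a line-by-line comparison: the static dblist must equal the expansion of its computed counterpart, and since the expansion is emitted in alphabetical order, the static file must be sorted too (hence "Move loginwiki after lnwiktionary"). A minimal sketch of that check, using hypothetical file contents rather than the real test code from mediawiki-config:

```python
def parse_dblist(text):
    """A dblist is one wiki database name per line; skip blanks and # comments."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")]

# Static list as committed: 'loginwiki' misplaced before 'lnwiktionary'
# (hypothetical contents, mirroring the fix suggested in the channel).
static = parse_dblist("enwiki\nloginwiki\nlnwiktionary\n")

# The expansion of the computed list comes out in sorted order.
computed = sorted(parse_dblist("lnwiktionary\nenwiki\nloginwiki\n"))

# The test requires a line-by-line match, so an unsorted static list fails
# even though both lists contain the same wikis.
print(static == computed)            # False: order differs
print(sorted(static) == computed)    # True once the static list is re-sorted
```

Note that `ln` sorts before `lo`, which is why `loginwiki` belongs after `lnwiktionary` despite both starting with "l".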