[00:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:06:56] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/carbon] [00:12:42] (03CR) 10Aaron Schulz: [C: 031] proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [00:16:44] (03CR) 10Faidon Liambotis: [C: 04-1] "Yeah... I vaguely recall jessie's GnuTLS being quite behind on cipher support (jessie's OpenSSL was too, after all) and was waiting for th" [puppet] - 10https://gerrit.wikimedia.org/r/335232 (owner: 10BBlack) [00:25:56] (03PS1) 10RobH: correction of ulsfo mgmt asset tags [dns] - 10https://gerrit.wikimedia.org/r/386559 [00:27:10] (03CR) 10RobH: [C: 032] correction of ulsfo mgmt asset tags [dns] - 10https://gerrit.wikimedia.org/r/386559 (owner: 10RobH) [00:33:01] (03PS1) 10RobH: setting new misc systems in ulsfo mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/386561 [00:34:00] (03PS2) 10RobH: setting new misc systems in ulsfo mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/386561 [00:34:02] (03CR) 10RobH: [C: 032] setting new misc systems in ulsfo mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/386561 (owner: 10RobH) [00:42:33] (03CR) 10Dereckson: [C: 031] Fixing interwiki sort order for Northern Sami [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [00:45:51] (03CR) 10Dereckson: [C: 031] "@Jon This change is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [00:46:29] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (10RobH) [00:46:40] 10Operations, 10ops-ulsfo, 10Traffic: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592#3711660 (10RobH) [00:46:41] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (10RobH) [00:46:56] (03CR) 10Dereckson: [C: 04-1] "Blocked on community consensus (per task comment)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750) (owner: 10GeoffreyT2000) [00:48:46] (03CR) 10Dereckson: [C: 04-1] "Technically -1 also: CommonsSettings.php (CS) goal is to do the configure step, but the value per database should be in InitialiseSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750) (owner: 10GeoffreyT2000) [00:49:04] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711661 (10RobH) [00:49:16] (03CR) 10Dereckson: [C: 04-1] "(and then in CS, you can refer to this variable as $wmg...)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750) (owner: 10GeoffreyT2000) [00:51:13] (03CR) 10Jon Harald Søby: "@Dereckson Thanks! The one tomorrow 13–14 UTC is good for me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [01:05:06] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508979896 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4039962 keys, up 4 minutes 53 seconds - replication_delay is 1508979896 [01:05:06] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:06:06] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4031196 keys, up 5 minutes 53 seconds - replication_delay is 0 [01:06:06] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4030872 keys, up 5 minutes 53 seconds - replication_delay is 0 [01:27:50] (03PS1) 10Jon Harald Søby: Enabling signature button in Projekt namespace on sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) [01:40:01] (03CR) 10Dereckson: [C: 031] Enabling signature button in Projekt namespace on sewikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) (owner: 10Jon Harald Søby) [01:42:27] (03PS2) 10Jon Harald Søby: Enable signature button in Projekt namespace on sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) [01:44:12] (03PS2) 10Jon Harald Søby: Fix interwiki sort order for Northern Sami [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) [01:53:37] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.136 second response time [02:20:47] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.131 second response time [02:31:47] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 10m 34s) [02:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:31] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.5) (duration: 09m 30s) [02:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 26 03:01:19 UTC 2017 (duration 6m 48s) [03:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 633.33 seconds [04:21:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.48 seconds [04:36:56] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:06:56] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:09:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386575 (https://phabricator.wikimedia.org/T174509) [05:11:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386575 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:12:12] !log Optimize pagelinks and templatelinks on db1060 - T174509 [05:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:22] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [05:12:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386575 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [05:14:15] !log Optimize recentchanges on s4 and s6 on labsdb1009 - T177772 [05:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:25] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772 [05:14:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T174509 (duration: 00m 50s) [05:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:17] (03PS1) 10Marostegui: labsdb-replica: Increase table cache definition [puppet] - 10https://gerrit.wikimedia.org/r/386576 (https://phabricator.wikimedia.org/T179041) [05:29:51] (03CR) 10Marostegui: [C: 032] labsdb-replica: Increase table cache definition [puppet] - 10https://gerrit.wikimedia.org/r/386576 (https://phabricator.wikimedia.org/T179041) (owner: 10Marostegui) [05:32:53] marostegui: wow thanks for the fast patch [05:52:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [05:57:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 8 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:15:22] !log upgrade mw1299-mw1311 to wikidiff2 1.5.1 [06:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:27] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.344 second response time [06:40:03] zhuyifei1999_: you are welcome! [06:50:14] (03CR) 10Gehel: [C: 04-1] "see comments inline and ping me for a chat!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386536 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [06:50:16] (03PS1) 10Marostegui: Revert "db-codfw.php: Repool db2038" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386577 [06:50:18] (03PS2) 10Marostegui: Revert "db-codfw.php: Repool db2038" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386577 [06:51:45] !log Optimize recentchanges on s4 and s6 on labsdb1010 - T177772 [06:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:54] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772 [06:54:46] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Repool db2038" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386577 (owner: 10Marostegui) [06:55:58] !log Optimize pagelinks and templatelinks on db1095 s1 - T174509 [06:56:03] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Repool db2038" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386577 (owner: 10Marostegui) [06:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:07] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [06:57:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2038 (duration: 00m 50s) [06:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:33] !log Stop MySQL on db2088 and db2084 to copy s2 and s4 to db2091 - T178359 [06:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:40] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [07:00:37] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.258 second response time [07:05:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386578 (https://phabricator.wikimedia.org/T164488) [07:10:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386578 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [07:10:50] !log upgrade mw1319-mw1328 to wikidiff2 1.5.1 [07:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386578 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [07:12:56] !log Stop replication in sync on db1103 and db1077 to fix data drifts - T164488 [07:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:04] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [07:13:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 - T164488 (duration: 00m 50s) [07:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:26] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [07:17:36] PROBLEM - HHVM rendering on mw1322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:17:36] PROBLEM - Apache HTTP on mw1321 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 
bytes in 0.001 second response time [07:17:36] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time [07:17:46] PROBLEM - Apache HTTP on mw1322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:17:47] PROBLEM - Nginx local proxy to apache on mw1323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [07:17:57] PROBLEM - Apache HTTP on mw1323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:18:07] PROBLEM - HHVM rendering on mw1321 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:18:07] PROBLEM - HHVM rendering on mw1323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [07:18:46] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.203 second response time [07:18:57] PROBLEM - HHVM processes on mw1321 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [07:19:07] RECOVERY - HHVM rendering on mw1321 is OK: HTTP OK: HTTP/1.1 200 OK - 76789 bytes in 0.781 second response time [07:19:07] RECOVERY - HHVM rendering on mw1323 is OK: HTTP OK: HTTP/1.1 200 OK - 76790 bytes in 1.072 second response time [07:19:27] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.952 second response time [07:19:36] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.052 second response time [07:19:36] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.371 second response time [07:19:37] RECOVERY - HHVM rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 76791 bytes in 3.537 second response time [07:19:46] RECOVERY - Nginx local proxy to apache on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time [07:19:57] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [07:19:57] RECOVERY - HHVM processes on mw1321 is OK: PROCS OK: 6 processes with command name hhvm [07:20:44] sorry, these are depooled. 
downtime expired [07:23:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386581 [07:25:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386581 (owner: 10Marostegui) [07:27:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386581 (owner: 10Marostegui) [07:28:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 - T164488 (duration: 00m 50s) [07:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:47] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [07:37:13] (03PS1) 10Tpt: Disables TwoColConflict waiting for compatibility with ProofreadPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386582 (https://phabricator.wikimedia.org/T179056) [07:43:37] !log Stop MySQL on db1047 to copy data over db1108 - T177405 T156844 [07:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:46] T177405: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405 [07:43:46] T156844: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844 [07:48:07] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [07:48:46] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [07:53:07] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [07:53:47] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [07:56:50] ^ that is expected as we are dealing with db1047 [07:57:07] these annoying analytics people [07:57:11] :P [07:57:41] !log Drop databases in s1 and s2 from db1047 and unconfigure replication - T177405 [07:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:49] T177405: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405 [08:14:58] !log upgrade deployment servers to wikidiff2 1.5.1 [08:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:27] RECOVERY - Disk space on graphite1003 is OK: DISK OK [08:22:03] !log add 100G to graphite1003's /var/lib/carbon [08:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:33] (03PS1) 10Elukey: profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) [08:28:58] (03CR) 10jerkins-bot: [V: 04-1] profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [08:31:06] (03PS2) 10Elukey: profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) [08:34:34] !log uploaded php-wikidiff2/hhvm-wikidiff2 1.5.1 for stretch-wikimedia [08:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:12] (03PS1) 10Filippo Giunchedi: graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) [08:41:55] (03PS1) 10Ladsgroup: mediawiki: Disable rebuildTermSqlIndex [puppet] - 
10https://gerrit.wikimedia.org/r/386588 (https://phabricator.wikimedia.org/T163551) [08:45:26] (03CR) 10Marostegui: [C: 031] mediawiki: Disable rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/386588 (https://phabricator.wikimedia.org/T163551) (owner: 10Ladsgroup) [08:50:37] (03PS2) 10Ema: cache: set timeout_idle on text and upload [puppet] - 10https://gerrit.wikimedia.org/r/385985 (https://phabricator.wikimedia.org/T159429) [08:50:45] (03CR) 10Ema: [V: 032 C: 032] cache: set timeout_idle on text and upload [puppet] - 10https://gerrit.wikimedia.org/r/385985 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [08:53:42] (03PS3) 10Elukey: profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) [08:58:27] !log cache_text/upload varnish-be: set rutime parameter timeout_idle to 120s [08:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:47] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/8470/" [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:01:20] !log upgrade script runners to wikidiff2 1.5.1 [09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:37] 10Operations, 10Traffic, 10Patch-For-Review: Allow setting varnish connection timeouts in puppet - https://phabricator.wikimedia.org/T159429#3711925 (10ema) 05Open>03Resolved a:03ema All varnish runtime parameters can now be specified with the `profile::cache::base::be_runtime_params` hiera setting. [09:03:15] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3711928 (10MoritzMuehlenhoff) [09:03:37] (03CR) 10Marostegui: [C: 031] profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:04:54] (03CR) 10Elukey: [C: 032] profile::mariadb::misc::eventlogging::database: set correct mysql params [puppet] - 10https://gerrit.wikimedia.org/r/386586 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:04:59] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3674094 (10MoritzMuehlenhoff) @Tobi_WMDE_SW , @Addshore : wikidiff2 1.5.1 is now rolled out in production across all our mediawiki, you can pro... 
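Background on the timeout_idle change logged at 08:58 above: timeout_idle is a Varnish runtime parameter, and per T159429 the supported fleet-wide mechanism is now the profile::cache::base::be_runtime_params hiera setting. Purely as an illustrative sketch, the ad-hoc per-host equivalent with varnishadm would look roughly like the following (instance selection is glossed over here; WMF cache hosts run separate frontend/backend varnish instances):
  # show the current value and built-in documentation for the parameter
  sudo varnishadm param.show timeout_idle
  # set it at runtime; not persistent across a varnish restart, hence the puppet/hiera route
  sudo varnishadm param.set timeout_idle 120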
[09:05:24] (03CR) 10Marostegui: [C: 032] mediawiki: Disable rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/386588 (https://phabricator.wikimedia.org/T163551) (owner: 10Ladsgroup) [09:05:30] (03PS2) 10Marostegui: mediawiki: Disable rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/386588 (https://phabricator.wikimedia.org/T163551) (owner: 10Ladsgroup) [09:06:55] (03PS1) 10Hoo man: Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T178624) [09:07:28] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3711938 (10Addshore) [09:07:54] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3674094 (10Addshore) [09:08:37] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3674094 (10Addshore) >>! In T177891#3711928, @MoritzMuehlenhoff wrote: > @Tobi_WMDE_SW , @Addshore : wikidiff2 1.5.1 is now rolled out in produ... [09:10:06] (03PS9) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [09:17:02] (03CR) 10Muehlenhoff: [C: 032] Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [09:18:34] (03PS11) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [09:18:57] (03CR) 10Ema: VCL: Exp cache admission policy for varnish-fe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [09:21:54] (03CR) 10Mobrovac: [C: 031] graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) (owner: 10Filippo Giunchedi) [09:36:21] (03PS1) 10Ema: cache: move frontend memory cache sizing to varnish::common [puppet] - 10https://gerrit.wikimedia.org/r/386593 [09:39:57] (03CR) 10Elukey: [C: 031] "Checked with Moritz what mw servers include mediawiki::web, seems good (jobrunners/videoscalers don't so it doesn't match with mw*)." 
[puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff) [09:46:26] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2001858 [09:50:40] (03PS2) 10Hoo man: Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) [09:52:17] (03PS1) 10Ladsgroup: Add property for RDF mapping of external identifiers for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386594 (https://phabricator.wikimedia.org/T178180) [09:54:45] (03CR) 10Ema: [C: 032] cache: move frontend memory cache sizing to varnish::common [puppet] - 10https://gerrit.wikimedia.org/r/386593 (owner: 10Ema) [10:08:57] (03PS12) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [10:20:45] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler02/8475/" [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [10:24:49] 10Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3712033 (10Marostegui) [10:25:34] 10Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#2714228 (10Marostegui) [10:29:07] (03PS1) 10Marostegui: db-eqiad.php: Clean up comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386600 [10:30:09] (03PS2) 10Marostegui: db-eqiad.php: Clean up comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386600 [10:32:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clean up comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386600 (owner: 10Marostegui) [10:34:55] (03Merged) 10jenkins-bot: db-eqiad.php: Clean up comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386600 (owner: 10Marostegui) [10:36:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Clean up some comments in s3 (duration: 00m 50s) [10:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:26] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [10:41:14] (03PS4) 10Muehlenhoff: Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) [10:41:59] (03CR) 10Muehlenhoff: [C: 032] Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff) [10:43:56] (03PS1) 10Muehlenhoff: Add upstream bug references [puppet] - 10https://gerrit.wikimedia.org/r/386601 [10:44:40] (03CR) 10Muehlenhoff: [C: 032] Add upstream bug references [puppet] - 10https://gerrit.wikimedia.org/r/386601 (owner: 10Muehlenhoff) [10:47:36] RECOVERY - nutcracker port on labweb1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:47:47] RECOVERY - nutcracker process on labweb1001 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [10:54:47] (03PS2) 10Filippo Giunchedi: graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) [10:57:06] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:57:37] RECOVERY - nutcracker port on labweb1002 is OK: TCP 
OK - 0.000 second response time on 127.0.0.1 port 11212 [10:58:26] RECOVERY - nutcracker process on labweb1002 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [11:07:03] (03PS1) 10Filippo Giunchedi: hieradata: expand SMART health check rollout in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) [11:09:10] 10Operations, 10Puppet, 10User-Joe: Puppet: Error: Evaluation Error: Error while evaluating a Function Call, undefined local variable or method `known_resource_types' - https://phabricator.wikimedia.org/T179033#3712137 (10Joe) I feared this would happen. `require_package` (and `role` as well) do deep in the... [11:09:35] 10Operations, 10Puppet, 10User-Joe: Puppet: Error: Evaluation Error: Error while evaluating a Function Call, undefined local variable or method `known_resource_types' - https://phabricator.wikimedia.org/T179033#3712138 (10Joe) p:05Normal>03High a:03Joe [11:31:47] RECOVERY - HHVM rendering on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76774 bytes in 0.234 second response time [11:32:06] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational [11:32:26] RECOVERY - Apache HTTP on labweb1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 0.048 second response time [11:32:30] ^ labweb1001 is me, doing some stretch compat tests/debugging [11:40:27] (03PS13) 10Mobrovac: Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) [11:41:57] (03CR) 10jerkins-bot: [V: 04-1] Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [11:44:00] (03CR) 10Mobrovac: "Tested against all SCB services and RB, works like advertised." [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [11:53:26] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:53:37] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.137 second response time [12:16:05] 10Operations, 10Parsoid, 10Traffic, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3712186 (10Deskana) >>! In T178778#3707527, @PlanetKrypton wrote: > @Deskana I had plugin temporarily disabled so people didn't try to use it. Try no... 
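The recurring "All k8s worker nodes are healthy" alert above is a plain HTTP string check against the toolschecker endpoint quoted in the alert text: Icinga expects the string OK in the response from http://checker.tools.wmflabs.org/k8s/nodes/ready. A minimal sketch of reproducing the check by hand (the curl flags are illustrative, not the actual check_http invocation):
  # -f makes curl fail on HTTP errors such as the 503 seen above; then look for the literal OK marker
  curl -sf http://checker.tools.wmflabs.org/k8s/nodes/ready | grep -q OK \
    && echo 'k8s worker nodes healthy' || echo 'check failed'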
[12:20:47] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.174 second response time [12:29:55] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet [12:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:42] !log powercycle cp4024 (depooled) [12:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:47] RECOVERY - Host cp4024 is UP: PING WARNING - Packet loss = 86%, RTA = 79.27 ms [12:32:47] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [12:32:47] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [12:32:47] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [12:32:47] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [12:32:56] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [12:32:56] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [12:32:57] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [12:32:57] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [12:32:57] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [12:33:06] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [12:33:07] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [12:33:07] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [12:33:17] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [12:33:17] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [12:33:17] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [12:33:17] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [12:33:17] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [12:33:17] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [12:33:26] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [12:33:36] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [12:33:37] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [12:33:46] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [12:33:46] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [12:33:46] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [12:33:46] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [12:33:46] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [12:33:47] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 114 ESP OK [12:40:59] (03PS14) 10Mobrovac: Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) [12:41:46] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:53] (03CR) 10Eevans: [C: 031] graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) (owner: 10Filippo Giunchedi) [12:42:06] (03CR) 10jerkins-bot: [V: 04-1] Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [12:42:13] (03PS1) 10Marostegui: install_server: Reimage db2086 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/386606 (https://phabricator.wikimedia.org/T178359) [12:43:44] (03PS15) 10Mobrovac: Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 
10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) [12:45:26] (03CR) 10jerkins-bot: [V: 04-1] Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [12:46:46] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:46:47] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:46:56] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:46:58] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:07] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:10] (03PS1) 10Marostegui: db-codfw.php: Add db2088 and db2084 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386607 (https://phabricator.wikimedia.org/T178359) [12:47:16] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:16] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:17] (03PS16) 10Mobrovac: Improve the checking procedure and emit better messages; v0.1.4 [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) [12:47:17] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:17] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:21] (03CR) 10Marostegui: [C: 032] install_server: Reimage db2086 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/386606 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [12:47:26] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:26] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:26] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:27] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:27] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:37] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:46] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:47] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:47] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:49] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:56] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [12:47:56] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:56] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:57] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:57] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan 
CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:47:57] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:47:57] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [12:48:06] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [12:52:27] (03PS1) 10Gehel: logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) [12:54:19] 10Operations, 10Cloud-Services: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3712209 (10chasemp) Thank you @robh :) [12:54:47] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add db2088 and db2084 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386607 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [12:56:29] (03Merged) 10jenkins-bot: db-codfw.php: Add db2088 and db2084 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386607 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [12:57:02] (03CR) 10Mobrovac: [C: 031] "looks sensible to me" [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) (owner: 10Gehel) [12:57:47] jouncebot: next [12:57:47] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1300) [12:57:54] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Reorganize future rc multi-instance hosts - T178359 (duration: 00m 50s) [12:57:58] right on time :-) [12:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:02] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [12:58:06] (03CR) 10Ladsgroup: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [12:59:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 [12:59:38] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1300). Please do the needful. [13:00:04] Jhs and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] I can SWAT today! [13:00:25] o/ [13:00:42] Jhs: around for SWAT? [13:00:50] (03PS2) 10Gehel: logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) [13:02:14] Amir1: looks like Jhs is not around, so I will start with your patches [13:02:33] Awesome! [13:03:04] I'm around zeljkof ! [13:03:29] Jhs: ok, in that case starting with you :) cc Amir1 [13:03:36] :) [13:03:45] what do i need to do? This is my first time doing this [13:03:56] Jhs: I am reviewing 386324, I will let you know when it's at mwdebug1002 [13:04:01] (Y) [13:04:03] do you know how to test there? 
[13:04:09] nope [13:04:12] :) [13:04:18] let me see, there are docs [13:05:18] Jhs: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [13:05:32] okay :) [13:05:41] you have to install a browser extension that will let you access just one debug machine [13:06:03] I will deploy first to that machine, you can test there, if it's all good, I will deploy to the entire cluster [13:06:16] Jhs: any questions? [13:06:37] probably many. have installed extension [13:07:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [13:07:49] Jhs: 386324 is being merged, you can follow the progress here: https://integration.wikimedia.org/zuul/ [13:08:01] when merged I will deploy the change to mwdebug1002 [13:08:12] you can test the change there using the browser addon [13:08:26] will let you know in a minute or two when it's ready [13:08:33] (03Merged) 10jenkins-bot: Fix interwiki sort order for Northern Sami [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [13:08:46] zeljkof, Okay, so what I need to do is use the extension with mwdebug1002.eqiad.wmnet and then visit a page where it can be tested? [13:08:59] Jhs: correct [13:09:09] the extension has on/off button [13:09:26] when it's on, you can select the site, you should select mwdebug1002, that's where the change will be [13:09:47] then just do whatever you need to check that the change works and did not break something else [13:09:58] ok [13:10:38] Jhs: 386324 is at mwdebug1002 [13:11:01] let me know if you need more that few minutes to test [13:11:07] and let me know when you are done [13:11:44] zeljkof, seems to work as intended (Y) [13:12:00] Jhs: great, deploying then [13:12:24] yay [13:13:39] !log zfilipin@tin Synchronized wmf-config/InterwikiSortOrders.php: SWAT: [[gerrit:386324|Fix interwiki sort order for Northern Sami (T178965)]] (duration: 00m 50s) [13:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:46] T178965: Northern Sami needs to be moved in the default interwiki sort orders - https://phabricator.wikimedia.org/T178965 [13:13:52] Jhs: as you can see in the line above, it's deployed :) [13:14:11] great! [13:14:22] you can disable the extension and check again, it's on production now [13:14:29] I will start with the next commit [13:14:30] (03PS3) 10Gehel: logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) [13:14:40] I will let you know when it's at mwdebug [13:15:14] works in production too, yeah [13:16:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) (owner: 10Jon Harald Søby) [13:16:23] Jhs: 386567 is getting merged ^ [13:17:09] (03Merged) 10jenkins-bot: Enable signature button in Projekt namespace on sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) (owner: 10Jon Harald Søby) [13:18:08] zeljkof, still mwdebug1002 i assume? [13:18:20] Jhs: yes, just deployed there, please check [13:20:07] zeljkof, works as intended (Y) [13:20:15] Jhs: ok, deploying [13:20:53] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:21:16] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:386567|Enable signature button in Projekt namespace on sewikimedia (T175363)]] (duration: 00m 50s) [13:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:24] T175363: Allow for Visual Editor signing in the Projekt namespace on se.wikimedia - https://phabricator.wikimedia.org/T175363 [13:21:28] !log Compress InnoDB on db2040 - T178359 [13:21:32] Jhs: deployed ^ please check [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [13:21:50] Amir1: reviewing 386594 [13:22:32] zeljkof: not testable [13:22:49] I need to run a maintenance script after that, if that's okay [13:23:12] zeljkof, it works. So now I should just close the tasks, there's nothing more that needs to be done by me? [13:23:35] Jhs: if there is nothing else to do in the task, feel free to close it [13:24:01] great! thanks for the help & patience :) [13:24:02] Jhs: thanks for deploying with #releng! :) (#wikimedia-releng) [13:24:27] :) [13:24:54] Amir1: ok, so I deploy without mwdebug? will the script be quick or slow? :) [13:24:55] godog: https://phabricator.wikimedia.org/T177216#3712258, it sounds using a special label topic="__all__" might be a bad idea then, right? [13:25:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386594 (https://phabricator.wikimedia.org/T178180) (owner: 10Ladsgroup) [13:25:18] zeljkof: yes, the script will be fast [13:25:50] Amir1: do you want to deploy everything yourself? (just checking, I can deploy) [13:26:01] ottomata: yeah it'd have the same problem [13:26:21] nah, Other things are straightforward [13:26:30] (03Merged) 10jenkins-bot: Add property for RDF mapping of external identifiers for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386594 (https://phabricator.wikimedia.org/T178180) (owner: 10Ladsgroup) [13:26:52] ok [13:28:04] !log zfilipin@tin Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:386594|Add property for RDF mapping of external identifiers for Wikidata (T178180)]] (duration: 00m 50s) [13:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:11] T178180: Enable RDF mapping for external identifiers for Wikidata.org - https://phabricator.wikimedia.org/T178180 [13:28:14] 10Operations, 10ChangeProp, 10RESTBase, 10Wikimedia-Logstash, and 2 others: RB and CP logs disappeared from Logstash - https://phabricator.wikimedia.org/T179058#3712268 (10Gehel) [13:28:19] !log installing icu security update on trusty [13:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:27] Amir1: 386594 is deployed, I will start with the next commit while you run the script [13:28:28] (03PS3) 10Filippo Giunchedi: graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) [13:28:40] Thanks [13:28:49] (03PS1) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [13:29:22] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:29:25] (03CR) 
10Filippo Giunchedi: [C: 032] graphite: cleanup cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/386587 (https://phabricator.wikimedia.org/T179057) (owner: 10Filippo Giunchedi) [13:31:04] PROBLEM - puppet last run on labtestneutron2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:31:55] (03PS1) 10Muehlenhoff: Add library hint for ICU [puppet] - 10https://gerrit.wikimedia.org/r/386613 [13:32:42] (03PS1) 10Giuseppe Lavagetto: require_package: puppet 4.x compatibility [puppet] - 10https://gerrit.wikimedia.org/r/386614 (https://phabricator.wikimedia.org/T179033) [13:32:44] !log force delete old graphite cassandra metrics - T179057 [13:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:52] T179057: Cleanup stale cassandra graphite metrics - https://phabricator.wikimedia.org/T179057 [13:34:55] (03CR) 10Gehel: [C: 031] "trivial enough" [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [13:38:16] 10Operations, 10Kubernetes, 10User-fgiunchedi: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3712284 (10fgiunchedi) [13:38:37] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe, 10User-fgiunchedi: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3657156 (10fgiunchedi) [13:40:25] (03CR) 10Gehel: [C: 031] "LGTM. It looks more important to cleanup this logging code than to not break compatibility with existing clients at this point in the proj" [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) (owner: 10Volans) [13:41:08] !log delete old CF cassandra metrics - T173436 [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:19] T173436: Delete graphite metrics for old CFs - https://phabricator.wikimedia.org/T173436 [13:41:29] zeljkof: I'm still working on it [13:41:49] it seems there is a problem but doesn't require rollback [13:42:11] (03CR) 10Alexandros Kosiaris: [C: 031] "I am guessing an extended catalog compilation test is in order, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/386614 (https://phabricator.wikimedia.org/T179033) (owner: 10Giuseppe Lavagetto) [13:42:36] (03CR) 10Muehlenhoff: [C: 032] Add library hint for ICU [puppet] - 10https://gerrit.wikimedia.org/r/386613 (owner: 10Muehlenhoff) [13:42:39] Amir1: ok, 386598 just merged, should I deploy it to mwdebug? or wait? [13:42:53] mwdebug would be great [13:45:13] (03CR) 10Anomie: "This should work for most if not all code in MediaWiki itself. 
The settings here are intended to be passed into the LBFactory class and ar" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: 10Marostegui) [13:46:23] !log ladsgroup@terbium:~$ mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildPropertyInfo.php --wiki=wikidatawiki --rebuild-all --force (T178180) [13:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:33] T178180: Enable RDF mapping for external identifiers for Wikidata.org - https://phabricator.wikimedia.org/T178180 [13:46:36] Amir1: 386598 is at mwdebug1002 [13:48:13] zeljkof: works just fine [13:49:55] Amir1: ok, deploying [13:50:51] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:52:55] !log zfilipin@tin Synchronized php-1.31.0-wmf.5/extensions/Wikidata: SWAT: [[gerrit:386598|Make search for titles be always uppercase (T179045)]] (duration: 02m 10s) [13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] T179045: Wikibase prefix search for IDs is case sensitive - https://phabricator.wikimedia.org/T179045 [13:54:03] Amir1: I have deployed php-1.31.0-wmf.5/extensions/Wikidata but now I see all files from are 386598 actually in extensions/Wikibase, so I am confused [13:54:06] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: Delete graphite metrics for old CFs - https://phabricator.wikimedia.org/T173436#3712310 (10fgiunchedi) 05Open>03Resolved All done! [13:55:04] Amir1: should I deploy extensions/Wikibase? [13:55:48] Amir1: 386548 will be merged soon so I will deploy Wikibase anyway... I am still confused :) [13:56:19] akosiaris gehel marostegui volans heads up, I got https://gerrit.wikimedia.org/r/#/c/386603/ out to extend the smart health check deployment, if you could take a look that'd be awesome! [13:56:38] zeljkof: the wikibase change is just for consistency [13:56:44] otherwise it's not needed [13:57:24] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:57:28] godog: done ;) [13:57:33] Amir1: it still makes no sense to me :) [13:57:44] volans: thanks! [13:57:46] to me neither :D [13:58:18] Amir1: 386548 is merged, do I need to sync it to mwdebug, or deploy immediately? [13:58:21] RECOVERY - puppet last run on labtestneutron2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:58:29] deploy immediately please, it's not used at all [13:58:29] (03CR) 10Marostegui: [C: 031] hieradata: expand SMART health check rollout in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:58:38] Amir1: ok [13:58:46] (03CR) 10Gehel: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:59:21] Amir1: all good with wikidata deploy? and the config one? [14:00:42] yeah [14:00:44] thanks :) [14:02:12] ObjectCache::getInstance( $wgMainCacheType )->get( $wgWBRepoSettings['sharedCacheKeyPrefix'] . 
':CacheAwarePropertyInfoStore' ); [14:02:15] did the trick [14:03:29] s/get/delete/ obviously [14:03:48] !log zfilipin@tin Synchronized php-1.31.0-wmf.5/extensions/Wikibase: SWAT: [[gerrit:386548|Make search for titles be always uppercase (T179045)]] (duration: 01m 36s) [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:56] T179045: Wikibase prefix search for IDs is case sensitive - https://phabricator.wikimedia.org/T179045 [14:04:07] Amir1: 386548 is deployed, thanks for deploying with #releng ;) [14:04:13] !log EU SWAT finished [14:04:15] Thank you! [14:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:24] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 [14:05:53] (03PS2) 10Filippo Giunchedi: hieradata: expand SMART health check rollout in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) [14:06:11] ACKNOWLEDGEMENT - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black fallout from cp4024 death in https://phabricator.wikimedia.org/T174891 [14:06:12] ACKNOWLEDGEMENT - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black fallout from cp4024 death in https://phabricator.wikimedia.org/T174891 [14:06:12] ACKNOWLEDGEMENT - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black fallout from cp4024 death in https://phabricator.wikimedia.org/T174891 [14:06:12] ACKNOWLEDGEMENT - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black fallout from cp4024 death in https://phabricator.wikimedia.org/T174891 [14:06:12] ACKNOWLEDGEMENT - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black fallout from cp4024 death in https://phabricator.wikimedia.org/T174891 [14:06:31] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: expand SMART health check rollout in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386603 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:07:14] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3712359 (10BBlack) I left this down because @RobH was due on-site a short while later. He observed no SEL entry while on-site. Powercycled back to the OS this morning with traffic depooled. No useful obs... 
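Stepping back to the cache poke at 14:02 above: that one-liner, with get swapped for delete as noted at 14:03, drops the cached CacheAwarePropertyInfoStore entry so the rebuilt property info is picked up. A hedged sketch of running it from a maintenance host, assuming eval.php accepts piped input and that the configuration globals are in scope there (otherwise paste the line into an interactive eval.php session):
  # delete (not get) the shared-cache entry for CacheAwarePropertyInfoStore on wikidatawiki
  echo 'ObjectCache::getInstance( $wgMainCacheType )->delete( $wgWBRepoSettings["sharedCacheKeyPrefix"] . ":CacheAwarePropertyInfoStore" );' \
    | mwscript eval.php --wiki=wikidatawiki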
[14:07:23] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 (owner: 10Marostegui) [14:07:35] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3712361 (10BBlack) a:05BBlack>03RobH [14:07:37] (03PS1) 10Ema: varnish: ensure consistent CL [puppet] - 10https://gerrit.wikimedia.org/r/386616 [14:08:03] (03CR) 10jerkins-bot: [V: 04-1] varnish: ensure consistent CL [puppet] - 10https://gerrit.wikimedia.org/r/386616 (owner: 10Ema) [14:09:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 (owner: 10Marostegui) [14:09:40] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3712363 (10Marostegui) I have repooled db1101 [14:10:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101 - T178383 (duration: 00m 49s) [14:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:22] T178383: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383 [14:10:39] (03PS2) 10Ema: varnish: ensure consistent CL [puppet] - 10https://gerrit.wikimedia.org/r/386616 [14:11:50] (03CR) 10BBlack: [C: 031] varnish: ensure consistent CL [puppet] - 10https://gerrit.wikimedia.org/r/386616 (owner: 10Ema) [14:12:17] (03PS1) 10Muehlenhoff: Ship a dummy config since IncludeOptional isn't really optional [puppet] - 10https://gerrit.wikimedia.org/r/386617 [14:12:51] PROBLEM - Check systemd state on maps-test2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:13:55] (03CR) 10Ema: [C: 032] varnish: ensure consistent CL [puppet] - 10https://gerrit.wikimedia.org/r/386616 (owner: 10Ema) [14:14:56] (03PS2) 10Muehlenhoff: graphite: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385994 [14:15:07] (03CR) 10Elukey: [C: 031] Ship a dummy config since IncludeOptional isn't really optional [puppet] - 10https://gerrit.wikimedia.org/r/386617 (owner: 10Muehlenhoff) [14:15:21] maps-test2004 is smartd not starting, I'll take a look [14:15:45] (03PS1) 10Alexandros Kosiaris: Disable notifications for the ORES stresstests hosts [puppet] - 10https://gerrit.wikimedia.org/r/386619 [14:16:06] (03CR) 10Muehlenhoff: [C: 032] graphite: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385994 (owner: 10Muehlenhoff) [14:16:44] (03PS2) 10Alexandros Kosiaris: Disable notifications for the ORES stresstests hosts [puppet] - 10https://gerrit.wikimedia.org/r/386619 [14:16:47] (03CR) 10Alexandros Kosiaris: [C: 032] Disable notifications for the ORES stresstests hosts [puppet] - 10https://gerrit.wikimedia.org/r/386619 (owner: 10Alexandros Kosiaris) [14:17:12] PROBLEM - Check systemd state on maps-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:15] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Disable notifications for the ORES stresstests hosts [puppet] - 10https://gerrit.wikimedia.org/r/386619 (owner: 10Alexandros Kosiaris) [14:22:21] (03PS1) 10Muehlenhoff: uwsgi: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386620 [14:24:12] (03PS1) 10Muehlenhoff: keyholder: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386621 [14:27:20] 10Operations, 10Puppet, 10User-Joe: Puppet: Use of 'import' has been discontinued in favor of a manifest directory. 
- https://phabricator.wikimedia.org/T179023#3712379 (10Joe) Everything that moves to puppet 4 is "environment future" now until WMCS moves at least to the future parser. [14:27:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3712381 (10Joe) [14:27:35] 10Operations, 10Puppet, 10User-Joe: Puppet: Use of 'import' has been discontinued in favor of a manifest directory. - https://phabricator.wikimedia.org/T179023#3712380 (10Joe) 05Open>03Invalid [14:28:21] PROBLEM - Check systemd state on maps-test2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:29:57] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8477/ all noops AFAICS" [puppet] - 10https://gerrit.wikimedia.org/r/386614 (https://phabricator.wikimedia.org/T179033) (owner: 10Giuseppe Lavagetto) [14:31:06] (03PS2) 10Giuseppe Lavagetto: require_package: puppet 4.x compatibility [puppet] - 10https://gerrit.wikimedia.org/r/386614 (https://phabricator.wikimedia.org/T179033) [14:31:12] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] require_package: puppet 4.x compatibility [puppet] - 10https://gerrit.wikimedia.org/r/386614 (https://phabricator.wikimedia.org/T179033) (owner: 10Giuseppe Lavagetto) [14:31:28] (03PS14) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [14:32:00] PROBLEM - Check systemd state on maps-test2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:35:30] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2088480 [14:49:38] (03CR) 10BBlack: [C: 04-1] "In addition to the two code-comments: I think we're missing a change to modules/profile/manifests/cache/misc.pp ?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [14:50:17] (03CR) 10Volans: "Is this a noop or requires the restart of all keyholders?" 
[puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [14:51:47] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [14:51:47] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [14:51:47] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [14:51:47] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [14:51:56] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [14:51:57] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 78.56 ms [14:51:57] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [14:51:57] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 114 ESP OK [14:52:08] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [14:52:08] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [14:52:08] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [14:52:08] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [14:52:08] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [14:52:16] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [14:52:17] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [14:52:17] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [14:52:26] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [14:52:26] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [14:52:26] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [14:52:27] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [14:52:36] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [14:52:37] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [14:52:37] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [14:52:37] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [14:52:37] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [14:54:57] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [14:55:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Use thirdparty/k8s repository in docker class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386395 (owner: 10Muehlenhoff) [14:55:25] (03PS15) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [14:56:47] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [14:56:56] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [14:58:42] (03PS13) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [15:02:12] (03CR) 10Elukey: Set up Kafka MirrorMaker from main -> jumbo in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [15:05:02] (03CR) 10DCausse: [C: 031] logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) (owner: 10Gehel) [15:07:56] (03CR) 10Muehlenhoff: "It's a NOP, just a cleanup to convert existing manual file definitions to systemd::tmpfile." 
[puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [15:09:53] 10Operations, 10Puppet, 10User-Joe: Puppet: Evaluation Error: Error while evaluating a Function Call, Could not autoload puppet/parser/functions/ordered_yaml: cannot load such file -- puppet/util/zaml.rb - https://phabricator.wikimedia.org/T179076#3712487 (10herron) [15:10:05] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [15:10:43] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [15:10:45] 10Operations, 10Puppet, 10User-Joe: Puppet: Error: Evaluation Error: Error while evaluating a Function Call, undefined local variable or method `known_resource_types' - https://phabricator.wikimedia.org/T179033#3712505 (10herron) 05Open>03Resolved [15:10:51] 10Operations, 10Puppet, 10User-Joe: Puppet: Error: Evaluation Error: Error while evaluating a Function Call, undefined local variable or method `known_resource_types' - https://phabricator.wikimedia.org/T179033#3711031 (10herron) [15:11:08] (03CR) 10Volans: [C: 031] "compiler diff looks sane to me:" [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [15:16:16] (03PS3) 10Muehlenhoff: Use thirdparty/k8s repository for profile::docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/386395 [15:16:34] (03PS4) 10DCausse: logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) (owner: 10Gehel) [15:20:57] 10Operations, 10Pybal, 10Traffic, 10netops, 10Patch-For-Review: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3712530 (10elukey) Just found this nice comment in a mozilla repo (not sure if this is the core of firefox or some old thing): https://github.com/m... 
[15:22:35] 10Operations, 10Puppet, 10User-Joe: Puppet: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API - https://phabricator.wikimedia.org/T179077#3712540 (10herron) [15:22:55] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [15:26:30] 10Operations, 10Services (watching), 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3712574 (10fgiunchedi) [15:26:53] (03PS5) 10Gehel: logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) [15:27:59] (03CR) 10Gehel: [C: 032] logstash: explicit mappings for a few problematic fields [puppet] - 10https://gerrit.wikimedia.org/r/386608 (https://phabricator.wikimedia.org/T179058) (owner: 10Gehel) [15:28:03] 10Operations, 10Puppet, 10User-Joe: Puppet: Evaluation Error: Error while evaluating a Function Call, Could not autoload puppet/parser/functions/ordered_yaml: cannot load such file -- puppet/util/zaml.rb - https://phabricator.wikimedia.org/T179076#3712576 (10herron) [15:28:42] (03CR) 10jenkins-bot: db-eqiad.php: Clean up comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386600 (owner: 10Marostegui) [15:32:19] (03PS16) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [15:33:31] (03CR) 10jenkins-bot: Change threshold for slow AbuseFilter logging to 800ms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386547 (https://phabricator.wikimedia.org/T179039) (owner: 10Dmaza) [15:33:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386575 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [15:33:54] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM! 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/386395 (owner: 10Muehlenhoff) [15:33:58] (03PS4) 10Alexandros Kosiaris: Use thirdparty/k8s repository for profile::docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/386395 (owner: 10Muehlenhoff) [15:34:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use thirdparty/k8s repository for profile::docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/386395 (owner: 10Muehlenhoff) [15:34:38] (03PS1) 10Anomie: Set $wgCentralAuthGlobalBlockInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386627 [15:34:49] !log deleting and reimporting logs in logstash (today data, might lose some logs in the process) [15:34:54] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386609 (owner: 10Marostegui) [15:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:20] (03PS17) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [15:35:59] (03CR) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [15:36:41] (03CR) 10Ottomata: [C: 032] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [15:36:49] !log reindexing logs in logstash [15:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:22] 10Operations, 10monitoring: mpt raid controller not detected as fact on maps-test2* - https://phabricator.wikimedia.org/T179078#3712606 (10fgiunchedi) [15:37:30] 10Operations, 10monitoring: mpt raid controller not detected as fact on maps-test2* - https://phabricator.wikimedia.org/T179078#3712620 (10fgiunchedi) p:05Triage>03Lowest [15:39:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386578 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [15:41:00] !log mobrovac@tin Started deploy [restbase/deploy@522321a]: Double-process all summaries [15:41:07] (03PS1) 10DCausse: [logstash] add priority [puppet] - 10https://gerrit.wikimedia.org/r/386629 [15:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:49] !log upgrading dnsmasq on labnet1001 and labnet1002 [15:41:54] (03PS1) 10Filippo Giunchedi: hieradata: exclude maps-test hosts from smart health check [puppet] - 10https://gerrit.wikimedia.org/r/386630 (https://phabricator.wikimedia.org/T86552) [15:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:29] (03PS2) 10Gehel: [logstash] add priority [puppet] - 10https://gerrit.wikimedia.org/r/386629 (owner: 10DCausse) [15:43:29] (03CR) 10Gehel: [C: 032] [logstash] add priority [puppet] - 10https://gerrit.wikimedia.org/r/386629 (owner: 10DCausse) [15:45:15] !log restarting cassandra, restbase2001-b [15:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:50] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: exclude maps-test hosts from smart health check [puppet] - 10https://gerrit.wikimedia.org/r/386630 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:45:54] (03PS2) 10Filippo Giunchedi: hieradata: exclude maps-test hosts from smart health check [puppet] - 
10https://gerrit.wikimedia.org/r/386630 (https://phabricator.wikimedia.org/T86552) [15:48:04] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [15:48:20] (03PS1) 10DCausse: [logstash] add lineno [puppet] - 10https://gerrit.wikimedia.org/r/386632 [15:48:32] ok, there are exceptions here [15:48:41] gehel: ^ [15:48:46] sorry, wrong channel [15:48:59] (03PS2) 10Gehel: [logstash] add lineno [puppet] - 10https://gerrit.wikimedia.org/r/386632 (owner: 10DCausse) [15:49:04] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042 [15:49:40] (03CR) 10Gehel: [C: 032] [logstash] add lineno [puppet] - 10https://gerrit.wikimedia.org/r/386632 (owner: 10DCausse) [15:50:09] !log mobrovac@tin Finished deploy [restbase/deploy@522321a]: Double-process all summaries (duration: 09m 09s) [15:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:53] RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational [15:54:03] RECOVERY - Check systemd state on maps-test2001 is OK: OK - running: The system is fully operational [15:54:14] RECOVERY - Check systemd state on maps-test2002 is OK: OK - running: The system is fully operational [15:54:24] RECOVERY - Check systemd state on maps-test2003 is OK: OK - running: The system is fully operational [15:54:53] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 0, shutdown: 2 [15:55:53] godog: thanks for this maps-test fix! I was soo focused on other things... [15:57:29] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3712681 (10elukey) Updated status: ``` elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group... [15:57:37] gehel: no worries! I broke it so I get to keep both pieces :)) [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1600). [16:00:04] hoo: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:41] * hoo waves [16:04:23] !log mobrovac@tin Started deploy [restbase/deploy@860cbfe]: Revert all summaries, back to all but WP [16:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:35] !log mobrovac@tin Started deploy [restbase/deploy@860cbfe]: Revert all summaries, back to all but WP [16:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:53] (03PS14) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [16:05:17] (03CR) 10Ema: VCL: Exp cache admission policy for varnish-fe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [16:05:39] Anyone willing to do puppet swat? [16:06:57] hi, someone is reporting that going to https://fa.wikipedia.org/wiki/ژان-پل_سارتر is failing [16:07:01] If you report this error to the Wikimedia System Administrators, please include the details below. 
[16:07:01] Request from 86.147.143.221 via cp1054 cp1054, Varnish XID 570564814 [16:07:01] Error: 503, Backend fetch failed at Thu, 26 Oct 2017 16:06:54 GMT [16:07:01] Error: 503, Backend fetch failed at Thu, 26 Oct 2017 16:06:54 GMT [16:08:40] (03PS1) 10Elukey: role::mediawiki::jobrunner: raise temporarily runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [16:09:09] (03CR) 10jerkins-bot: [V: 04-1] role::mediawiki::jobrunner: raise temporarily runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [16:09:57] paladox: That is https://phabricator.wikimedia.org/T178632 [16:10:19] thanks [16:10:33] (03PS2) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [16:11:37] (03CR) 10BBlack: [C: 031] VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [16:12:06] (03PS1) 10EBernhardson: Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 [16:12:32] !log mobrovac@tin Finished deploy [restbase/deploy@860cbfe]: Revert all summaries, back to all but WP (duration: 07m 57s) [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:26] !log truncating cassandra hints in restbase-ng cluster [16:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:52] (03CR) 10jenkins-bot: Enable signature button in Projekt namespace on sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386567 (https://phabricator.wikimedia.org/T175363) (owner: 10Jon Harald Søby) [16:18:05] !log awight@tin Started deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (take 3), T178441 [16:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:14] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 [16:18:35] (03CR) 10jenkins-bot: Add property for RDF mapping of external identifiers for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386594 (https://phabricator.wikimedia.org/T178180) (owner: 10Ladsgroup) [16:19:44] (03CR) 10jenkins-bot: Revert "db-codfw.php: Repool db2038" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386577 (owner: 10Marostegui) [16:20:06] godog: moritzm: _joe_: No one up for puppet SWAT? 
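(Editorial aside — not part of the log.) The 503 report pasted above ("Backend fetch failed" via cp1054) is the kind of thing that gets triaged by re-requesting the page and looking at which cache answered. A hedged sketch, assuming the response carries the usual Varnish-style headers (X-Cache, X-Varnish, Age — header names may differ):

```python
# Rough triage helper, not a WMF tool: reproduce a reported 503 and print
# which cache layer responded. Header names are assumptions.
import requests

def check(url):
    resp = requests.get(url, headers={"User-Agent": "503-triage-sketch/0.1"}, timeout=10)
    print(resp.status_code, resp.reason)
    for header in ("X-Cache", "X-Varnish", "Age", "Server"):
        if header in resp.headers:
            print(f"{header}: {resp.headers[header]}")

if __name__ == "__main__":
    # the page reported earlier in the channel
    check("https://fa.wikipedia.org/wiki/ژان-پل_سارتر")
```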
[16:20:39] (03CR) 10jenkins-bot: Revert "Add negative weight to disambig entities" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386552 (owner: 10Legoktm) [16:21:16] !log restarted cassandra, restbase2001-b [16:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:28] hoo: don't have time today, in an interview [16:21:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386581 (owner: 10Marostegui) [16:22:08] (03CR) 10jenkins-bot: db-codfw.php: Add db2088 and db2084 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386607 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [16:22:48] (03CR) 10jenkins-bot: Fix interwiki sort order for Northern Sami [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965) (owner: 10Jon Harald Søby) [16:23:50] (03CR) 10Andrew Bogott: [C: 032] maintain-views: add additional log types [puppet] - 10https://gerrit.wikimedia.org/r/386465 (https://phabricator.wikimedia.org/T178752) (owner: 10BryanDavis) [16:23:55] (03PS2) 10Andrew Bogott: maintain-views: add additional log types [puppet] - 10https://gerrit.wikimedia.org/r/386465 (https://phabricator.wikimedia.org/T178752) (owner: 10BryanDavis) [16:24:03] hoo: I think that everybody is a bit busy, and I have to log off very soon, if you don't manage to get your changes merged today feel free to ping me or anybody else tomorrow (I'll work in the EU daytime, don't rememeber yours) [16:24:28] (03PS1) 10Ottomata: Disable kafka MirrorMaker on kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/386640 (https://phabricator.wikimedia.org/T177216) [16:24:43] is it blocking you or is it ok? (not ideal I know :( ) [16:26:46] very quickly, https://gerrit.wikimedia.org/r/#/c/384951/2 looks ok (and also +1ed by jaime) but https://gerrit.wikimedia.org/r/#/c/386591/2 might need some work (the cron line is a bit overloaded :) [16:26:58] (03CR) 10EBernhardson: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [16:27:08] elukey: The sql thing is not as important [16:27:24] but getting the logging in place would be really nice [16:27:41] because we keep having problems with that and would like to finally be able to diagnose them [16:27:48] (03PS1) 10Giuseppe Lavagetto: First version of scap deployment of docker-pkg [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386641 [16:28:22] (03PS1) 10Chad: group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 [16:28:23] (03CR) 10Chad: [C: 04-2] group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [16:29:31] hoo: yes I understand, so I'd do two things: 1) verify via puppet that /var/log/wikidata is created with proper perms 2) log only the script's output to /var/log/wikidata/dispatchChanges-wikidatawiki.log rather than the other things (as starter) [16:30:09] I can work on it tomorrow morning, update your patch and then see if I can merge or not [16:30:25] (03PS1) 10RobH: set bast4002.wikimedia.org production dns entries [dns] - 10https://gerrit.wikimedia.org/r/386643 (https://phabricator.wikimedia.org/T179050) [16:31:54] (03PS2) 10RobH: set bast4002.wikimedia.org dns entries [dns] - 10https://gerrit.wikimedia.org/r/386643 (https://phabricator.wikimedia.org/T179050) [16:31:54] 
Having the other stuff in there is quite important IMO [16:32:00] would be really nice to have this [16:32:36] (03CR) 10RobH: [C: 032] set bast4002.wikimedia.org dns entries [dns] - 10https://gerrit.wikimedia.org/r/386643 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [16:32:52] ah wait I am dumb, /var/log/wikidata is already ensured [16:32:54] nevermind [16:33:39] (03PS2) 10Ottomata: Disable kafka MirrorMaker on kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/386640 (https://phabricator.wikimedia.org/T177216) [16:33:39] !log awight@tin Finished deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (take 3), T178441 (duration: 15m 33s) [16:33:42] (03CR) 10Ottomata: [V: 032 C: 032] Disable kafka MirrorMaker on kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/386640 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [16:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 [16:34:05] hoo: ok so if you could at least assign /var/log/wikidata/dispatchChanges-wikidatawiki.log to a variable and reuse it [16:34:20] Sure thing [16:34:27] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3712808 (10RobH) [16:34:28] I can also create it and depend on it, if you want [16:34:57] 10Operations, 10Puppet, 10User-Joe: Puppet: Failed to parse template authdns/discovery-statefile.tpl.erb - https://phabricator.wikimedia.org/T179084#3712809 (10herron) [16:35:14] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [16:35:26] hoo: so the /var/log/wikidata is created below, would be great if the cron depended on it but it should also be done to the other ones :) [16:35:41] anyhow, I think that with the variable it could work [16:36:09] 10Operations, 10Puppet, 10User-Joe: Puppet4: Evaluation Error: Error while evaluating a Function Call, Could not autoload puppet/parser/functions/ordered_yaml: cannot load such file -- puppet/util/zaml.rb - https://phabricator.wikimedia.org/T179076#3712831 (10herron) [16:36:25] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API - https://phabricator.wikimedia.org/T179077#3712835 (10herron) [16:36:36] 10Operations, 10Puppet, 10User-Joe: Puppet4: Failed to parse template authdns/discovery-statefile.tpl.erb - https://phabricator.wikimedia.org/T179084#3712809 (10herron) [16:38:31] hoo: need to log off sorry, if you don't find anybody merging I promise I'll do it tomorrow :) [16:38:36] sorry again! 
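(Editorial aside — not part of the log.) The review discussion above is about logging the Wikidata dispatcher from cron: keep the log path in one variable and send only the script's own output to /var/log/wikidata/dispatchChanges-wikidatawiki.log. A minimal sketch of that shape — this is not the puppet change under review (386591), and the mwscript wrapper path and maintenance-script path are assumptions:

```python
# Sketch of the idea discussed above, not the actual puppet cron definition.
import subprocess

DISPATCH_LOG = "/var/log/wikidata/dispatchChanges-wikidatawiki.log"  # single source of truth

def run_dispatcher():
    cmd = [
        "/usr/local/bin/mwscript",  # assumed maintenance wrapper
        "extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php",  # assumed path
        "--wiki", "wikidatawiki",
    ]
    # append only the script's stdout/stderr to the dispatch log
    with open(DISPATCH_LOG, "a") as log:
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)

if __name__ == "__main__":
    raise SystemExit(run_dispatcher())
```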
[16:38:42] * elukey off [16:43:06] !log awight@tin Started deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (try to rebuild venv), T178441 [16:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:14] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 [16:44:08] !log awight@tin Finished deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (try to rebuild venv), T178441 (duration: 01m 02s) [16:44:11] (03PS1) 10Ottomata: Run MirrorMaker on analytics Kafka hosts to mirror main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/386648 (https://phabricator.wikimedia.org/T177216) [16:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:44] (03CR) 10jerkins-bot: [V: 04-1] Run MirrorMaker on analytics Kafka hosts to mirror main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/386648 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [16:44:53] (03CR) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [16:46:49] (03CR) 10DCausse: [C: 031] Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 (owner: 10EBernhardson) [16:49:14] PROBLEM - Apache HTTP on mw2105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:03] RECOVERY - Apache HTTP on mw2105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.121 second response time [16:50:37] (03PS1) 10RobH: setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) [16:51:11] (03CR) 10jerkins-bot: [V: 04-1] setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [16:54:39] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8487/kafka1012.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386648 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [16:57:57] (03PS2) 10RobH: setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) [16:58:34] !log now mirroring main Kafka cluster topics to jumbo Kafka cluster, with MirrorMaker instances running on analytics-eqiad broker nodes. 
https://phabricator.wikimedia.org/T177216 [16:58:39] (03PS2) 10Bearloga: R, Shiny Server, and Discovery Computing fixes [puppet] - 10https://gerrit.wikimedia.org/r/386536 (https://phabricator.wikimedia.org/T178096) [16:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:45] (03PS2) 10Gehel: Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 (owner: 10EBernhardson) [16:58:53] (03CR) 10jerkins-bot: [V: 04-1] setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [16:59:23] (03PS3) 10RobH: setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) [16:59:53] (03CR) 10jerkins-bot: [V: 04-1] setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:01:39] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Puppet4: Error while evaluating a Function Call, Failed to parse template openstack2/mitaka/horizon/local_settings.py.erb - https://phabricator.wikimedia.org/T179086#3712883 (10herron) [17:01:58] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [17:02:53] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [17:03:24] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 [17:03:31] (03PS4) 10RobH: setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) [17:03:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1 - we cannot add refreshlinks to that queue and not include refreshlinksPrioritized, IMO. bumping refreshLinks out of the 'low_prio' queu" [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [17:03:42] awight: do you want to deploy? [17:03:52] I wish... [17:04:13] (03PS3) 10Gehel: Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 (owner: 10EBernhardson) [17:04:22] I’m about 95% certain it would cause downtime. Would be an interesting experiment, though :-) [17:04:49] Amir1: I ran into yet another exciting glitch: scap seems to not run the commands to rebuilt venv on ores*. [17:05:00] (03CR) 10DCausse: [C: 031] Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 (owner: 10EBernhardson) [17:05:08] (03CR) 10Gehel: [C: 032] Update logstash elasticsearch template to match prior deployment [puppet] - 10https://gerrit.wikimedia.org/r/386637 (owner: 10EBernhardson) [17:05:17] oh okay [17:05:40] hmm [17:05:55] I see that it *is* rebuilding venv on scb* [17:06:26] nah, only 5m [17:06:28] Amir1: Perhaps I’ll do a canary deploy and see how it looks. The two problems I’m aware of are easy to test for. 
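(Editorial aside — not part of the log.) The canary-deploy conversation above ("the canary gets traffic... 11% of it") boils down to a smoke test against the freshly deployed host before continuing. A hedged sketch of such a check against the ORES scores endpoint — the host, endpoint layout and response shape are assumptions, and this is not the tooling the team actually uses:

```python
# Minimal post-deploy smoke test sketch: ask the service to score a known
# revision and fail loudly if anything errors. Endpoint and wiki/model/revid
# values are illustrative assumptions.
import sys
import requests

URL = "https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&revids=123456"

def smoke_test():
    resp = requests.get(URL, timeout=15, headers={"User-Agent": "ores-smoke-sketch/0.1"})
    if resp.status_code != 200:
        print(f"HTTP {resp.status_code}", file=sys.stderr)
        return False
    # crude check: any embedded "error" object in the JSON fails the test
    if '"error"' in resp.text:
        print(resp.text[:500], file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if smoke_test() else 1)
```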
[17:06:32] (mispaste) [17:06:59] robh: almost relevant, though :) there’s like a 5m in 100 chance I’m about to break wikipedia. [17:07:14] (03CR) 10RobH: [C: 032] setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [17:07:16] Amir1: yeah, it’s reckless but I’ll go ahead. [17:07:19] (03PS5) 10RobH: setting bast4001 params [puppet] - 10https://gerrit.wikimedia.org/r/386652 (https://phabricator.wikimedia.org/T179050) [17:08:01] awight: the canary gets traffic [17:08:03] !log [logstash] deleting today logs to reapply mapping template [17:08:09] we are blue/green [17:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:22] Amir1: right, 11% of it. Hold me back :-) [17:08:49] !log [logstash] reimporting today logs [17:08:53] :D [17:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:06] !log Optimize ruwiki.recentchanges on dbstore1001 - T162789 [17:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:13] T162789: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789 [17:10:54] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [17:11:32] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [17:11:37] !log awight@tin Started deploy [ores/deploy@971be22]: ORES w/ revscoring 2 and Celery 4, T175180 T178441 [17:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [17:11:46] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 [17:14:57] !log awight@tin Finished deploy [ores/deploy@971be22]: ORES w/ revscoring 2 and Celery 4, T175180 T178441 (duration: 03m 20s) [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:56] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1540070 (10debt) This is so old... @gehel, can you go through the subtasks and see if any of them are sti... [17:16:41] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:16:42] PROBLEM - ores on scb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.003 second response time [17:16:49] !log awight@tin Started deploy [ores/deploy@971be22]: Rolling back scb1002, T175180 T178441 [17:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:58] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [17:16:58] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 [17:17:22] !log awight@tin Finished deploy [ores/deploy@971be22]: Rolling back scb1002, T175180 T178441 (duration: 00m 32s) [17:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:41] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [17:17:42] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.012 second response time [17:18:11] 10Operations, 10Parsoid, 10Traffic, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3702250 (10Arlolra) What does your Parsoid config.yaml look like? [17:18:57] awight: please ping me when you're done [17:19:09] arlolra: I should be done now. [17:19:13] ok, thanks [17:19:17] 10Operations, 10Pybal, 10Traffic, 10netops, 10Patch-For-Review: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3713056 (10BBlack) It looks like that commentary is still valid for even the most-recent Firefox builds. So, we may find that even modern Firefox d... [17:21:31] !log arlolra@tin Started deploy [parsoid/deploy@4882c59]: Updating Parsoid to 8e99708a [17:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:16] 10Operations, 10ops-eqiad, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3713069 (10faidon) p:05Triage>03Normal [17:23:11] 10Operations, 10ops-eqiad, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3711364 (10faidon) [17:24:58] 10Operations, 10ops-eqiad, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3713103 (10faidon) Image has been downloaded to the install* servers. [17:25:41] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3713111 (10herron) [17:27:41] (03PS3) 10Gehel: R, Shiny Server, and Discovery Computing fixes [puppet] - 10https://gerrit.wikimedia.org/r/386536 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [17:28:34] (03CR) 10Gehel: [C: 032] R, Shiny Server, and Discovery Computing fixes [puppet] - 10https://gerrit.wikimedia.org/r/386536 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [17:30:12] !log arlolra@tin Finished deploy [parsoid/deploy@4882c59]: Updating Parsoid to 8e99708a (duration: 08m 40s) [17:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:02] 10Operations, 10DBA, 10cloud-services-team, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3713192 (10bd808) >>! In T168584#3570152, @bd808 wrote: > If we do lose a disk on 1001/3 to the powercycle though it will be hard to recover so we s... 
[17:34:00] (03PS1) 10Madhuvishy: labsdb: Switchover dns for labsdb1001 shards to labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/386660 (https://phabricator.wikimedia.org/T168584) [17:34:39] (03CR) 10Madhuvishy: [C: 04-2] "[Don't merge yet]" [puppet] - 10https://gerrit.wikimedia.org/r/386660 (https://phabricator.wikimedia.org/T168584) (owner: 10Madhuvishy) [17:36:58] !log Updated Parsoid to 8e99708a (T176728) [17:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] T176728: Parsoid workers death rate increased transforming arbitrary wiki text - https://phabricator.wikimedia.org/T176728 [17:37:32] (03PS3) 10Hoo man: Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) [17:37:35] (03PS1) 10Hoo man: Declare requirements in mediawiki::maintenance::wikidata [puppet] - 10https://gerrit.wikimedia.org/r/386662 [17:38:05] (03CR) 10jerkins-bot: [V: 04-1] Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [17:38:16] (03CR) 10jerkins-bot: [V: 04-1] Declare requirements in mediawiki::maintenance::wikidata [puppet] - 10https://gerrit.wikimedia.org/r/386662 (owner: 10Hoo man) [17:43:20] 10Operations, 10Puppet, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3713236 (10herron) [17:45:19] !log awight@tin Started deploy [ores/deploy@f5deb7f]: Revscoring 2 on ores1002 (non-production) [17:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:05] !log awight@tin Finished deploy [ores/deploy@f5deb7f]: Revscoring 2 on ores1002 (non-production) (duration: 01m 45s) [17:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:25] !log ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause [17:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:11] (03PS2) 10Hoo man: Declare requirements in mediawiki::maintenance::wikidata [puppet] - 10https://gerrit.wikimedia.org/r/386662 [17:52:14] (03PS4) 10Hoo man: Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) [17:52:44] (03CR) 10jerkins-bot: [V: 04-1] Declare requirements in mediawiki::maintenance::wikidata [puppet] - 10https://gerrit.wikimedia.org/r/386662 (owner: 10Hoo man) [17:52:54] (03CR) 10jerkins-bot: [V: 04-1] Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [17:54:04] (03PS1) 10Herron: Puppet: Change hostcert and hostprivkey paths on puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) [17:56:39] (03PS3) 10Hoo man: Declare requirements in mediawiki::maintenance::wikidata [puppet] - 10https://gerrit.wikimedia.org/r/386662 [17:56:41] (03PS5) 10Hoo man: Log Wikidata dispatchers on terbium [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) [17:56:43] of course that repooled the faulty cp4024 :P [17:56:59] WTB depooled-reasons stacks :) [17:57:16] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet [17:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:17] !log codfw cp servers: 
rolling quick depool -> repool around ethtool parameter changes for -lro,-pause [17:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1800). [18:00:05] Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] o/ [18:00:45] jouncebot is getting more and more funny [18:02:01] 10Operations, 10Puppet: Puppet4: Create empty/placeholder /etc/apache2/sites-enabled/puppet-master.conf - https://phabricator.wikimedia.org/T179102#3713307 (10herron) [18:03:37] I think it's getting passive-aggressive [18:03:37] in my day it was a t-shirt... must be inflation [18:05:12] lol [18:06:11] (03PS1) 10Bearloga: role::discovery::learner: Remove deep learning profile [puppet] - 10https://gerrit.wikimedia.org/r/386669 (https://phabricator.wikimedia.org/T178096) [18:07:28] I never got a tshirt for my accidents [18:07:46] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:07:55] I think they had run out or something [18:07:56] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:07:57] PROBLEM - Host flerovium is DOWN: PING CRITICAL - Packet loss = 100% [18:07:57] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:07:57] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:08:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:08:07] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:07] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:07] RECOVERY - Disk space on flerovium is OK: DISK OK [18:08:16] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:16] RECOVERY - Host flerovium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:08:19] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:19] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:08:19] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:08:19] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:19] yay I get to re-ack all of those [18:08:26] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:26] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:26] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:27] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:36] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:37] PROBLEM - IPsec 
on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp4024_v4, cp4024_v6 [18:08:37] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:37] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:46] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:46] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4024_v4, cp4024_v6 [18:08:47] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:47] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:47] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:56] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:08:56] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 [18:09:22] (03CR) 10Gehel: [C: 032] role::discovery::learner: Remove deep learning profile [puppet] - 10https://gerrit.wikimedia.org/r/386669 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [18:10:01] ACKNOWLEDGEMENT - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black T174891 [18:10:01] ACKNOWLEDGEMENT - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black T174891 [18:10:01] ACKNOWLEDGEMENT - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black T174891 [18:10:01] ACKNOWLEDGEMENT - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black T174891 [18:10:01] ACKNOWLEDGEMENT - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4024_v4, cp4024_v6 Brandon Black T174891 [18:10:48] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3713348 (10BBlack) Powering off for now, less confusing for other software-level maintenance. 
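(Editorial aside — not part of the log.) The "rolling quick depool -> repool around ethtool parameter changes for -lro,-pause" entries above describe a per-host sequence: take the cache out of rotation, flip the NIC settings, put it back. A hedged sketch of that sequence run locally on one host — the depool/pool wrapper scripts, interface name and drain time are assumptions, not the actual procedure used:

```python
# Sketch of a one-host rolling ethtool change: depool, disable LRO and pause
# frames, repool. Commands and interface name are assumptions.
import subprocess
import time

IFACE = "eth0"  # assumed primary interface

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def apply_ethtool_changes():
    run("depool")                                          # assumed conftool wrapper
    time.sleep(30)                                         # let in-flight requests drain
    run("ethtool", "-K", IFACE, "lro", "off")              # disable large receive offload
    run("ethtool", "-A", IFACE, "rx", "off", "tx", "off")  # disable pause/flow control
    run("pool")                                            # assumed conftool wrapper

if __name__ == "__main__":
    apply_ethtool_changes()
```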
[18:16:34] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3713361 (10madhuvishy) Started a planning doc for the reboots here - https://etherpad.wikimedia.org/p/labsdb-reboots [18:19:09] (03PS1) 10Ottomata: Reenable raw data drop job [puppet] - 10https://gerrit.wikimedia.org/r/386671 [18:19:21] (03PS1) 10RobH: splitting bast4002 to its own entry [puppet] - 10https://gerrit.wikimedia.org/r/386672 (https://phabricator.wikimedia.org/T179050) [18:19:51] (03CR) 10jerkins-bot: [V: 04-1] splitting bast4002 to its own entry [puppet] - 10https://gerrit.wikimedia.org/r/386672 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [18:20:16] PROBLEM - Disk space on flerovium is CRITICAL: DISK CRITICAL - free space: /mnt/2a 1248881 MB (3% inode=96%) [18:20:27] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [18:20:58] (03PS2) 10RobH: splitting bast4002 to its own entry [puppet] - 10https://gerrit.wikimedia.org/r/386672 (https://phabricator.wikimedia.org/T179050) [18:21:33] (03CR) 10jerkins-bot: [V: 04-1] splitting bast4002 to its own entry [puppet] - 10https://gerrit.wikimedia.org/r/386672 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [18:21:46] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [18:22:06] RECOVERY - Disk space on furud is OK: DISK OK [18:25:21] (03CR) 10RobH: [V: 032 C: 032] "forcing the +v since the style changes dont need to block this particular patchset (refactoring will take place with bast4002 online as sp" [puppet] - 10https://gerrit.wikimedia.org/r/386672 (https://phabricator.wikimedia.org/T179050) (owner: 10RobH) [18:26:18] (03PS1) 10Ottomata: Add comment about flerovium and furud in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/386674 [18:26:45] !log T179105: Altering "enwiki_T_page__summary".data to use LZ4Compressor [18:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:54] T179105: Change new storage strategy defaults for Cassandra compression - https://phabricator.wikimedia.org/T179105 [18:27:09] (03CR) 10Ottomata: [C: 032] Add comment about flerovium and furud in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/386674 (owner: 10Ottomata) [18:27:13] (03PS2) 10Ottomata: Add comment about flerovium and furud in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/386674 [18:27:16] (03CR) 10Ottomata: [V: 032 C: 032] Add comment about flerovium and furud in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/386674 (owner: 10Ottomata) [18:27:28] !log esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause [18:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:56] Is SWAT happening? [18:30:28] !log Dropping and recreating keyspace enwiki_T_page__summary (restbase-ng cluster) [18:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:41] (03PS4) 10BBlack: Global: runtime disable ethernet flow on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/381017 [18:31:18] zeljkof: ? [18:31:43] Amir1: sorry, I'm in a meeting [18:31:46] (03CR) 10BBlack: [C: 032] Global: runtime disable ethernet flow on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/381017 (owner: 10BBlack) [18:32:15] ottomata: ok to merge yours? 
[18:32:26] err, assuming yes, it's just comments [18:32:26] oh ya [18:32:28] just a comment [18:32:31] I can do it on my own [18:32:35] if that's okay [18:32:36] sorry have the yes/no sitting her eon my terminal [18:32:41] go ahead bblack [18:33:07] PROBLEM - Disk space on furud is CRITICAL: DISK CRITICAL - free space: /mnt/2a 1248862 MB (3% inode=96%) [18:33:14] Amir1: sounds good to me :) not sure why you don't deploy yourself all the time ;) [18:34:28] IDK either [18:34:29] :D [18:34:36] (03PS10) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [18:34:56] (03PS2) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [18:35:31] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:36:07] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [18:37:16] RECOVERY - Disk space on furud is OK: DISK OK [18:37:16] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [18:37:17] RECOVERY - Disk space on flerovium is OK: DISK OK [18:37:56] (03CR) 10Elukey: "> 1 - we cannot add refreshlinks to that queue and not include" [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [18:38:39] (03PS2) 10Ottomata: Reenable raw data drop job [puppet] - 10https://gerrit.wikimedia.org/r/386671 [18:39:29] (03CR) 10Ottomata: [C: 032] Reenable raw data drop job [puppet] - 10https://gerrit.wikimedia.org/r/386671 (owner: 10Ottomata) [18:41:41] !log eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause [18:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:46:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:46:39] seems to be a short spike related to the ethtool stuff [18:47:26] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [18:47:44] (with the usual laggy reporting, but it's already over I think) [18:48:34] was pulled in mwdebug1002, works just fine, rolling out live [18:48:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [18:49:31] MW exceptions is probably not me [18:51:20] !log ladsgroup@tin Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s) [18:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:28] T179038: DomainException / NullResult holds no constraint - https://phabricator.wikimedia.org/T179038 [18:52:45] !log ladsgroup@tin Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s) [18:52:47] awight: this doesn't look good: https://logstash.wikimedia.org/goto/2d0be5628310811600aa14dac54dae39 [18:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
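(Editorial aside — not part of the log.) The "HTTP 5xx reqs/min" alerts above fire when a given fraction of recent datapoints sits over a fixed threshold (here 1000/min). A hedged sketch of that kind of check against the graphite render API — the metric name below is hypothetical; only the render API request/response shape is standard graphite:

```python
# Sketch of a graphite threshold check like the 5xx alerts above.
# The target expression is a made-up placeholder.
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "sumSeries(varnish.*.frontend.request.client.status.5xx.sum)"  # hypothetical

def fraction_above(threshold=1000.0, window="-10min"):
    resp = requests.get(
        GRAPHITE,
        params={"target": METRIC, "from": window, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    return sum(v > threshold for v in points) / max(len(points), 1)

if __name__ == "__main__":
    print(f"{fraction_above():.0%} of datapoints above threshold")
```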
[18:53:10] bblack: I think that's ^ [18:53:29] hal [18:53:33] greg-g: Ouch! Okay I’ll make an UBN for that [18:54:01] Amir1: is what you just deployed related or ? ^^ [18:54:19] ok. it's possible my ethtool stuff isn't causing 503s either, I just assumed [18:54:33] greg-g: no, it's related to the service itself [18:54:41] awight :D [18:54:42] but there does seem to be a second spike of 503, and my ethtool stuff has been complete for minutes now [18:55:03] kk, well, either way, we need to rollback or something to get rid of those exceptions [18:55:06] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [18:55:55] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:56:00] Amir1: How about I tweak the extension to catch the exception? [18:56:15] ^ looking [18:56:52] awight: If the service is not behaving correctly, throwing error is reasonable [18:56:55] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2018-08-17 16:11:49 +0000 (expires in 294 days) [18:56:57] we should fix ores.wm.o [18:57:01] not extension [18:57:06] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042 [18:57:38] Hmm, just got an HTTP 503 from en.wikipedia.org. [18:58:01] bblack: ^ that shouldn't be due to the ORES thing, I'm not sure [18:58:19] Esther: thanks, just on a random page view or doing something else? [18:58:28] Esther: (and how long ago was "just got"?) [18:58:28] Amir1: ok I’ll start with the service [18:58:53] Logged in viewing https://en.wikipedia.org/wiki/Special:Contributions/Eggishorn like a minute ago. [18:58:55] oh right, of course, hi MZ ;) [18:58:56] Went away on refresh. [19:00:04] no_justification: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:08] you know the wikis return 503, right ? [19:00:23] matanya: Did you just get one? Which site? [19:00:31] he, en [19:00:36] greg-g: my impression is if MW is logging fatals at a decent clip, this will cause issues for anything else using MW [19:00:37] i got a few in a row [19:00:43] I could be wrong, but I think fatal means fatal [19:00:58] matanya: Glad to know it's not just me. :-) [19:01:00] (not just ores) [19:01:02] no_justification: issues ^^ [19:01:26] awight: status on rolling back ores? [19:01:32] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3713481 (10RobH) [19:02:16] greg-g: It should be rolled back to the stable version it’s been at for months, although I had to do part of the rollback manually. The extension has some recent changes, but unchanged for the past week AFAIK. [19:02:26] greg-g: fyi, I’m working on this as T179107 [19:02:26] T179107: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107 [19:02:41] bblack: the timing lines up with your https://tools.wmflabs.org/sal/log/AV9Z8GaMF4fsM4DBdU73 but.... 
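(Editorial aside — not part of the log.) The traceback quoted above shows a hard-coded 5-second read timeout against wikidata.org. A hedged sketch of picking a less arbitrary value: sample real request latencies, then set the read timeout from a high percentile with some headroom. The API parameters below are illustrative, not what ORES actually fetches:

```python
# Probe wikidata.org latency to inform a read-timeout choice.
# Query parameters are illustrative assumptions.
import statistics
import requests

API = "https://www.wikidata.org/w/api.php"
PARAMS = {"action": "wbgetentities", "ids": "Q42", "format": "json"}

def sample_latencies(n=20):
    times = []
    with requests.Session() as s:
        s.headers["User-Agent"] = "timeout-probe-sketch/0.1"
        for _ in range(n):
            # timeout=(connect, read): generous while measuring
            times.append(s.get(API, params=PARAMS, timeout=(3, 30)).elapsed.total_seconds())
    return times

if __name__ == "__main__":
    latencies = sample_latencies()
    p95 = statistics.quantiles(latencies, n=20)[-1]  # rough 95th percentile
    print(f"p95={p95:.2f}s -> a read timeout around {2 * p95:.1f}s leaves headroom")
```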
[19:03:08] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-1h&to=now&var-site=All&var-cache_type=text&var-status_type=5 [19:03:08] (exactly lines up, that is) [19:03:22] ^ that's the ongoing 503s [19:03:34] Request from (IP) via cp1065 cp1065, Varnish XID 16097821 [19:03:34] Error: 503, Backend fetch failed at Thu, 26 Oct 2017 18:59:24 GMT [19:04:32] greg-g: The root cause seems to be > requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read timeout=5.0) [19:04:42] greg-g: Do you have any other reports that might be related to that? [19:05:04] not yet [19:05:07] (that show deployments button is very cool) [19:05:09] k thx [19:06:14] greg-g: Oreos was doing that for wmf.4 too, so I didn't think it was related to the train. [19:06:19] *ORES. Fucking autocomplete [19:06:34] greg-g: my SAL for esams was 18:27 (and then eqiad at 18:41), the 503 spikes were ~18:39-45, and then again now at 18:50 and ongoing [19:06:35] oreos? A kitten just died [19:06:42] Blame autocomplete. [19:07:08] both of the spikes are through all sites, though. I believe correlation to eqiad is possible for the first one, but that was long-complete before the 18:50 one.... [19:07:24] feels like i am getting most of those 503, so you can be clam, others do not :) [19:08:20] no_justification: at this scale? and going from zero to thousands? [19:09:12] Nope, not at this scale yet, that just spiked [19:09:20] fatals are still ongoing afaics [19:09:22] I was seeing it yesterday though, low-ish volume [19:09:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:12:14] I'm still refreshing a 1h view on: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@7bf0c26&_a=h@d03f80d [19:12:17] still seeing them so far [19:12:24] !log ladsgroup@tin Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 50s) [19:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:31] T179038: DomainException / NullResult holds no constraint - https://phabricator.wikimedia.org/T179038 [19:13:15] bblack: do a share of that url, plz [19:13:41] it's the one I got from you earlier [19:13:58] https://logstash.wikimedia.org/goto/2d0be5628310811600aa14dac54dae39 [19:14:05] !log ladsgroup@tin Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 00m 49s) [19:14:05] kk [19:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:16] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:21] FWIW SWAT is done [19:14:25] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:15] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.026 second response time [19:15:15] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76822 bytes in 0.104 second response time [19:17:15] there's a common pattern in the cache layer too though, so it very well could be a faulty cache server, maybe triggered by the ethtool work [19:17:23] I'm still digging through my end [19:18:02] (03PS1) 
10Ottomata: Disable HDFS httpfs daemon, we don't use it [puppet] - 10https://gerrit.wikimedia.org/r/386684 [19:18:34] awight: How can I help [19:18:41] I'm 100% up now [19:18:48] * Amir1 grabs a coffee [19:19:16] Amir1: sorry about the hour. I’m not sure what the best approach is, but just branched ores-prod-deploy at the last stable revision. [19:19:19] ab88a74d087efff620a3eeb0e5aad1540d2a838b [19:19:27] STABLE_REVSCORING_1 [19:19:46] Amir1: Do you know how to check wikidata’s API performance? [19:19:53] I’d like to choose an informed timeout. [19:20:03] (03PS1) 10Ottomata: Add statsd port to deployment-prep/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/386687 (https://phabricator.wikimedia.org/T179019) [19:20:35] (03CR) 10Ottomata: [C: 032] Add statsd port to deployment-prep/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/386687 (https://phabricator.wikimedia.org/T179019) (owner: 10Ottomata) [19:21:44] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3713527 (10RobH) I'm going to disable its switch port and power it back into the diagnostics (also asked @bblack via irc if there was a result from him running hte output for diagnostics) [19:22:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [19:22:07] Anyone know how to check API performance stats for a specific wiki? [19:23:29] so, update on my end: [19:23:35] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3713532 (10greg) @Addshore go forth and test on group0 [19:23:49] checking [19:23:52] I could see a pattern of most of the 503s coming through cp1065 as the MW-facing backend cache. I restarted the daemon there in case Varnish had gotten into a bad state [19:24:00] https://performance.wikimedia.org/ [19:24:07] Amir1: <3! [19:24:19] that seemed to reduce the rate, but also afterwards I could see the pattern move to another cache (cp1055), restarted that [19:24:36] now the pattern has moved cp1053... [19:24:53] I think it's just moving as I depool, whichever backend is the chash destination for the faulty traffic, basically [19:24:54] https://grafana.wikimedia.org/dashboard/db/performance-metrics?refresh=5m&orgId=1 [19:24:56] Amir1: huh. There’s been no spike. [19:25:21] (so, looking less like a cpNNNN fault... but still looking) [19:25:41] awight: merged your patch [19:26:04] Amir1: If I’m following bblack correctly, this might be a network thing and not ORES or Wikidata. [19:26:20] (03PS2) 10Ayounsi: Update prod ssh key for user mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/386187 (https://phabricator.wikimedia.org/T178897) (owner: 10Mholloway) [19:26:50] a more accurate assessment might be that there's a possibility it's a varnish-level thing, but it's not conclusive either way yet [19:26:58] that there are still MW fatals happening.. 
that should be fixed too [19:27:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:27:11] (03CR) 10Ayounsi: [C: 032] Update prod ssh key for user mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/386187 (https://phabricator.wikimedia.org/T178897) (owner: 10Mholloway) [19:27:19] !log demon@tin Synchronized php-1.31.0-wmf.5/includes/libs/filebackend/SwiftFileBackend.php: Better error grouping (duration: 00m 50s) [19:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:45] maybe we've enetered full inception mode with cause->effect if MW extensions are making API calls back through the front of varnish on public hostnames like wikidata.org :P [19:27:49] 10Operations, 10Patch-For-Review: Update prod SSH key for Michael Holloway (mholloway-shell) - https://phabricator.wikimedia.org/T178897#3713537 (10ayounsi) a:03ayounsi Merged. [19:28:31] bd808: It's completely possible [19:29:08] oh good [19:29:11] Er, bblack ^ [19:29:54] bblack: lol yeah it’s pretty scary stuff [19:31:21] all the 503s are api.php calls [19:31:45] well, the ones coming through the current singular cache responsible for reporting most of the 503s (which has changed a couple times as pooling has changed) [19:33:03] well, not all, but the vast majority anyways [19:33:09] !log awight@tin Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 [19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:16] T179107: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107 [19:33:20] !log awight@tin Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 00m 10s) [19:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:31] can we kill the fatals somehow? [19:33:32] that was fast [19:34:00] !log awight@tin Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 [19:34:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [19:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:33] I'm depooling cp1053 without a restart, just to see it shift again... [19:34:46] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3713551 (10Dzahn) @chasemp Looks like Moritz already did that in T178807#3712012 [19:35:17] hmmm... the 503 rate may have just dropped off [19:35:23] waiting for graphs to update [19:36:08] nope, it moved to another cache server again, now cp1066 [19:36:16] grafana bug report: "show deployments" only shows mw deploys, not services/others [19:36:31] !log aaron@tin Started restart [jobrunner/jobrunner@a20d043]: (no justification provided) [19:36:36] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:03] the ORES related fatals haven't slowed down yet: https://logstash.wikimedia.org/goto/45b01ed086fc9679f7b7c27f5dbb4d17 [19:38:57] if anything gone up in the last 15 minutes [19:39:19] Amir1: A 15s timeout doesn’t help…. 
> awight@scb1002:~$ curl http://0.0.0.0:8081/v3/scores/wikidatawiki/123456 [19:39:35] my best hunch at this point is still that the fatals are breaking other random requests on the api servers, too [19:39:46] (by breaking hhvm in general) [19:40:47] awight: Amir1 so, just to clarify, what changed around 18:27? [19:40:50] Okay, we were discussing in #wikimedia-ai but don’t have a consensus yet. Maybe people here want to weigh in. If Extension:ORES encounters fatal server errors when hitting a MediaWiki API, we’re currently re-throwing the fatal from the extension to get attention. Should we decrease the severity? [19:41:06] greg-g: looking... [19:41:25] !log awight@tin Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 07m 25s) [19:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:33] T179107: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107 [19:41:56] awight: Yes. Re-throwing makes some noise here. [19:42:09] We'd already have API failures in the logs [19:42:25] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0] [19:42:28] greg-g: There’s no ORES deployment corresponding to that. [19:42:47] Currently I’m still working under the theory that the wikidata API is timing out. [19:43:14] no_justification: Even if it’s a read timeout that would only be determined by the extension? [19:43:30] (03CR) 10Aaron Schulz: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [19:43:52] no_justification: AFAIK API failures of ores doesn't show up properly in logstash [19:43:54] awight: That's a little different. "Fatal server error" to me implies you got a response like 500 or somesuch [19:43:57] so the immediate cause of the 503s that varnish is throwing, is that it has no available backend connection slots to forward the request over. it's reached a max_connections parameter that we never reach [19:43:57] A timeout is different [19:44:49] timeout sounds like it needs to stay fataling in the extension [19:44:55] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] [19:45:04] greg-g: fwiw, SAL has 18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause [19:45:21] I don’t see anything related to wikidata, though. [19:45:21] I think that's accidentally-aligned with this [19:45:23] yeah, and he was debugging that theory since then [19:45:25] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [19:45:26] kk I can believe that [19:46:18] the ethtool changes mentioned in my earlier SALs: they caused the ethernet interfaces of the caches to go dark for ~3s while a parameter was changed, and I did some depool->sleep->change->repool->sleep around those and ran through them serially one cache machine at a time. [19:46:22] I'm looking to wdqs... [19:46:34] in theory, at worst they should've caused minor blips, but nothing sustained afterwards [19:47:29] Possible dumb explanation: some cache is Ores expired? 
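For reference, the failure mode being chased here looks roughly like the following in outline — a minimal sketch using the Python requests library (matching the requests.exceptions.ReadTimeout traceback quoted earlier); the endpoint parameters, function name and timeout handling are illustrative assumptions, not the actual ORES/revscoring code:

    import requests

    API_URL = "https://www.wikidata.org/w/api.php"   # the API the ORES workers were timing out against
    READ_TIMEOUT = 5.0                               # the original limit; the experiment above raised it to 15.0

    def fetch_revision_text(rev_id, read_timeout=READ_TIMEOUT):
        """Fetch one revision's content: the kind of upstream call that was hitting ReadTimeout."""
        response = requests.get(
            API_URL,
            params={
                "action": "query",
                "revids": rev_id,
                "prop": "revisions",
                "rvprop": "content",
                "format": "json",
            },
            timeout=(3.05, read_timeout),  # (connect timeout, read timeout)
        )
        response.raise_for_status()
        return response.json()

Raising the read timeout only helps if the slow responses actually finish inside the new window, which is consistent with the report above that going from 5s to 15s made no difference.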
Wikidata went to wmf.5 this time yesterday [19:47:36] FWIW, I experimented with increasing the read timeout for accessing wikidata’s API from 5 to 15 seconds, but it made no difference so I rolled back. [19:47:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:47:47] Did something expire and now we get a failure 24h later? [19:48:13] no_justification: I highly doubt that [19:48:25] the current TL;DR of what I can observe at the cache layer: varnish backends in eqiad are emitting 503s to users. it tends to be mostly from a single cache machine (URIs chash to cache backends, so that could be a faulty machine or a faulty URI). depooling the currently-503-ing cache moves the 503s to another cache (done this several times, which seems to implicate a URI pattern rather than [19:48:29] Just throwing it out there [19:48:29] we only cache scores in ores and that stays there practically forever [19:48:31] a faulty cache) [19:48:45] when looking at the bulk of the 503s, they're for api.php URIs overwhelmingly. [19:48:56] (03PS3) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [19:49:03] and our backend connections from the cache -> API appservers are currently max_connections limited (at 1K) [19:49:24] (03PS1) 10Herron: puppet: add puppet-master.conf to avoid conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) [19:49:25] so the reason it's throwing the 503s is it's out of max_connections to talk to the api servers, presumably because other requests are taking forever or crashing out connections quickly [19:49:52] (03CR) 10jerkins-bot: [V: 04-1] puppet: add puppet-master.conf to avoid conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) (owner: 10Herron) [19:49:55] (03PS4) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [19:50:39] (which probably isn't an unreasonable result of a high rate of fatal exceptions on api appservers) [19:52:31] (03PS2) 10Herron: puppet: add puppet-master.conf to avoid conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) [19:52:38] Amir1: Maybe we should turn off ORES for wikidata? [19:52:44] [19:53:07] I can do that [19:53:21] On it [19:53:39] in practice, we don't disable ORES, we disable ores extension [19:53:44] greg-g: [19:53:54] should I do it? [19:54:44] it would probably fix the issue, but we still don't know why, so how will we diagnose/figure out when we can re-enable? [19:54:59] it's only wikidata afawct, right? [19:55:13] seems to be the case, yes [19:55:16] the resulting 503s are causing damage to unrelated API reqs [19:55:19] I saw enwiki yesterday on wmf.4 [19:55:26] But far less volume [19:55:47] no_justification: so maybe we can diagnose when those show up... ? [19:55:48] lol I saw this in #wikidata > I'm maybe the bad guy that adds to much load to the server by changing mappings. [19:56:00] Amir1/awight: yeah, let's fix this for now. Do that. [19:56:06] kk, doing so. [19:56:10] greg-g: They're still there, just drowned out :) :( [19:56:10] sorry [19:56:15] no_justification: sweet? [19:56:17] :) [19:57:13] (03PS1) 10Ladsgroup: UBN! 
disbale ores for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386698 (https://phabricator.wikimedia.org/T179107) [19:57:39] (03CR) 10Awight: [V: 032 C: 032] UBN! disbale ores for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386698 (https://phabricator.wikimedia.org/T179107) (owner: 10Ladsgroup) [19:58:43] hmm well I kind of assumed it was api appservers crashing, but clearly whatever's going on here is not quite that direct in nature [19:58:51] (03Merged) 10jenkins-bot: UBN! disbale ores for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386698 (https://phabricator.wikimedia.org/T179107) (owner: 10Ladsgroup) [19:59:00] the set of mw* hosts throwing the fatals are actually in the jobrunner set, not the api_appserver set? [19:59:01] (03CR) 10jenkins-bot: UBN! disbale ores for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386698 (https://phabricator.wikimedia.org/T179107) (owner: 10Ladsgroup) [19:59:07] Amir1: merged. Want me to deploy? [19:59:09] awight: let jenkins merge it [19:59:18] nah, I deploy [19:59:22] k ty [19:59:39] bblack: weeirddd? [19:59:40] bblack: They're all failing jobs [19:59:54] rpc/RunJobs.php -> making requests to some API that fails [19:59:57] right I'm just wondering how failing jobs in the jobrunner set are tying up requests over on the api_appservers set [20:00:22] [WfIwlQpAID4AAHzRCkUAAAAW] /rpc/RunJobs.php?wiki=wikidatawiki&type=ORESFetchScoreJob&maxtime=60&maxmem=300M RuntimeException from line 82 of /srv/mediawiki/php-1.31.0-wmf.5/extensions/ORES/includes/Api.php: Failed to make ORES request to [https://ores.wikimedia.org/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=584587187&precache=true&format=json], HTTP request timed out. [20:00:41] Well, the jobs are failing, but it's due to failing API issues [20:00:44] and the ores service is making a request to wikidata.org’s API... [20:00:52] which takes 5s to time out. [20:00:53] It's not actually the *jobs* that are broken, it's that their API requests can't complete [20:00:54] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s) [20:00:54] :) [20:00:57] :( [20:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:02] T179107: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107 [20:01:36] I didn't deploy [20:01:41] forgot to fetch [20:02:13] It happens [20:02:21] :) [20:02:25] so in that error message, "ores.wikimedia.org" is mapped through the misc-web cluster in varnish to the ores service [20:02:56] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s) [20:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:08] Amir1: I do that almost every time. [20:03:17] so I'm guessing the chain of events there is something like jobrunner -> ores extension -> request to ores via cache_misc -> request to wikidata via cache_text -> api call timeouts and trashes API for other things? 
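To make the chain just described concrete, here is a toy Python model of how one slow innermost call pins a worker or connection at every layer above it; the layer names and per-hop costs are illustrative assumptions, not measured values:

    # Each layer holds a worker/connection for as long as everything below it takes.
    LAYER_OWN_COST = [
        ("jobrunner (RunJobs.php, ORESFetchScoreJob)", 0.05),
        ("Extension:ORES -> ores.wikimedia.org via cache_misc", 0.05),
        ("ores worker -> wikidata.org api.php via cache_text", 0.05),
    ]
    INNERMOST_API_TIME = 5.0  # assume the wikidata api.php call runs into its 5s read timeout

    def hold_times():
        total = INNERMOST_API_TIME
        held = []
        for name, own_cost in reversed(LAYER_OWN_COST):
            total += own_cost
            held.append((name, total))
        return list(reversed(held))

    for name, seconds in hold_times():
        print(f"{name}: slot held for ~{seconds:.2f}s per request")

Every hop above the slow call is pinned for the full five-plus seconds per request, which is how a single slow dependency can exhaust connection pools several layers away.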
[20:03:25] yes [20:03:30] Sounds about right [20:03:40] It’s a house of cards [20:03:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:03:47] that's a long chain of requests [20:04:06] or what awight said ..heh [20:04:19] It's deployed now [20:04:21] chain of requests makes it sound too stable :p [20:04:43] (03PS1) 10Ayounsi: Adding .gitreview [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386700 [20:04:48] And errors dropped off immediately [20:05:30] yup [20:05:44] yes all the graphs at different layers I'm observing seem to return to normal now [20:05:49] (03PS2) 10Ayounsi: Adding .gitreview [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386700 [20:06:04] except a couple that are expected to lag a bit [20:07:05] (03PS3) 10Ayounsi: Adding .gitreview [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386700 [20:07:29] (03CR) 10Ayounsi: [V: 032 C: 032] Adding .gitreview [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386700 (owner: 10Ayounsi) [20:08:35] PROBLEM - HHVM rendering on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:00] the varnish backend -> apiserver stuff is still taking a bit to recover [20:09:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:09:15] maybe needs a kick to get over the backlog on the connections, not sure yet [20:09:25] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 76628 bytes in 0.311 second response time [20:10:14] the loadavg on the machine question is dropping off though, letting it recover on its own so far [20:10:16] why was 2126 flapping this time? [20:10:23] that's the second time that machine flapped [20:12:43] sorry, it was a different one, one in eqiad [20:12:50] too similar of numbers :) [20:14:06] things look much better, but there's still some minor indicators, I donno. it may just take a while for things to settle out. [20:15:02] I'm going to restart the cp1066 backend, maybe that will speed the recovery up [20:16:45] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:16:53] I'll stab you icinga-wm [20:17:16] that's a big spike: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-1h&to=now&var-site=All&var-cache_type=text&var-status_type=5 [20:17:26] yeah [20:17:43] I think that may be delayed effect, we'll see [20:17:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:18:14] (from a bunch of backlogged requests/connections running their timers, finally timing out?) 
[20:19:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:19:52] the 5xx rate on oxygen logs seems to have finally truly dropped back to a trickle now [20:20:00] (there's always a trickle) [20:20:40] (03PS1) 10Ayounsi: Adding wheels for Netbox v2.2.2 + napalm [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386703 [20:21:06] (03CR) 10Ayounsi: [V: 032 C: 032] Adding wheels for Netbox v2.2.2 + napalm [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386703 (owner: 10Ayounsi) [20:21:34] no more fatals since ~20:04 [20:22:03] but I don't think the 503s rate really fully recovered until about 20:20 [20:22:24] bblack: anything that you see which could explain a slowdown in WDQS updates from eqiad? [20:22:25] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:22:53] gehel: I don't know what wdqs updates are, but quite possibly related to the mitigations above [20:23:21] gehel: 20:00 < logmsgbot> !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s) [20:23:21] T179107: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107 [20:23:54] I'm guessing these ores+wikidata queries that got stopped by the above, probably affect what you're looking at in wdqs? [20:24:19] I think we're ok for the train? Or should we hold it still? [20:24:41] bblack: I’d be surprised, since we were hitting the MW API and in my limited understanding, I think wdqs is independent. [20:24:59] Ok, so ores was timing out when contacting wikidata? [20:25:08] yes [20:25:18] hitting wikidata.org/w/api.php [20:25:25] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [20:25:26] I see huge time spikes on WDQS fetching data from wikidata [20:25:58] I mean normal data fetch is under 100ms and I see 6000-9000ms [20:26:00] yep, we see something similar, WDQS taking more time than usual to contact wikidata.org [20:26:21] any idea why wikdiata is slower than usual? [20:26:23] "more than usual" being 60x-90x more.... [20:26:35] I'm unsure if you've seen backscroll gehel but there have been a lot of issues, may want to loop in greg or brandon [20:26:40] looks like some servers though only - some are still under 100ms [20:26:41] note that I can kill the WDQS updater to remove some of the load on wikidata if needed [20:26:47] I'm unsure if this is on it's way to healing or a new/continuing issue [20:27:24] gehel: I think I know why codfw ones are fine... these probably feed from cache, at least 3 of them [20:27:27] 2 of them [20:27:51] yeah, I'm pretty much lost in that backlog... [20:27:58] FWIW, we disabled the ORES job which was hitting the wikidata api.php and timing out, so that shouldn’t contribute to load any more. [20:28:00] but since equiad ones are now behind, they fetch old ones, and different ones for each of the three, so they don't benefit from caching [20:28:02] Maybe there's a perf issue in Wikidata...generally? [20:28:08] wmf.5 did go out yesterday [20:28:22] well, the big public-facing 503s seem to have been healed [20:28:25] no_justification: looks like it's only on some servers. since some still respond fast [20:28:27] could it be https://gerrit.wikimedia.org/r/#/c/386594 ? 
[20:28:31] wdqs eqiad is starting to catch up on updates [20:28:33] that went out during morning swat [20:28:35] but yeah, there's some underlying cause related to wikidata slowness that I'm sure is still unsolved [20:29:07] Wikidata being slow would explain why two unrelated services -- WDQS and ORES -- both had issues talking to it [20:29:11] Oct 26 20:28:47 wdqs1003 bash[32609]: org.wikidata.query.rdf.tool.exception.ContainedException: Unexpected status code fetching RDF for https://www.wikidata.org/wiki/Special:EntityData/Q40423011.ttl?nocache=1509049726199&flavor=dump: 503 [20:29:13] greg-g: Good question [20:29:18] hmm probably not so much [20:29:33] Amir1: are these runs done? https://phabricator.wikimedia.org/T178180 [20:29:55] greg-g: He’s out for the next 30 min [20:30:01] greg-g: Yes, Amir did them [20:30:07] at least something fun to pass the time in the airport waiting for the flight... :) [20:30:13] and I manually deleted the property cache earlier today [20:30:15] hoo: as in completed, right? [20:30:18] Yeah [20:30:35] (03PS11) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [20:30:54] So the new code is definitely effective and it certainly doesn't reduce the response time [20:31:17] (increase?) [20:31:26] ^ merging the above so I don't lose my place in what I was working on before. It just cements my earlier ethtool changes for future reboots, it's a no-op as it rolls out now. [20:31:42] (03CR) 10BBlack: [C: 032] Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 (owner: 10BBlack) [20:32:09] hoo: so, what's the downside of reverting if that has caused this response time regression? [20:32:19] it's the birthday thing right? [20:32:21] greg-g: We loose the new feature [20:32:23] and that, yes [20:32:39] If we do it, can I please get 5-10m to get some production profiles beforehand? [20:32:47] oh 100% [20:32:48] So that I can at least know [20:32:57] Thanks, will take them ASAP :) [20:33:56] wow, this is crazy slow [20:34:57] yeah 60x to 90x is what I am seeing. Sometimes 200x+ - 15s instead of 70ms [20:35:04] something is seriously wrong somewhere [20:35:22] SMalyshev: thinking the bday present related change(s)? [20:35:29] (I'm not sure if it's that one change or more) [20:35:49] greg-g: no idea. technically it's possible since these influence rdf rendering and I do rdf rendering [20:35:55] :) [20:36:03] I could check out some slow reqs and see if they are repeatable [20:37:00] ok, at least not trivially repeatable - I request the same url that took 10s and it's fine now [20:37:29] greg-g: also looking at the contents, the url that was slow is not using the functionality that was added... [20:37:51] but it could be slowed down due to other requests taking too much cpu room? [20:37:59] for the record - https://www.wikidata.org/wiki/Special:EntityData/Q17261641.ttl?flavor=dump took 9s, and it does not have external ids [20:38:12] grh, screw our xhgui [20:38:13] Got an HTTP 503 when browsing mediawiki.org just now [20:38:33] greg-g: possible, of course, I was just going for the easy one :) [20:38:39] heh yes [20:38:39] Krinkle: Are you using Wikidata? :p [20:38:41] Request from .. via cp1053 cp1053, Varnish XID 731513773. Error: 503, Backend fetch failed at Thu, 26 Oct 2017 20:37:53 GMT. 
URL https://www.mediawiki.org/w/index.php?title=RFC_metadata&action=edit [20:38:42] the 503s are back [20:38:50] (GET) [20:38:52] getting them too [20:38:57] since somewhere around 20:26 [20:39:14] huh looks somebody broked wikipedia again... not just wikidata [20:40:15] Getting them about 1/5 page views. Coudl be unlucky. [20:40:33] wtf [20:40:45] Is there any reported 503 errors, I have had users in wikipedia-en irc channel complaining about 503s on wikipedia and foundationwiki [20:41:01] Zppix: yes, being investigated [20:41:09] Ok [20:41:12] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 [20:41:26] from 15/min to 7,000/min [20:41:32] the reason the rate doesn't seem as huge as being reported here is it mostly just affects logged-in users [20:41:38] the hot anonymous pages tend to survive this ok [20:42:07] Yeah, 503s at 7K/min is still "only" 0.03% of traffic [20:43:10] ok, so, new theories [20:43:11] ? [20:43:15] Although that grafana graph is definitely broken. It computes 7500 of 8.5 million as 20%. [20:44:07] i get them only as logged in, not as anon [20:44:10] Did anything change? [20:44:11] if that helps [20:44:34] (probably cause caching) [20:44:36] Yeah, it's only for backend fetches. So unless the page was recently edited or purged, it's fine. [20:44:38] Right? [20:45:47] We do purge a lot, but I think we have logic that prefers a cached object if a backend fetch fails with a 5xx code. But that probably applies to ttl/304 only, not after a real purge. [20:45:49] also api is a separate pool from normal pageviews [20:45:54] I think this is only killing api stuff [20:45:59] (03PS1) 10MaxSem: Enable Unicode section links on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) [20:46:15] bblack: Getting them about 1/5 of my page views. Could be I'm unlucky. [20:46:44] logged-in? [20:46:49] well either way, hmmm [20:47:43] Yes, logged-in. In my case. [20:48:07] I think maybe the temporary respite was just me moving varnish backend pooling around more [20:48:25] now it has settled back on cp1053 taking the brunt of it again, and maxing out its connections [20:48:43] It got it as anon as well now. After clicking Special:Random 50 times. [20:48:54] Although it happend for the Special page itself, not the target. [20:48:59] greg-g: I'm giving up for now [20:49:02] I have one profile [20:49:02] :/ [20:49:04] that ought to be enough [20:49:28] yeah the max conns for the non-api appservers are filled too [20:49:32] So probably affects any backend fetch from the text cluster. [20:49:34] so yeah it should affect regular pages [20:49:50] Special:Login is also affected, and RecentChanges. [20:49:54] as anon. [20:50:23] greg-g: If you want, we can revert now, I suppose [20:50:46] I can do it, will need some manual actions post deploy [20:51:03] Definitely holding the train. Forget what I said earlier [20:51:13] greg-g: cc ^ [20:51:14] shall we give it a shot? [20:51:15] yandex.ru is scanning us with /?curid=986117 requests, probably-unrelated. but it's the second search engine I've seen doing this, which probably means our own page output is leading them into this pattern... [20:51:20] I'm at a loss, but I suppose it'd be wise to see what we can do to speed up wikidata [20:51:32] hoo: ^ [20:51:38] Ok, let's try it at least [20:51:43] 503/min keeps raising, now up from 7000/min 10 minutes go to 10,000/min. 
[20:52:26] i had several 503s while editing a few minutes ago, reported a bug with how VE handles it when switching modes ;) [20:52:54] brion: good timing for a 503 :P [20:53:00] greg-g: Am I good to deploy the revert? [20:53:21] yes [20:54:18] (03PS1) 10Hoo man: Revert "Add property for RDF mapping of external identifiers for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386712 [20:54:29] (03CR) 10Hoo man: [C: 032] Revert "Add property for RDF mapping of external identifiers for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386712 (owner: 10Hoo man) [20:55:31] yeah both the backend connection pools to api_appservers and regular appservers are maxed out at 1K for sure, so failing reqs to both [20:56:03] (from just 1x of our eqiad caches, but if I depool that it will just move. it has to do with which URIs happen to chash towards that backend at any given time) [20:56:29] some URI that gets chashed there causes very slow requests, which ties up all the connections and makes other requests 503, basically.... [20:56:45] * hoo waiting for jenkins… [20:57:07] (03Merged) 10jenkins-bot: Revert "Add property for RDF mapping of external identifiers for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386712 (owner: 10Hoo man) [20:57:10] mediawiki-config should be on our list for migrating to docker next :) [20:57:12] it's odd that it's both pools, though [20:57:16] (03CR) 10jenkins-bot: Revert "Add property for RDF mapping of external identifiers for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386712 (owner: 10Hoo man) [20:58:19] (fwiw, it doesn't affect other unrelated connection pools, like the one for restbase, though) [20:58:59] !log hoo@tin Synchronized wmf-config/Wikibase.php: Revert "Add property for RDF mapping of external identifiers for Wikidata" (T178180) (duration: 00m 50s) [20:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:09] T178180: Enable RDF mapping for external identifiers for Wikidata.org - https://phabricator.wikimedia.org/T178180 [20:59:45] I'm trying to figure out how to find the slow query from this end, but maybe I won't before it goes away [21:00:04] MaxSem: (Dis)respected human, time to deploy CommTech (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T2100). Please do the needful. [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:24] MaxSem: please hold [21:00:33] !log Fully revert all changes related to T178180 [21:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:29] bblack: how are things looking? [21:03:44] doesn't seem to ahve slowed down 503s, yet [21:04:31] (03PS5) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [21:05:06] is there a way to differentiate between failures from the 1k pool overage and failures that actually hit the backend? [21:05:09] that is maybe a naive question [21:06:45] error rate seems to be improving [21:07:45] chasemp: in live investigation yes, in our normal graphing not really [21:08:10] * chasemp nods, gotcha thanks [21:08:16] there's a lot of cause of "503", it's a big bucket. 
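The depool-and-it-moves behaviour described above is what you would expect from consistent hashing of request URIs onto backends. A minimal Python sketch of the idea (the hash function, point counts and host names are illustrative, not Varnish's actual chash implementation):

    import hashlib
    from bisect import bisect

    def build_ring(backends, points_per_backend=100):
        """Many hash points per backend, sorted: a toy consistent-hash ring."""
        return sorted(
            (int(hashlib.md5(f"{backend}-{i}".encode()).hexdigest(), 16), backend)
            for backend in backends
            for i in range(points_per_backend)
        )

    def pick_backend(ring, uri):
        h = int(hashlib.md5(uri.encode()).hexdigest(), 16)
        keys = [k for k, _ in ring]
        return ring[bisect(keys, h) % len(ring)][1]

    backends = ["cp1052", "cp1053", "cp1055", "cp1065", "cp1066"]
    uri = "/w/api.php?action=some-slow-thing"   # stand-in for the problem URI pattern

    first = pick_backend(build_ring(backends), uri)                       # e.g. cp1053 owns the slow URIs
    second = pick_backend(build_ring([b for b in backends if b != first]), uri)
    print(f"{uri}: {first} -> {second} after depooling {first}")

The same URI always lands on one backend; depool that backend and its whole bucket, slow URIs included, simply lands on a neighbour — which is exactly the moving-503s pattern being described.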
[21:08:45] it would be nice, it would be a good future ticket to break that out better by using better text messages on the end of the 503 and graphing with that [21:08:47] I was mainly thinking overage issues may persist longer than direct backend failures as a sign of on-the-road-to-improvement [21:09:21] the "overage" issue where we're limited by the 1K connection pool is the bulk of the 503s we see [21:09:45] what I can't find yet is the needle in the haystack, the requests that are really timing out and/or destroying connections, leading to that conn limit. [21:10:22] ORES is still heavily throwing [21:10:37] is it possible to pull the plug there w/o major side effects? [21:10:48] Halfak awight ^^ [21:10:54] Amir1: ^ [21:10:55] hoo: we disabled ores on wikidata? [21:10:56] what's it throwing? I thought that went away earlier when we stopped the wikidata<->ores stuff? [21:11:14] greg-g: Yes, but it's still heavily throwing errors [21:11:15] hoo: we could, but ORES throwing is a downstream effect of api.php timing out. [21:11:19] greg-g: is enabled elsewhere too [21:11:23] greg-g: maybe there are leftovers of retring jobs [21:11:29] Zppix: yes, I know. [21:11:36] *retrying [21:13:15] I'm not seeing the ores fatals in fatalmonitor [21:13:23] er, logstash fatalmonitor that is [21:13:24] greg-g: it's exception [21:13:30] vah [21:14:02] Could it be the Serializer Stack overflows? [21:14:16] How is HHVM dealing with that? [21:14:33] yeah I saw that earlier trawling logstash, hmm [21:14:49] Fatal error: Stack overflow in /srv/mediawiki/php-1.31.0-wmf.4/vendor/wikimedia/remex-html/RemexHtml/Serializer/Serializer.php on line 274 [21:15:08] We can just disable that trial for now [21:15:11] shall I? [21:15:22] no_justification: ^ is that what you mentioned on monday? [21:16:16] Yes [21:16:22] will do [21:16:45] the first of those was at ~17:08 [21:17:39] and nearby in IRC the conversation is hard to follow, but includes this gem: [21:17:42] 17:06 < awight> robh: almost relevant, though :) there’s like a 5m in 100 chance I’m about to break wikipedia. [21:17:45] PROBLEM - Disk space on cp1053 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=85%) [21:17:57] (03PS1) 10Hoo man: Temp. disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 [21:18:02] bblack: oh uh I was just having fun. Nothing to see here. [21:18:04] ^ oops, that's me, I was logging too much rwa traffic [21:18:04] no_justification: Good to go? [21:18:22] trailing WS [21:18:24] :S [21:18:45] RECOVERY - Disk space on cp1053 is OK: DISK OK [21:18:48] (03PS2) 10Hoo man: Temp. disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 [21:19:05] hoo there's a task for that [21:19:13] I know, linked in the change [21:19:35] hoo you will probaly want to do [21:19:36] Bug: [21:19:52] even if it dosen't really fix it, it at least leaves a trail in the task :) [21:20:34] (03PS3) 10Hoo man: Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) [21:21:19] greg-g: ^? [21:22:06] <|Satan> hi there, I'm getting errors when visiting shortcuts. Already known? [21:22:06] (03CR) 10Paladox: Temporary disable remex html (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:22:23] yup, known problem. 
Thanks though :) [21:22:27] <|Satan> ok :) [21:22:44] hoo: yeah, doit [21:22:59] (03PS4) 10Hoo man: Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) [21:23:07] (03CR) 10Hoo man: [C: 032] Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:23:34] I think we're in need of a better theory pretty quickly, I just have no idea where the clues are pointing anymore [21:24:54] (03CR) 10jerkins-bot: [V: 04-1] Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:25:14] phpcs m( [21:25:20] (03CR) 10jerkins-bot: [V: 04-1] Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:25:52] (03PS5) 10Hoo man: Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) [21:26:06] (03CR) 10Hoo man: [C: 032] Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:28:43] (03Merged) 10jenkins-bot: Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:28:52] (03CR) 10jenkins-bot: Temporary disable remex html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386730 (https://phabricator.wikimedia.org/T178632) (owner: 10Hoo man) [21:30:14] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s) [21:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:22] T178632: Stack overflow in remex_html serializer - https://phabricator.wikimedia.org/T178632 [21:31:47] forgot git rebase [21:32:33] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s) [21:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:10] This doesn't seem to have helped [21:35:29] What info could i gather from reporting users that would be helpful? [21:35:45] not much, most of the errors being seen are unrelated to the real underlying issue [21:35:55] hoo, it fixes the error i was getting [21:36:15] 503 rate is still chugging along [21:36:16] on https://fa.wikipedia.org/w/index.php?title=ژان-پل_سارتر&oldid=21299457 [21:36:32] Well, it fixes the immediate error with the page [21:36:35] bblack: well let me know i got atleast 3 users i could get info from [21:37:10] the bulk of the 503 errors are caused by Varnish not being able to even attempt to satisfy the request, because it has run out of connections to the backend (MediaWiki) [21:37:30] it has a limit of 1K that we normally never get very close to, but now we have that limit pegged [21:37:32] Yes i understand that part :) [21:37:46] it helps me think anyways :) [21:37:50] And let me guess no way we can up the limit [21:38:07] we can up the limit, but it's just going to move the meltiness down to another layer somewhere, probably [21:38:15] still, worth seeing what happens, I guess? 
[21:38:26] I didn't want to suggest that earlier :) [21:39:23] bblack: i mean we can just lower it again if worse problems occour or until we find the real fix [21:40:11] !log raising backend max_connections for api.svc.eqiad.wmnet + appservers.svc.eqiad.wmnet from 1K to 10K on cp1053.eqiad.wmnet (current funnel for the bulk of the 503s) [21:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:17] parallelism is slowly climbing, as are the count of CLOSE_WAIT sockets at any given time [21:42:29] still, most of the connections are closing rather quickly [21:43:07] the 503 rate dropped off pretty sharply and quickly with that [21:43:24] huh [21:43:24] but I'm not sure what will happen next, it may take a while to find a new limit somewhere else? [21:43:45] (03PS1) 10RobH: refactoring bastion into profiles [puppet] - 10https://gerrit.wikimedia.org/r/386752 [21:44:03] (03PS2) 10RobH: refactoring bastion into profiles [puppet] - 10https://gerrit.wikimedia.org/r/386752 [21:44:31] CLOSE_WAIT socket count continues to pile up faster than they go away [21:44:33] (03CR) 10jerkins-bot: [V: 04-1] refactoring bastion into profiles [puppet] - 10https://gerrit.wikimedia.org/r/386752 (owner: 10RobH) [21:44:40] (03PS1) 10Alexandros Kosiaris: Remove $cluster_cidr from k8s::controller [puppet] - 10https://gerrit.wikimedia.org/r/386753 [21:44:42] (03PS1) 10Alexandros Kosiaris: k8s::controller: support service account token signing [puppet] - 10https://gerrit.wikimedia.org/r/386754 (https://phabricator.wikimedia.org/T177393) [21:44:43] but it will take a while before that reaches the ~10K mark, maybe several more minutes [21:44:44] (03PS1) 10Alexandros Kosiaris: Enable k8s::controller manager ServiceAccount signing [puppet] - 10https://gerrit.wikimedia.org/r/386755 (https://phabricator.wikimedia.org/T177393) [21:47:09] I wonder if having more wiggle room helps it recover easier, not just continue to pile up? [21:48:20] Should we start a task or is there one already? [21:48:37] (03CR) 10Zoranzoki21: [C: 031] Enable Unicode section links on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:48:43] no_justification: it's possible, eventually. so far it's just going up and up slowly [21:49:14] Eh, scratch that theory :\ [21:49:25] we were holding at ~1K combined established+close_wait sockets for each of 10.2.2.2 + 10.2.2.22 (appservers+api.svc.eqiad.wmnet) [21:49:36] err typo above [21:49:40] we were holding at ~1K combined established+close_wait sockets for each of 10.2.2.1 + 10.2.2.22 (appservers+api.svc.eqiad.wmnet) [21:49:43] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3713875 (10awight) [21:50:05] now they're at ~2.8K and 3.0K respectively [21:50:24] And climbing if i read your messages right? [21:50:30] right [21:50:42] Maybe it will start falling? Hopefully.. 
[21:50:43] when they reach ~10K, it will fall into the same pattern again, if it gets there [21:51:06] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:51:26] the growth rate on those socket counts is currently ~300/min [21:51:36] so it could take another 20 minutes or so to peak [21:51:46] Gives us time to think at least [21:51:53] unless it gets us over the hurdle of some other rate of timeouts vs reqs/sec [21:52:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:52:34] what would be interesting right now would probably be to see which 503s are still happening heh [21:53:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:53:07] bblack: i asked the users that reported 503s to me to let me know [21:53:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:53:25] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:15] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 76628 bytes in 0.371 second response time [21:54:20] there are still ORES errors rolling in :/ [21:54:33] wtf [21:54:40] I will join you soon [21:55:22] I'm looking at live logs of the remaining 503s on cp1053, but when I go try the same URLs myself with curl nothing bad happens [21:55:29] they seem to be just the usual spurious 503s [21:55:38] bblack: Do you see any schema? [21:55:43] * pattern [21:55:48] like Wikidata, …? [21:55:48] - FetchError http first read error: EOF [21:55:51] Could it just be cp1053 being weird bblack ? [21:56:06] Zppix: no, I've moved this problem around to other hosts, it just happens to be here now [21:56:13] we can depool cp1053, it will just move to another [21:56:29] bblack: i see okay there goes my theory i had [21:56:56] for whatever it's worth, WDQS is still lagging, it looks like wikidata.org/w/api.php is still slower than usual [21:57:30] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=now-6h&to=now&panelId=8&fullscreen [21:57:32] now that has to be really weird for the api to take that long to load. [21:57:44] Weirdly wikidata edits dropped dramatically at around the same time [21:57:50] en.wp.org is fast https://en.wikipedia.org/w/api.php [21:57:57] maybe wikidata related configs? [21:58:06] the rising connection counts seem to have mysteriously stabilized now [21:58:45] we may have overcome some hurdle here, where the rate of the "real" errors timing out connects vs the rate of legit requests lets it all "fit" within this new threshold without causing unrelated fallout. [21:59:02] bblack: black magic [21:59:22] it's still an insane rate of reconnections/failures :P [21:59:39] bblack: maybe the 1k rate was just too low :P [22:00:29] well, from a certain point of view that's certainly true :) [22:00:42] but it was ok by a healthy margin before today, something else is wrong [22:01:33] and we're only having to raise it on 1/N of chash targets, meaning the issue is localizable to some subset of URLs [22:02:17] I need to get some sleep. Not much I can do anyway... [22:02:22] bblack: i have to brb to eat but i'll catch up and think on it and i'll let you know what i think [22:02:36] bblack: These varnishes don't handle non-application traffic (eg. api), right?
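A quick back-of-envelope check of the "another 20 minutes or so to peak" estimate above, using the approximate numbers quoted in the log:

    current_sockets = 3000    # ~2.8K-3.0K established + CLOSE_WAIT per pool, from the counts above
    growth_per_minute = 300   # observed growth rate
    new_limit = 10000         # the raised max_connections

    minutes_to_peak = (new_limit - current_sockets) / growth_per_minute
    print(f"~{minutes_to_peak:.0f} minutes until the 10K limit is hit")   # ~23 minutes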
[22:02:45] wouldn't be related to someone over on Commons repeatedly trying and failing to import content from Flickr ? [22:03:07] hoo: they do handle api traffic as well [22:03:15] NotASpy: yes, API requests might have failed randomly [22:03:32] ok, hm [22:03:35] they have separate backend sockets for api.svc and appservers.svc requests with separate 1k (now 10k) limits, and both pools are being affected [22:03:52] both pools? That sounds weird [22:03:56] yes, it does [22:04:21] because any given request URL only maps to one or the other backend (appservers or api) [22:04:46] and when I depool the currently-problematic cache, the whole problem moves to a different cache via chashing [22:05:03] it seems like that shouldn't be happening [22:05:41] We have an interesting change in Wikidata edit traffic at the same time [22:05:48] no idea whether that correlates or even causes, though [22:06:00] probably just correlate… slow API -> fewer edits [22:06:29] MaxSem: the errors reported include "the file submitted was empty", "not authorized 9bad api response [isauthok]: null)", "/files of the mime type "text/html' are not allowed to be uploaded" and "auth not ok" [22:06:52] but as Flickr has been having their own technical issues, I don't know where the issue might actually sit. [22:07:39] could an import script failing to fetch images from Flickr, or being hit by latency issues, cause problems at the WMF end ? [22:08:00] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3713945 (10RobH) Ran full test today, no errors. ``` /------------------------------------------------------------------------------\ | Build : 4239.44 ePSA Pre-boot System Assessment ServiceTag... [22:08:14] bblack: if you want to rubberduck the problem w/ me in a hangout or something I can, I'm unsure how to help, my poking didn't tell me much [22:08:38] what's annoying is even going through all the (much lower rate of) 503 errors still present on cp1053, I can't find one I can repro a problem on, and they seem semi-random [22:09:28] bblack: Have they gone back to the ones you restarted? [22:09:35] Or have they stayed "fixed" when it migrated? [22:09:48] they eventually cycled back [22:09:55] but only because there's only so many servers in the set [22:10:07] Ah ok that was my question [22:10:17] Cuz if they stayed "fixed" -- finish restarting and see if it comes back [22:10:18] But nvm [22:10:21] bblack: given it's getting to be the end of day for you (past), can I ask you to summarize your findings/thinkings before you disappear? [22:10:26] on task/something [22:11:01] bblack: what about the search engine request if we kill it server side could that help any? [22:11:02] I will before I stop, but I'm not stopping yet :) [22:11:28] Zppix: no it's pretty low rate, just something I noticed flying by that stood out [22:12:02] bblack: i was thinking if we killed any large requests maybe it help it recover [22:12:23] do we have some way to look at the api.svc and/or appserver.svc side and see if requests are totally bombing out and killing hhvm? 
[22:12:23] bblack: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=master&from=now-6h&to=now [22:12:28] I've really only been looking from this end [22:12:35] The throughput looks very interesting [22:13:19] the first thing that stood out to me in that graph, was a sharp jump in s5 read rate just before problematic times earlier [22:13:36] well, ~16:50 anyways [22:13:59] and then not long after, a dropoff in x1 read rate? [22:14:03] Also the edit rates on Wikidata changed at the same time [22:16:50] I could raise the max_conns to 10K for api+appservers across the board (in actual config rather than a temp hack on one host), and then this pattern would hold so long as nothing else gets worse [22:17:12] right now if anything depools/moves for any reason the problem will shift back to one of the max_conns=1K hosts [22:17:37] bblack: i mean thats the "wikimedia hacky" way but a real fix would be perferred in my opinion but im fresh outta ideas atm [22:19:06] I am too [22:19:29] I just don't want that to be the final solution. we never found the underlying cause, so 10K limits are the new normal to cope with whatever's crashing connections to MW [22:19:29] bblack: Can you check whether we have an increase in action=compare calls? [22:20:09] so that makes me think, maybe the reason I can't find the connection-crashing requests is that they're not crashing connections [22:20:25] it's just that all the requests have their usual success rate, but some of them got a lot slower, enough to tie up a lot more connections? [22:20:42] some common thing that took 1ms before now takes 100ms, that would be enough to cause this probably [22:20:51] (and to be fixed by raising the connection limit) [22:21:08] hoo: looking.... [22:21:13] bblack: that would match what we still see on wdqs... [22:21:58] are there some graphs that can be looked at on the applayer end to find out if avg response time of various API calls has slowed dramatically or something? [22:22:22] of course, we already kinda know there are earlier reports of "wikidata is slow" [22:22:28] but how is it slow and what's it still affecting? [22:22:37] I would assume so but i dont know where? Theres like a graph for everything on grafana [22:22:40] and is it the problem or just the top symptom? [22:22:50] (it being wikidata in this case) [22:23:19] But wikidata wouldnt cause all projects to be affected would it? [22:23:25] PROBLEM - Check Varnish expiry mailbox lag on cp1053 is CRITICAL: CRITICAL: expiry mailbox lag is 2075307 [22:24:16] maybe, they're all kinda interrelated in ways [22:24:21] bblack: https://grafana.wikimedia.org/dashboard/db/api-requests?refresh=5m&orgId=1 [22:24:44] heh, mailbox lag on cp1053 is a symptom of "handling" all the extra stuff through the new limits, it will probably require a depool eventually... 
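The 1ms-to-100ms hypothesis above is essentially Little's law: concurrent connections held ≈ request rate × time each request holds a connection. A small worked example with assumed round-number rates (not measurements from this incident):

    def connections_held(requests_per_second, seconds_per_request):
        """Little's law: average concurrent connections = arrival rate x holding time."""
        return requests_per_second * seconds_per_request

    slow_path_rate = 100   # assume 100 req/s hit the newly-slow code path

    print(connections_held(slow_path_rate, 0.001))   # 1 ms each   -> ~0.1 connections held
    print(connections_held(slow_path_rate, 0.100))   # 100 ms each -> ~10 connections held
    print(connections_held(slow_path_rate, 5.0))     # 5 s timeout -> ~500 of a 1K pool

Connections held scale linearly with latency, so a pure latency regression — with every request still "succeeding" — is enough to pin a fixed-size pool without any request actually crashing a connection.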
[22:24:46] interestingly, it has a spike in editing API 10 minutes before the problem started [22:25:11] MaxSem: https://grafana.wikimedia.org/dashboard/db/wikidata-edits [22:25:16] this also looks very very weird [22:25:42] hoo: action=compare right seems fairly consistent since ~06:00UTC earlier today (the log I'm looking at) [22:25:46] s/right/rate/ [22:26:00] no huge spike/rise in it anyways [22:26:14] crap :/ [22:26:34] I don't know what "size change" is there, but it drops like a rock just when the first 503 spike started [22:26:49] bblack: It's basically how much data (bytes) are added to Wikidata [22:27:17] around the same timeframe, QuickStatements rises [22:27:40] under the Oauth graph I mean [22:28:14] I'll check on its edits [22:29:02] in the meantime, I'm going to have to eventually depool cp1053 and restart its backend. the "mailbox lag" alert above is it not dealing well with the traffic flowing through it right now. [22:29:21] so I'm gonna puppetize the 10K change for all the others first, so that things don't fall apart when the traffic moves to another cache host [22:31:03] do we have a task/bug/something somewhere yet that covers this whole incident, or? [22:31:08] nope [22:31:09] I'm guessing not yet [22:31:25] bblack: we had the ORES one, but we're way beyond that now :( [22:32:28] yeah clearly ORES was just a symptom of a wikidata slowness [22:32:39] yup [22:33:09] (03PS1) 10BBlack: cache_text: raise MW connection limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/386756 [22:34:01] there has to be, at the root of this, some explicable performance regression affecting some subset of MediaWiki requests [22:34:26] maybe the subset is "wikidata", or maybe wikidata is just the most-obvious fallout of something deeper [22:34:47] (03CR) 10BBlack: [C: 032] cache_text: raise MW connection limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/386756 (owner: 10BBlack) [22:35:49] bblack: I don't see anything regarding Wikidata that's off [22:36:27] (03PS1) 10Ayounsi: Add django ldap support wheels [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386757 [22:39:11] !log restarting varnish-backend on cp1053 (mailbox lag from ongoing issues elsewhere?) [22:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:29] so the target cache is now cp1052 it seems [22:41:30] I don't have full backlog, but I assume you've tried reverting ladsgroup's deploys? 
[22:41:41] 18:52 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s) [22:41:41] 18:51 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s) [22:41:41] T179038: DomainException / NullResult holds no constraint - https://phabricator.wikimedia.org/T179038 [22:41:52] they correspond with a big spike in db queries [22:42:34] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1 [22:43:07] yeah, those are also right near the time we first noticed the fatals spiking, which was ores->wikidata [22:43:26] RECOVERY - Check Varnish expiry mailbox lag on cp1053 is OK: OK: expiry mailbox lag is 0 [22:43:49] That was in response to an exception I spotted and reported this week [22:43:54] I would revert those with extreme prejudice [22:43:56] I don't think those are among the things we've tried to revert [22:44:10] No, they are not [22:44:13] how bad was the exception? [22:44:29] Annoying. [22:44:33] Non-fatal, best I could tell [22:44:38] Well, non-fatal to us [22:44:39] annoying would be an improvement, I think [22:44:43] Sent users errors [22:44:56] I don't have my key, but I'd be reverting if I did [22:45:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:45:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:45:18] :) [22:45:18] we can also revert that extension to wmf.4 right away [22:46:01] Let's just put all of Wikidata back on wmf.4 [22:46:04] Quickest fix [22:46:36] that should work [22:46:36] yes [22:46:50] Also lets us test vs testwikidatawiki @ wmf.5 [22:47:01] * no_justification should've suggested that earlier [22:47:04] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidata to wmf.4 [22:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:38] (03PS1) 10Chad: Wikidatawiki to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386758 [22:47:40] (03CR) 10Chad: [C: 032] Wikidatawiki to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386758 (owner: 10Chad) [22:48:50] (03Merged) 10jenkins-bot: Wikidatawiki to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386758 (owner: 10Chad) [22:48:58] Well?
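[Editor's aside: the wmf.4 rollback above works by repointing the wiki's entry in the wikiversions mapping and syncing it out, which is what "rebuilt wikiversions.php and synchronized wikiversions files" refers to. A small sketch of sanity-checking which branch a given wiki is pinned to after such a change; the file path and the exact key/value format are assumptions based on how MediaWiki deployments map dbnames to php-* branch directories, not details taken from the log.]

```python
import json

# Hypothetical path on a deployment host; adjust to wherever wikiversions.json
# actually lives in your checkout (an assumption, not from the log above).
WIKIVERSIONS = "/srv/mediawiki-staging/wikiversions.json"

def branch_for(dbname: str) -> str:
    """Return the php-* branch directory a wiki is pinned to, e.g. 'php-1.31.0-wmf.4'."""
    with open(WIKIVERSIONS) as f:
        versions = json.load(f)  # assumed mapping of dbname -> branch directory
    return versions[dbname]

if __name__ == "__main__":
    for wiki in ("wikidatawiki", "testwikidatawiki", "enwiki"):
        print(wiki, "->", branch_for(wiki))
    # After the revert you would expect wikidatawiki on php-1.31.0-wmf.4 while
    # testwikidatawiki stays on php-1.31.0-wmf.5, which is the side-by-side
    # comparison mentioned above.
```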
[22:48:59] (03CR) 10jenkins-bot: Wikidatawiki to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386758 (owner: 10Chad) [22:49:26] it'll probably take a few minutes [22:49:31] to know, I mean [22:50:05] (03CR) 10Ayounsi: [V: 032 C: 032] Add django ldap support wheels [wheels/netbox] - 10https://gerrit.wikimedia.org/r/386757 (owner: 10Ayounsi) [22:50:30] another datapoint: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=7&fullscreen&orgId=1 [22:51:05] mysql will use a temporary table to sort [22:51:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:51:32] ori: The code we're talking about here is only running on wikidatawiki [22:51:43] but there might be other changes in wmf5 [22:51:46] that touch this [22:52:09] I bet it's that change [22:52:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:53:05] ori: The backport? [22:53:56] "Fix sorting of NullResults" [22:54:18] FWIW, so far no real improvement, but I don't know how long it might take for (a) the underlying stuff to recover or (b) varnish to then recover [22:54:20] I doubt it… that's not deployed anywhere except on wikidatawiki [22:54:31] * (test)wikidata [22:54:32] there are probably some slow-ass sorts on mysql [22:54:36] they should be killed [22:54:41] with the 10K limit we don't have the 503s to go by, so I'm just looking at my connection counts on the current target varnish (cp1052), and they haven't dropped off yet [22:54:41] they'll take a while to clear out [22:55:36] I don't see any long-running evil queries on either s1 or s5 [22:57:17] Quick improvement to ORES: rather than throw RuntimeException, log it at the error() level instead. Then we can group on URL and such in logstash and get better aggregate #s of that problem [22:58:17] (more generally, it'd be nice if our exceptions could take parameters that get passed on to logstash nicely) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:57] I'm calling it a day [23:01:43] I hope there's nothing important re Wikidata anymore. You can ping me on my phone, if it's very important [23:01:44] see you [23:01:52] bblack: What are your thoughts? How urgent is this still? [23:03:43] I don't know [23:04:06] we don't seem to be having 503 issues :) [23:04:31] we're operating under some new rules (high connection limits), which avoid the worst of that 503 cause [23:04:46] whatever's happening through there causes mailbox lag on it though, so it's not great [23:05:17] lots of things can cause that apparently, it's nothing new, but it will make that varnish fall over eventually, and then another will start its clock [23:05:41] So with the higher limit, we're not causing contention for requests not actually causing errors [23:05:50] So it's limited to the offending $whatever_it_is? [23:06:16] well, $whatever_it_is still has an effect on the cache it's passing through, the effect is just more subtle and takes longer to break things [23:06:18] are you sure that code only runs on wikidata wiki?
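[Editor's aside: the "slow-ass sorts on mysql" suspicion and the follow-up check for long-running queries on s1/s5 amount to scanning the processlist for queries that have been running for a long time. A hedged sketch of that kind of check using PyMySQL follows; the host, credentials, and 30-second threshold are placeholders, and the real WMF tooling may well differ.]

```python
import pymysql  # assumes PyMySQL is installed; any MySQL client library would do

# Placeholder connection details: not real production hosts or credentials.
conn = pymysql.connect(host="db1051.example", user="watcher", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)

LONG_RUNNING_SQL = """
    SELECT id AS id, user AS user, db AS db, time AS time,
           LEFT(info, 120) AS query_head
    FROM information_schema.processlist
    WHERE command = 'Query' AND time > %s
    ORDER BY time DESC
"""

with conn.cursor() as cur:
    cur.execute(LONG_RUNNING_SQL, (30,))  # anything running longer than 30 seconds
    for row in cur.fetchall():
        print(f"{row['time']:>6}s  {row['user']:<12} {row['db'] or '-':<12} {row['query_head']}")
    # Whether a query actually spills to a temporary table for sorting would
    # need EXPLAIN on the suspect statement; this only surfaces what is
    # currently running long on the shard.

conn.close()
```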
[23:06:46] eventually cp1052 will succumb to its mailbox lag and start causing a new type of 503, and then we can restart it and the cycle will start elsewhere. [23:06:53] ori: Let me check once more, but yes [23:06:55] The ORES bits or the Wikidata fix? [23:07:05] The Wikidata thing was only coming from wikidatawiki when I reported it [23:07:08] the wikidata fix [23:07:30] The ORES bit was also happening /before/ that fix to Wikidata went out, by the way [23:07:33] Just in far lower volumes [23:08:01] ori: Yeah, absolutely sure [23:08:17] what's the ORES bit? [23:08:28] https://phabricator.wikimedia.org/T179107 [23:09:46] ori: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.10.26/mediawiki?id=AV9a8YlfePsvZ6Lq26zO&_g=h@44136fa looks like that in logstash [23:09:51] right, so sometimes ORES was going fatal because a query to wikidata.org:443 timed out. And then when this mess started up earlier, one of the primary symptoms at the time was a suddenly much higher rate of that going on. [23:09:55] (not a shareable link, sorry, it's a single doc) [23:10:15] I have to head out shortly, I have an appointment pretty soon. [23:10:21] "[mediawiki/services/ores/deploy@STABLE_REVSCORING_1] Blindly choose a timeout of 15 seconds", awesome [23:10:54] let's not worry about the fact that suddenly the queries we're sending wikidata are taking longer to process [23:10:58] instead let's increase the timeout [23:11:04] by a factor of three, because why not [23:11:42] sorry, because "Hopefully the wikidata API will respond within this window. " (https://gerrit.wikimedia.org/r/#/c/386691/) [23:11:47] so, yeah, I think one of the unanswered questions from the start of this (it has been asked) is: why/how/what is wikidata slow now? [23:12:22] because the sorting of some common query changed [23:13:21] or some other reason [23:15:08] what shard is db1051? [23:16:09] s1 [23:16:09] so pulling from some relevant logs... [23:16:22] real 0m5.717s [23:16:22] user 0m0.012s [23:16:22] sys 0m0.012s [23:16:26] bleh [23:16:43] https://www.wikidata.org/wiki/Special:EntityData/Q42330212.ttl?nocache=1509059658256&flavor=dump [23:16:55] ^ that took 39s to fetch from varnish->MW a couple minutes ago [23:17:05] and when I tested it manually with curl it took 5.7s [23:17:08] which still seems a bit slow [23:17:21] it's a URL found through the logs of live requests [23:18:30] dumping with nocache? Is someone crawling us? [23:18:55] s8 is being created because growth of wikidata's revision table has been explosive. MariaDB's planner might be doing poorly with it or it may just be too damn big for where it lives today. [23:19:07] Or is that internal? [23:19:44] most of the slow requests passing through the target varnish are similar wikidata queries [23:20:37] I think [23:22:45] I'm going to restart cp1052 varnish backend again just to clear out other noise that makes it harder to see what's going on [23:24:18] I gotta go home, I have my key there, I'll help if this is still ongoing [23:24:25] cp1067 is the new target [23:24:32] thanks ori :) [23:28:37] watch it be googlebot in the end [23:29:33] well, cp1067 isn't building up the connection stats I'd expect [23:29:41] none of them seem to clearly be the target anymore... [23:30:23] problem mysteriously poofed? Probably not right when I moved the traffic above; probably earlier, but varnish was still suffering until I moved it. [23:33:12] yeah, still none of the eqiad text caches are showing signs of connection problems.
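[Editor's aside: the 39s-through-varnish versus 5.7s-with-curl comparison above can be reproduced with a quick timing probe against the same Special:EntityData URL. A minimal sketch using the requests library follows; the sample count, timeout, and User-Agent are illustrative choices, and the thresholds for "suspicious" are the editor's reading of the log rather than any official target.]

```python
import time
import requests

# The URL quoted in the log above; the nocache/flavor parameters are left as-is.
URL = ("https://www.wikidata.org/wiki/Special:EntityData/Q42330212.ttl"
       "?nocache=1509059658256&flavor=dump")

def timed_fetch(url: str) -> float:
    """Fetch once and return wall-clock seconds, roughly what `time curl ...` showed."""
    start = time.monotonic()
    resp = requests.get(url, headers={"User-Agent": "latency-probe (ops debugging)"}, timeout=60)
    elapsed = time.monotonic() - start
    print(f"HTTP {resp.status_code}  {len(resp.content)} bytes  {elapsed:.2f}s")
    return elapsed

# A few samples smooth out one-off cache effects; 5-6s for a single entity dump is
# already suspicious, and 39s at the varnish->MW layer suggests requests were
# queueing behind other slow backend work rather than this one URL being uniquely slow.
samples = [timed_fetch(URL) for _ in range(3)]
print(f"avg: {sum(samples) / len(samples):.2f}s")
```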
[23:33:41] the last action I took was restarting cp1052, which would've (at least, the several previous times I did this) shifted the load to a random new cache [23:34:34] it was ~22:40 (nearly an hour ago) when we switched from cp1053 to cp1052 [23:35:32] and it was during the earlier part of that nearly-an-hour window that the wmf.4 revert for wikidatawiki happened [23:35:47] I think somewhere in there the underlying problem went away, and cp1052 was just still suffering lingering effects. [23:35:53] that's my best hypothesis right now [23:36:03] If things aren't on fire I'm going to dip out too. It's almost 5 and I have somewhere to be. [23:36:08] I'll catch up later [23:36:26] I also have to run, I've been getting evil glares from my wife for a while now, because we're late for dinner at my parents' house :) [23:36:56] but yeah, I think the revert did it, eventually [23:37:17] [plus kicking the tires of the varnish cluster to shake it out of its bad pattern afterwards] [23:38:02] so I'm running to dinner too, but I'll come back and check in and see about helping to write up some kind of report at least, later. [23:38:23] feel free to call if things somehow reverse course and start melting, but I don't think they will now.