[00:05:25] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.112 second response time [00:18:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.121 second response time [00:28:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.099 second response time [00:36:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.087 second response time [00:47:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.086 second response time [01:06:55] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.113 second response time [02:15:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1964 bytes in 0.273 second response time [02:38:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.100 second response time [02:42:55] RECOVERY - Long running screen/tmux on snapshot1001 is OK: OK: No SCREEN or tmux processes detected. 
[02:44:05] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1960 bytes in 0.110 second response time [02:48:15] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.3) (duration: 12m 02s) [02:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1975 bytes in 0.096 second response time [03:12:46] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54059 MB (3% inode=99%) [03:31:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1000.28 seconds [03:32:03] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 12m 10s) [03:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 22 03:39:38 UTC 2018 (duration 7m 36s) [03:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:26] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1976 bytes in 0.108 second response time [04:00:46] RECOVERY - Disk space on maps1001 is OK: DISK OK [04:29:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 93.68 seconds [05:03:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434434 (https://phabricator.wikimedia.org/T190148) [05:04:27] (03PS2) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434434 (https://phabricator.wikimedia.org/T190148) [05:06:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434434 
(https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:07:53] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434434 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:09:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434434 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:10:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 for alter table (duration: 01m 44s) [05:10:32] !log Deploy schema change on db1078 - T191519 T188299 T190148 [05:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:37] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:10:38] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:17:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4221716 (10Marostegui) ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : OK ``` [05:19:07] (03PS1) 10Marostegui: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434435 (https://phabricator.wikimedia.org/T194852) [05:20:20] (03CR) 10A2093064: Enable "File mover" flag on zh.wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [05:21:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434435 (https://phabricator.wikimedia.org/T194852) (owner: 10Marostegui) [05:21:36] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK 
- pattern not found - 1971 bytes in 0.104 second response time [05:22:49] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434435 (https://phabricator.wikimedia.org/T194852) (owner: 10Marostegui) [05:23:03] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434435 (https://phabricator.wikimedia.org/T194852) (owner: 10Marostegui) [05:24:49] (03CR) 10R4q3NWnUx2CEhVyr: "The API completely changed in varnish 5.2 it is an incompatible change with 5.1 So you cannot compile this varnishkafka change against li" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 (owner: 10R4q3NWnUx2CEhVyr) [05:24:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T193835 (duration: 01m 19s) [05:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:57] T193835: Move db1067 to row C - https://phabricator.wikimedia.org/T193835 [05:25:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4221721 (10Marostegui) 05Open>03Resolved I have repooled this host. It didn't have any issues after many days and many reboots. So it was probably a one time thing. Resolving... [05:33:31] (03PS5) 10Marostegui: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: 10Jcrespo) [05:42:52] (03PS3) 10R4q3NWnUx2CEhVyr: Allocate only the needed size for the format structure array [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 [05:42:54] (03PS1) 10R4q3NWnUx2CEhVyr: Remove a compile-time check macro that was renamed/moved in varnish 6.0 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/434436 [05:42:56] (03PS1) 10R4q3NWnUx2CEhVyr: Add comment about CLI arguments definition. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/434437 [05:46:17] (03Abandoned) 10R4q3NWnUx2CEhVyr: Add comment about CLI arguments definition. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/434437 (owner: 10R4q3NWnUx2CEhVyr) [05:46:38] (03Abandoned) 10R4q3NWnUx2CEhVyr: Remove a compile-time check macro that was renamed/moved in varnish 6.0 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/434436 (owner: 10R4q3NWnUx2CEhVyr) [05:50:41] (03PS2) 10R4q3NWnUx2CEhVyr: Change to use VUT [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 [05:50:43] (03PS4) 10R4q3NWnUx2CEhVyr: Allocate only the needed size for the format structure array [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 [06:01:39] (03PS3) 10R4q3NWnUx2CEhVyr: Change varnishkafka to use the new VUT API [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 [06:14:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1968 bytes in 0.128 second response time [06:31:05] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:32:06] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_puppetrun] [06:37:55] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.104 second response time [06:57:15] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:26] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:06:28] (03PS2) 10Elukey: role::druid::analytics::worker: upgrade to Druid 0.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/432582 (https://phabricator.wikimedia.org/T193712) [07:10:06] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434442 [07:10:11] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434442 [07:11:53] (03CR) 10Muehlenhoff: [C: 031] "That looks good, better split this into two commits, though; one which creates the repository section and one which adds it to toolforge, " [puppet] - 10https://gerrit.wikimedia.org/r/433996 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [07:13:20] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434442 (owner: 10Marostegui) [07:14:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434442 (owner: 10Marostegui) [07:16:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 after alter table (duration: 01m 19s) [07:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:19] (03CR) 10Vgutierrez: "see inline comments" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/430337 (owner: 10Mark Bergsma) [07:19:25] (03CR) 10jenkins-bot: 
Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434442 (owner: 10Marostegui) [07:21:25] !log Deploy schema change on s8 codfw master (db2045) with replication, this will generate lags on codfw - T191519 T188299 T190148 [07:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:31] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [07:21:31] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [07:21:31] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [07:22:59] (03PS3) 10Elukey: role::druid::analytics::worker: upgrade to Druid 0.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/432582 (https://phabricator.wikimedia.org/T193712) [07:28:22] jouncebot: next [07:28:22] In 5 hour(s) and 31 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1300) [07:28:37] Hmm, I forgot to schedule my slot perhaps.... [07:30:19] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11264/druid1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/432582 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [07:34:21] 10Operations, 10Discovery, 10Maps, 10hardware-requests: Increase storage on maps* servers - https://phabricator.wikimedia.org/T195285#4221846 (10Gehel) [07:40:56] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.092 second response time [07:42:48] !log stop and reimage db2057 [07:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:33] (03CR) 10Urbanecm: [C: 04-1] "See @A2093064'inline comments." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [07:50:12] (03PS3) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) [07:50:20] !log stop and reimage db2050 [07:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.101 second response time [07:59:02] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.139 second response time [08:07:57] jouncebot refresh [08:07:58] I refreshed my knowledge about deployments. [08:08:01] jouncebot: next [08:08:01] In 2 hour(s) and 51 minute(s): WikibaseLexeme wikidata.org deployment Preparation & Backports (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1100) [08:08:05] =] [08:11:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.121 second response time [08:15:07] (03PS2) 10Gehel: Move process-osm-data example URLs to https [puppet] - 10https://gerrit.wikimedia.org/r/433776 (https://phabricator.wikimedia.org/T190193) (owner: 10Pnorman) [08:15:59] (03CR) 10Gehel: [C: 032] Move process-osm-data example URLs to https [puppet] - 10https://gerrit.wikimedia.org/r/433776 (https://phabricator.wikimedia.org/T190193) (owner: 10Pnorman) [08:22:10] (03CR) 10Elukey: "> The API completely changed in varnish 5.2 it is an incompatible" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 (owner: 10R4q3NWnUx2CEhVyr) [08:25:35] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [08:27:49] jouncebot: 
next [08:27:49] In 2 hour(s) and 32 minute(s): WikibaseLexeme wikidata.org deployment Preparation & Backports (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1100) [08:30:44] (03PS1) 10Addshore: Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434451 (https://phabricator.wikimedia.org/T194248) [08:33:02] (03PS1) 10Addshore: Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) [08:33:38] (03PS1) 10Addshore: Enable WikibaseLexeme on wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434453 (https://phabricator.wikimedia.org/T191457) [08:36:04] (03CR) 10WMDE-leszek: [C: 031] Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434451 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [08:36:16] (03CR) 10WMDE-leszek: [C: 031] Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [08:40:10] (03PS1) 10Volans: Icinga: add check and retry intervals for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/434455 (https://phabricator.wikimedia.org/T173050) [08:40:12] (03PS1) 10Volans: Icinga: reduce checks frequency for SMART and EDAC [puppet] - 10https://gerrit.wikimedia.org/r/434456 (https://phabricator.wikimedia.org/T173050) [09:00:50] (03PS4) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) [09:03:15] (03CR) 10Vgutierrez: [C: 04-1] "please provide a better commit message <3" [debs/pybal] - 10https://gerrit.wikimedia.org/r/433736 (owner: 10Mark Bergsma) [09:05:43] (03CR) 10Volans: "Compiler output looks sane: https://puppet-compiler.wmflabs.org/compiler02/11265/" 
[puppet] - 10https://gerrit.wikimedia.org/r/434455 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [09:13:03] (03CR) 10Volans: "Compiler available here: https://puppet-compiler.wmflabs.org/compiler02/11266/" [puppet] - 10https://gerrit.wikimedia.org/r/434456 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [09:14:38] (03PS1) 10ArielGlenn: split up dumps temp dir into subdirs [dumps] - 10https://gerrit.wikimedia.org/r/434461 (https://phabricator.wikimedia.org/T182572) [09:15:51] (03PS5) 10Addshore: Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 [09:16:47] (03PS1) 10Gehel: cassandra: cassandra-tools always make sense to be available [puppet] - 10https://gerrit.wikimedia.org/r/434462 [09:18:14] (03PS4) 10Addshore: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 [09:18:34] o/ all [09:19:03] Morning apergos! How would you feel about giving https://gerrit.wikimedia.org/r/#/c/395968/ a quick look over and +2ing if you are happy? [09:19:22] It's just adding the same logging for testwikidatawiki as is currently used for wikidatawiki for the wikibase / wikidata dispatching cron [09:19:26] I shall at least have a look :-) [09:19:31] thanks! 
:) [09:20:35] I may also have another one on the same topic in a while if all goes well [09:22:11] !log upgrading mw1280-mw1290 to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [09:22:13] yep I'll merge this through [09:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:33] (03CR) 10ArielGlenn: [C: 032] Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 (owner: 10Addshore) [09:22:48] apergos: great, thanks :) [09:22:58] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS38930/IPv6: Active, AS38930/IPv4: Active [09:23:25] you'll need to wait up to 30 mins for the puppet run to go aroun [09:23:26] d [09:23:50] apergos: that's fine, I'll check the log file appears in the next hour :) [09:23:57] cool [09:23:58] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.105 second response time [09:24:35] apergos: the second change for later would be https://gerrit.wikimedia.org/r/#/c/430923/ which is one part of allowing modifying dispatching settings from mediawiki-config [09:24:53] but that isn't ready yet as it requires the config to be merged in mediawiki-config first [09:25:25] okey dokey [09:26:10] The patch adding the settings to wikibase was actually added at the end of last year, but it has taken a while for me to find the time to push to get it deployed :) https://github.com/wikimedia/mediawiki-extensions-Wikibase/commit/d8c68e7ca156e725786334894762aa185ed60641#diff-67296fac8baafd0e1d37c172a2bbf91d [09:26:26] But, if needed I can wave my arms around in here later for that one :) [09:27:25] heh [09:27:28] ok, poke me when needed [09:27:32] thanks! 
[09:31:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1951 bytes in 0.117 second response time [09:31:49] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 7, down: 0, shutdown: 2 [09:32:24] (03CR) 10Addshore: [C: 04-2] "Not till wednesday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434453 (https://phabricator.wikimedia.org/T191457) (owner: 10Addshore) [09:34:18] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0 [09:46:53] (03PS1) 10Jcrespo: mariadb: Reimage db2043 into stretch [puppet] - 10https://gerrit.wikimedia.org/r/434463 [09:47:13] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db2043 into stretch [puppet] - 10https://gerrit.wikimedia.org/r/434463 (owner: 10Jcrespo) [09:47:53] !log upgrading mw123[8-9], mw1266-mw1275 to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [09:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:29] (03CR) 10Addshore: "Looks like this went perfectly" [puppet] - 10https://gerrit.wikimedia.org/r/395968 (owner: 10Addshore) [09:48:55] the router interface alert is known maintenance [09:49:56] * addshore has a question about icinga, after reading https://wikitech.wikimedia.org/wiki/Icinga#Authentication I still can't tell, but who has the technical ability to add downtime etc and ack alerts? just ops or? [09:51:59] I don't think any of that applies nowadays [09:52:24] ah yeah [09:52:30] that's very out of date indeed [09:52:33] it's all ldap based [09:52:34] the permissions to acking alerts/downtime etc. are maintained in modules/icinga/files/cgi.cfg [09:53:00] addshore: "This page may be outdated or contain incorrect details. Please update it if you can. " [09:53:27] thanks all! 
*goes to look at that file* [09:54:39] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 [09:56:11] !log stop and reimage db2043 [09:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:35] So, I see authorized_for_all_service_command, is there nothing to authorize people for specific services? [09:58:06] oh wait "By default, users can only issue commands for hosts or services"..... [09:58:23] not so much, no [09:58:30] I don't believe I am named as a direct contact for wikidata dispatching, but instead via a mailing list? [09:59:53] https://github.com/wikimedia/puppet/blob/aa116a756e86192ab93b14860320e4c03e2081b7/modules/icinga/manifests/monitor/wikidata.pp#L15 *goes to find the wikidata contact group* [10:00:08] PROBLEM - MariaDB Slave IO: s3 on db2036 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2043.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2043.codfw.wmnet (111 Connection refused) [10:00:33] that is me, for some reason I may have skipped that downtime [10:00:58] (see log) [10:01:59] (03CR) 10Volans: [C: 04-1] "Nice! I have few comments and some questions inline." (036 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [10:02:07] aah, looks like contact definitions are in the private repo? https://github.com/wikimedia/puppet/blob/e059236cdd1703046e884c278b85cb0241482589/modules/nagios_common/templates/contacts.cfg.erb [10:02:34] Would anyone be able to check if I am on the "wikidata" contact list? If not perhaps I should file a ticket [10:04:09] addshore: I can see your contact in the contact list but I don't see it added to any existing group [10:05:01] volans: I see! 
[10:07:13] but i guess the "wikidata-monitoring" contact is in the group, and I get those emails :) [10:07:43] yes, wikidata-monitoring is in the wikidata contactgroup [10:08:03] whats the process for getting my actual user added to that group? a phab ticket? [10:08:36] we should probably add some more wmde people that deal with dispatching and deploying to it to allow them to add downtime and make it shhh [10:08:39] (03PS1) 10Ppchelko: Switch cross-wiki posting jobs for everything. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434466 (https://phabricator.wikimedia.org/T190327) [10:08:47] if that will actually happen by adding the explicit users to the group [10:09:39] I can't even find that (the wikidata contactgroup members) [10:09:53] yes, a ticket is the way to go [10:09:56] addshore: yeah a task should be ok, add operations and monitoring tags please [10:10:06] will do, thanks! [10:11:15] yw [10:12:09] ah I see. hoo is the only person in that list, yeah [10:13:31] (03PS2) 10Ppchelko: Switch cross-wiki posting jobs for everything. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/434466 (https://phabricator.wikimedia.org/T190327) [10:16:04] 10Operations, 10Wikidata, 10monitoring: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4222037 (10Addshore) [10:16:22] 10Operations, 10Wikidata, 10monitoring, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4222047 (10Addshore) [10:16:54] 10Operations, 10Wikidata, 10monitoring, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4222037 (10Addshore) [10:18:34] (03CR) 10Vgutierrez: Implement kubernetes configuration observer (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [10:19:27] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4222050 (10Pchelolo) [10:19:58] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg] [10:23:23] !log upgrading snapshot hosts to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [10:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] !log upgrading video scalers in codfw to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [10:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:45] jouncebot: next [10:24:45] In 0 hour(s) and 35 minute(s): WikibaseLexeme wikidata.org deployment Preparation & Backports (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1100) [10:25:02] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/433333 (https://phabricator.wikimedia.org/T194724) (owner: 10Ema) [10:26:58] (03CR) 10Mobrovac: [C: 032] Switch cross-wiki posting jobs for everything. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434466 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [10:27:21] * mobrovac taking over tin for 10 mins [10:28:15] (03Merged) 10jenkins-bot: Switch cross-wiki posting jobs for everything. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434466 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [10:30:31] (03CR) 10jenkins-bot: Switch cross-wiki posting jobs for everything. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/434466 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [10:30:45] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b45cd3b]: Switch cross-wiki posting jobs for everything T175210 [10:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:49] T175210: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210 [10:31:48] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@b45cd3b]: Switch cross-wiki posting jobs for everything T175210 (duration: 01m 03s) [10:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:03] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch cross-wiki posting jobs to EventBus - T190327 (duration: 01m 18s) [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:07] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [10:32:36] * mobrovac {{done}} [10:33:21] [= [10:34:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1976 bytes in 0.094 second response time [10:39:08] (03PS2) 10Ppchelko: Switch all jobs for everything except wikipedia, commons and wikidata. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) [10:40:59] (03PS1) 10ArielGlenn: keep around dump prefetch files from small wikis too [puppet] - 10https://gerrit.wikimedia.org/r/434468 (https://phabricator.wikimedia.org/T194124) [10:41:43] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.104 second response time [10:42:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch needs a few tweaks, see the inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:44:24] <_joe_> did someone do something with the jobqueue? [10:44:28] <_joe_> the old one I mean [10:44:47] <_joe_> we were hoarding wrong jobs, now they've almost all dropped away [10:45:00] 10:30 ppchelko@tin: Started deploy [cpjobqueue/deploy@b45cd3b]: Switch cross-wiki posting jobs for everything T175210 [10:45:01] T175210: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210 [10:45:02] <_joe_> I never had time to look into it until today [10:45:06] 10:32 mobrovac@tin: Synchronized wmf-config/jobqueue.php: Switch cross-wiki posting jobs to EventBus - T190327 (duration: 01m 18s) [10:45:07] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [10:45:21] that just happened [10:45:25] <_joe_> addshore: yeah I'm not sure how that relates to what I'm seeing [10:45:33] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:45:54] <_joe_> as the drop happened more than one hour ago, really [10:46:11] <_joe_> uhm no, the graph was confusing [10:46:18] <_joe_> it happened exactly with that release [10:46:37] <_joe_> (we still have this hard limit of 55.8k jobs that seem 
to be stuck in the queue forever) [10:46:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1974 bytes in 0.115 second response time [10:47:14] <_joe_> oh such joy you gave me, redis-based jobqueue. So many deep rabbitholes you sent me into! You'll be missed [10:47:55] :D [10:48:15] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4222078 (10Pchelolo) [10:51:10] (03PS4) 10Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port [puppet] - 10https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) [10:52:29] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: add support for listening on the ssl port [puppet] - 10https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [11:00:04] addshore: How many deployers does it take to do WikibaseLexeme wikidata.org deployment Preparation & Backports deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1100). 
[11:00:10] o/ [11:00:40] * addshore begins [11:01:33] (03CR) 10Addshore: [C: 032] Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434451 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [11:02:41] (03Merged) 10jenkins-bot: Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434451 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [11:02:57] (03CR) 10jenkins-bot: Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434451 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [11:04:00] !log ppchelko@tin Started restart [cpjobqueue/deploy@b45cd3b]: KafkaConsumer is not connected error [11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:45] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:434451|Add 171081 to wmgWikibaseIdBlacklist for wikibase-lexeme]] T194248 T187060 (duration: 01m 19s) [11:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:51] T187060: reserved Lexeme IDs - https://phabricator.wikimedia.org/T187060 [11:06:51] T194248: Prepare WikibaseLexeme config for wikidata.org - https://phabricator.wikimedia.org/T194248 [11:10:21] !log addshore@tin Started scap: WikimediaMessages - [[gerrit:434454|wikidata-copyright, include the lexeme namespace]] T169333 [11:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:25] T169333: Update license information on Wikidata to make Lexeme namespace CC-0 - https://phabricator.wikimedia.org/T169333 [11:12:37] (03PS1) 10Alexandros Kosiaris: mathoid: Install ingress networkpolicy policy if enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/434473 [11:12:39] (03PS1) 10Alexandros Kosiaris: mathoid: Disable monitoring by default [deployment-charts] - 
10https://gerrit.wikimedia.org/r/434474 [11:12:41] (03PS1) 10Alexandros Kosiaris: Very first draft of a graphoid helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/434475 [11:12:44] (03PS1) 10Alexandros Kosiaris: scaffolding: Disabling monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434476 [11:16:11] (03CR) 10Giuseppe Lavagetto: [C: 031] mathoid: Install ingress networkpolicy policy if enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/434473 (owner: 10Alexandros Kosiaris) [11:17:51] (03PS2) 10ArielGlenn: keep around dump prefetch files from small wikis too [puppet] - 10https://gerrit.wikimedia.org/r/434468 (https://phabricator.wikimedia.org/T194124) [11:18:36] (03CR) 10ArielGlenn: [C: 032] keep around dump prefetch files from small wikis too [puppet] - 10https://gerrit.wikimedia.org/r/434468 (https://phabricator.wikimedia.org/T194124) (owner: 10ArielGlenn) [11:18:49] (03PS1) 10Giuseppe Lavagetto: mcrouter: fix default file to work with systemd [puppet] - 10https://gerrit.wikimedia.org/r/434478 [11:20:21] <_joe_> akosiaris: why disable monitoring by default? [11:20:38] <_joe_> it's useful in development as well to have prometheus metrics exposed, I think [11:20:43] _joe_: cause insufficient CPU messages in minikube show up very very easily [11:20:48] <_joe_> but I can be convinced otherwise [11:20:51] <_joe_> sigh, ok [11:21:06] <_joe_> it's a bit of a pity, but point taken [11:21:24] running 1 pod for monitoring and having a deployment doing a simple update can cause that :-( [11:22:47] I am also wondering if I should change the update strategy for minikube from RollingUpdate to Recreate [11:22:51] (03PS1) 10ArielGlenn: add addshore and aude to wikidata contact group [puppet] - 10https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) [11:23:16] <_joe_> that might help in dev [11:23:22] yup [11:23:31] <_joe_> akosiaris: it's impossible to overcommit on cpus?
[11:23:42] <_joe_> in minikube, I mean [11:23:56] heh, container limits don't work like that [11:24:09] it's not like you say "I want 1 CPU" and you get 1 cpu [11:24:16] <_joe_> yes [11:24:18] despite what it may appear like [11:24:27] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4222125 (10ArielGlenn) Before I add @aude and @Ladsgroup let's make sure they want to be added, I... [11:24:38] <_joe_> what I mean is that I surely run docker containers "requesting" more cpus that I have [11:24:51] <_joe_> *than [11:25:00] it's kubernetes that complains, not docker [11:25:25] (03PS2) 10ArielGlenn: add addshore, aude, ladsgroup to wikidata contact group [puppet] - 10https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) [11:26:39] the issue is kubernetes requests "absolute" quantities [11:27:12] so say you say 1 CPU. 
Whether you have a single-core, a quad-core or a million-core machine [11:27:30] it's the exact same thing (for kubernetes that is) [11:27:51] so, underpowered machines (like minikube running on a single CPU laptop) [11:27:58] very quickly have problems [11:28:28] of course that is a leaky abstraction cause docker right now does relative CPUs since that's what the current cgroups allow [11:28:41] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: fix default file to work with systemd [puppet] - 10https://gerrit.wikimedia.org/r/434478 (owner: 10Giuseppe Lavagetto) [11:28:47] (03PS2) 10Giuseppe Lavagetto: mcrouter: fix default file to work with systemd [puppet] - 10https://gerrit.wikimedia.org/r/434478 [11:28:54] but that's to be changed (don't ask for an ETA) [11:31:11] !log upgrading application servers in deployment-prep to wikidiff 1.7.0 (T190717) [11:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:15] T190717: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717 [11:39:58] * addshore can't tell if his scap has frozen [11:40:16] 11:13:15 Updating LocalisationCache for 1.32.0-wmf.4 using 10 thread(s) [11:40:31] probably shouldn't have taken the 20+ mins that it has so far... [11:41:52] addshore: I can see a bunch of subprocesses going on [11:42:09] MWScript.php rebuildLocalisationCache.php ... [11:44:01] * addshore has never looked at the sub procs of the full scap before, I wonder if --wiki is cawikibooks for a reason, or if it should be cycling through wikis? [11:47:21] addshore: it takes ~40min by the looks [11:47:22] https://phabricator.wikimedia.org/T191921 [11:47:43] (03PS3) 10Arturo Borrero Gonzalez: reprepro: fetch mono suite from upstream apt repo [puppet] - 10https://gerrit.wikimedia.org/r/433996 (https://phabricator.wikimedia.org/T194665) [11:47:57] right, apparently I just haven't done a full scap in a while then!
[11:51:31] (03CR) 10Arturo Borrero Gonzalez: [C: 032] reprepro: fetch mono suite from upstream apt repo [puppet] - 10https://gerrit.wikimedia.org/r/433996 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [11:51:58] this is the scap on hhvm issue [11:52:39] yup, 39 mins up and it moved on to the next step :) [11:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 581.65 seconds [11:56:49] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4222177 (10MoritzMuehlenhoff) >>! In T190717#4213084, @WMDE-Fisch wrote: > So for deployment on beta nothing special is needed and it can be... [11:58:13] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.50 seconds [12:07:07] !log addshore@tin Finished scap: WikimediaMessages - [[gerrit:434454|wikidata-copyright, include the lexeme namespace]] T169333 (duration: 56m 45s) [12:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:11] T169333: Update license information on Wikidata to make Lexeme namespace CC-0 - https://phabricator.wikimedia.org/T169333 [12:10:22] (03CR) 10Alexandros Kosiaris: [C: 031] "I would argue that even 30 minutes between checks is too often (it's not like smart data is that trustworthy that acting immediately is re" [puppet] - 10https://gerrit.wikimedia.org/r/434456 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [12:10:35] (03PS2) 10Ema: varnish: use systemd::service instead of base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/433333 (https://phabricator.wikimedia.org/T194724) [12:10:45] (03CR) 10Alexandros Kosiaris: [C: 031] Icinga: add check and retry intervals for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/434455 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [12:11:13] (03CR) 10Ema: [C: 
032] varnish: use systemd::service instead of base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/433333 (https://phabricator.wikimedia.org/T194724) (owner: 10Ema) [12:11:14] !log installing imagemagick security updates [12:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] (03PS2) 10Addshore: Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) [12:11:38] (03CR) 10Addshore: [C: 032] Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [12:13:21] (03Merged) 10jenkins-bot: Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [12:15:23] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:434452|Set wgLexemeLanguageCodePropertyId for wikidatawiki]] T194248 (duration: 01m 19s) [12:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:28] T194248: Prepare WikibaseLexeme config for wikidata.org - https://phabricator.wikimedia.org/T194248 [12:18:48] !log set `unchecked_tombstone_compaction=true` for maps eqiad - T194966 [12:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:53] T194966: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966 [12:28:12] (03CR) 10jenkins-bot: Set wgLexemeLanguageCodePropertyId for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434452 (https://phabricator.wikimedia.org/T194248) (owner: 10Addshore) [12:32:59] (03PS1) 10Krinkle: profiler: Document the supported XWD attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434487 (https://phabricator.wikimedia.org/T176916) [12:33:01] (03PS1) 10Krinkle: 
profiler-labs: Add experimental code to sample with xhprof (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434488 (https://phabricator.wikimedia.org/T176916) [12:33:53] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase/repo: [[gerrit:434459|API: when validating change op make sure the edited entity is also validated]] T190928 (duration: 01m 52s) [12:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] addshore: I'd like to merge the above two patches for beta cluster. let me know if/when it's safe to merge. [12:33:57] T190928: Forms via wbeditentity: Edit - https://phabricator.wikimedia.org/T190928 [12:34:17] Krinkle: feel free to hit them both now [12:34:20] (03CR) 10Krinkle: [C: 032] profiler: Document the supported XWD attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434487 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:34:25] (03CR) 10Krinkle: [C: 032] profiler-labs: Add experimental code to sample with xhprof (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434488 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:34:28] Thanks! 
[12:34:34] * addshore is only doing backports within the WikibaseLexeme extension now :) [12:35:10] * addshore goes to +2 his chain [12:36:02] (03Merged) 10jenkins-bot: profiler: Document the supported XWD attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434487 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:36:04] (03Merged) 10jenkins-bot: profiler-labs: Add experimental code to sample with xhprof (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434488 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:37:26] Urbanecm: you stole the whole of mid day EU swat today :D [12:38:23] (03CR) 10jenkins-bot: profiler: Document the supported XWD attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434487 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:38:26] (03CR) 10jenkins-bot: profiler-labs: Add experimental code to sample with xhprof (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434488 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:39:03] Krinkle: want me to sync wmf-config in prod for you for those? (if it is beta only) [12:39:14] addshore: Yeah, that's fine. thanks [12:39:34] One is a comment in a prod file, the other is beta-only. 
[12:39:45] * addshore double checks them [12:40:11] looks ood [12:40:12] good [12:40:27] (03PS5) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) [12:41:03] http://tardis.wikia.com/wiki/Ood [12:41:10] hehe :D [12:42:20] !log addshore@tin Synchronized wmf-config: [[gerrit:434487|#1]] [[gerrit:434488|#2]] BETA ONLY profiler stuff (duration: 01m 20s) [12:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:50] (03PS6) 10Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) [12:46:23] (03CR) 10Vgutierrez: Implement kubernetes configuration observer (033 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [12:47:14] addshore, yes, I did [12:47:16] Do you have anything urgent? [12:47:18] (or, do you wish to add something?) [12:47:29] nope, I'll probably just be doing swat :) [12:47:52] I also might slightly overrun the slot I'm currently deploying in :/ [12:48:04] If you can do the swat as well :D [12:48:06] * addshore twiddles thumbs waiting for jenkins [12:48:59] Is there any other reason why we sometimes aren't able to deploy 6 patches in one SWAT than jenkins? [12:49:29] Looking what you're doing, WikibaseLexeme. That's nice! [12:49:55] WikibaseLexeme is currently deployed anywhere though ;) [12:49:57] *isnt [12:50:13] well, thats a lie, it is on testwikidata [12:50:39] Preparing for WikibaseLexeme is nice as well :D [12:51:08] oh Urbanecm you mean with the link to the topic on gerrit? or? 
[12:51:27] addshore, I'm referring to the window title, "WikibaseLexeme wikidata.org deployment Preparation & Backports" [12:55:39] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/WikibaseLexeme: [[gerrit:434445|Use ChangeOps consistently throughout API]] (duration: 01m 30s) [12:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:58] !log installing xdg-utils security updates on trusty [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:10] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/WikibaseLexeme: [[gerrit:434446|Use the same language validation for representations and lemmas]] (duration: 01m 25s) [12:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:56] (03CR) 10Ema: [C: 031] mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) (owner: 10Krinkle) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] so, I still have 4 to go, so around 10 more mins :) [13:00:18] * Urbanecm is here, what a surprise [13:00:19] but I can SWAT :) [13:01:00] o/ [13:01:04] \o [13:01:15] addshore, I'll wait :). I'm just thinking if deploying for long 4 hours is a good idea... 
[13:01:30] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/WikibaseLexeme: [[gerrit:434447|Lemma validation: language covered in deserializer]] (duration: 01m 30s) [13:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:19] addshore, Urbanecm: looks like you have swat under control, I'm around if you need me [13:02:26] zeljkof: thanks! :) [13:03:16] Urbanecm: It's so far been a lovely stress free morning :D [13:04:54] stress free? Or free stress? [13:06:38] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/WikibaseLexeme: [[gerrit:434448|Handle invalid lexemeId in data when using wbeditentity new=form]] && [[gerrit:434449|Lexeme term languages: codes beyond MW default]] (duration: 01m 28s) [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:45] 1 left [13:06:53] Urbanecm: hehe, stress free [13:07:10] ack, both messages :D [13:07:28] (03PS3) 10Addshore: Revert "Temp rate limit for arwiki due to mass vandalism" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433987 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:07:33] (03PS2) 10Addshore: Enable $wgUseRCPatrol on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433440 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:07:39] (03PS2) 10Addshore: Change liwikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432551 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:07:43] (03PS2) 10Alexandros Kosiaris: tcpircbot: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434428 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [13:07:45] (03PS2) 10Addshore: Change logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432549 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [13:07:51] (03PS2) 10Addshore: Upload new logos for yiwikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:08:02] (03PS2) 10Addshore: Use uploaded HD logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: 10Urbanecm) [13:08:15] * addshore chains the swat patches together [13:08:27] Ok [13:10:22] (03CR) 10Alexandros Kosiaris: [C: 032] tcpircbot: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434428 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [13:10:48] ema: When could the mtail change be deployed? [13:13:01] <_joe_> akosiaris: that patch is wrong [13:13:05] <_joe_> I was about to comment [13:13:30] !log upgrading labweb servers to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [13:13:31] <_joe_> I guess I'll fix it myself :) [13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:05] addshore, how's the SWAT going? [13:14:07] Krinkle: in ~ 1 hour if you're around? [13:14:15] Sounds good, thanks! [13:14:24] Urbanecm: just waiting for the last bit of CI for the last WikibaseLexeme patch [13:14:27] should be ~3 mins :) [13:14:33] The long-running one? :) [13:14:47] (already ~40 mins) [13:15:14] hehe, it does say "39 mins" but thats since I hit +2, there were 6 changes to merge in that time, its a shame the WikibaseLexeme stuff isn't quite using quibble yet! [13:15:25] so it had to wait for the cloudvps pool of machines for the CI [13:15:47] quibble is...what? 
[13:16:16] (03CR) 10Addshore: [C: 032] Revert "Temp rate limit for arwiki due to mass vandalism" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433987 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:17:10] (03PS1) 10Ottomata: Use same partman for stat1004 as other stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/434490 (https://phabricator.wikimedia.org/T192640) [13:17:35] (03PS1) 10Giuseppe Lavagetto: tcpircbot: restart upon config changes [puppet] - 10https://gerrit.wikimedia.org/r/434491 [13:17:50] <_joe_> akosiaris: ^^ [13:17:55] (03Merged) 10jenkins-bot: Revert "Temp rate limit for arwiki due to mass vandalism" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433987 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:18:05] Urbanecm: a great thing, but more importantly, it doesn't run on the disposable VMs [13:18:13] (03CR) 10Alexandros Kosiaris: [C: 032] tcpircbot: restart upon config changes [puppet] - 10https://gerrit.wikimedia.org/r/434491 (owner: 10Giuseppe Lavagetto) [13:18:30] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/WikibaseLexeme: [[gerrit:434450|Add L171081 to clearBlacklistedLexemes]] (duration: 01m 28s) [13:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:44] Ehm...the CI thing does create machine, install everything and then it just throws again? [13:18:50] *away [13:19:21] no, containers are ephemeral by nature [13:19:40] as in they get thrown away almost on their own :-) [13:19:54] but yes the 10kft description is correct [13:20:09] Urbanecm: syncing your first one now [13:20:11] Quite expensive testing... [13:20:12] addshore, ack [13:20:20] (03CR) 10jenkins-bot: Revert "Temp rate limit for arwiki due to mass vandalism" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433987 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:20:27] _joe_: thanks. 
merged [13:21:01] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:433987|Revert: Temp rate limit for arwiki due to mass vandalism]] T192668 (duration: 01m 18s) [13:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:18] T192668: Mass vandalism in ar.wikipedia (throttle edits using wgRateLimits) - https://phabricator.wikimedia.org/T192668 [13:21:24] (03CR) 10Addshore: [C: 032] Enable $wgUseRCPatrol on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433440 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:21:39] Urbanecm: will you be able to verify this second one on mwdebug1002? [13:21:49] addshore, no, I don't have patroller [13:21:52] ack [13:23:11] (03Merged) 10jenkins-bot: Enable $wgUseRCPatrol on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433440 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:24:18] Urbanecm: I checked on mwdebug1002 and nothing looked on fire, so syncing [13:25:05] On the other hand, SWAT stands for "Setting Wikis Ablaze Team" and the SWAT team is responsible for breaking the site on a regular basis ;) [13:25:10] https://wikitech.wikimedia.org/wiki/SWAT_deploys#Humour [13:25:18] So it doesn't matter :D [13:25:26] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:433440|Enable $wgUseRCPatrol on azwiki]] T194389 (duration: 01m 20s) [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:37] T194389: Enable $wgUseRCPatrol on azwiki - https://phabricator.wikimedia.org/T194389 [13:26:00] !log upgrading API servers in codfw to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [13:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] (03CR) 10Addshore: [C: 032] Change liwikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432551 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:26:39] (03CR) 
10Addshore: [C: 032] Change logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432549 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [13:27:11] (03CR) 10jenkins-bot: Enable $wgUseRCPatrol on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433440 (https://phabricator.wikimedia.org/T194389) (owner: 10Urbanecm) [13:27:40] Urbanecm: lets do these 2 lovely logo changes at the same time :) and we can check them on mwdebug :) [13:28:01] addshore, I uploaded 3 logo changes in 4 commits [13:28:03] (03Merged) 10jenkins-bot: Change liwikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432551 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:28:07] (03PS1) 10Elukey: role::analytics_cluster::coordinator: remove refinery-relaunch-banner-streaming [puppet] - 10https://gerrit.wikimedia.org/r/434493 (https://phabricator.wikimedia.org/T193712) [13:28:15] yup, I'll do the final 2 also together :) [13:28:21] (03Merged) 10jenkins-bot: Change logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432549 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [13:28:29] Ok [13:28:29] First, wikimania2018 and liwikibooks [13:28:50] both on mwdebug1002 now [13:28:57] (03PS2) 10Elukey: role::analytics_cluster::coordinator: remove refinery-relaunch-banner-streaming [puppet] - 10https://gerrit.wikimedia.org/r/434493 (https://phabricator.wikimedia.org/T193712) [13:29:03] ack [13:30:16] https://ctrlv.cz/fq2a isn't what I expected [13:30:46] Urbanecm: hah [13:30:58] Urbanecm: revert that one? [13:31:07] addshore, I can fix it. [13:31:11] So I'll upload a follow-up [13:31:15] okay! 
[13:31:21] Urbanecm: how does liwikibooks look? [13:32:15] (03CR) 10Elukey: [C: 032] role::analytics_cluster::coordinator: remove refinery-relaunch-banner-streaming [puppet] - 10https://gerrit.wikimedia.org/r/434493 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [13:32:23] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [13:32:25] I thought they had a line in wgLogo when they have a custom file... [13:32:27] But they do not [13:32:31] Follow-up needed as well [13:32:33] Sorry! [13:32:39] Urbanecm: no problem :) [13:32:59] We have made good time :) [13:33:37] (03PS1) 10Urbanecm: Use uploaded logo in wgLogo for liwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434494 (https://phabricator.wikimedia.org/T193680) [13:33:46] ^^follow up #1 uploaded ^^ [13:34:00] hmm https://meta.wikimedia.org/wiki/Privacy_policy is not loading for me [13:34:17] (03CR) 10jenkins-bot: Change liwikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432551 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:34:21] (03CR) 10jenkins-bot: Change logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432549 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [13:34:45] (03CR) 10Addshore: [C: 032] Use uploaded logo in wgLogo for liwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434494 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:35:24] https://meta.wikimedia.org/wiki/ loads (slowly) but https://meta.wikimedia.org/wiki/Privacy_policy just stays stuck at loading. [13:35:48] yeah, en.wp is crawling along for me.
[13:35:54] (03PS1) 10Urbanecm: Use correct logo-size for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434495 (https://phabricator.wikimedia.org/T194340) [13:35:56] *looks around* [13:36:07] ah en.wp loaded slowly too for me. [13:36:10] I can see a bunch of db errors [13:36:12] paladox, confirming [13:36:15] (03Merged) 10jenkins-bot: Use uploaded logo in wgLogo for liwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434494 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:36:27] spike in lag or issue with replication [13:36:37] paladox, can you search on phabricator and if there's nothing, create an UBN task? [13:36:44] ok [13:36:45] I think that's better than splitting addshore's attention [13:36:57] and a proper error now. "Request from 143.117.78.137 via cp1052 cp1052, Varnish XID 396001400 [13:36:57] Error: 503, Backend fetch failed at Tue, 22 May 2018 13:36:31 GMT" [13:36:58] (03CR) 10Ottomata: [C: 032] Use same partman for stat1004 as other stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/434490 (https://phabricator.wikimedia.org/T192640) (owner: 10Ottomata) [13:37:01] (03PS2) 10Ottomata: Use same partman for stat1004 as other stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/434490 (https://phabricator.wikimedia.org/T192640) [13:37:03] (03CR) 10Ottomata: [V: 032 C: 032] Use same partman for stat1004 as other stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/434490 (https://phabricator.wikimedia.org/T192640) (owner: 10Ottomata) [13:37:20] same here (Error: 503, Backend fetch failed at Tue, 22 May 2018 13:37:03 GMT) [13:37:36] seeing things in exception log too *looks* [13:37:46] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) [13:38:03] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine
translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:38:04] 10Operations: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222298 (10Paladox) [13:38:12] Urbanecm NotASpy ^^ [13:38:14] 10Operations: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222308 (10Paladox) p:05Triage>03Unbreak! [13:38:29] jynus: ^^ [13:38:39] !log upload druid 0.11 debs to jessie|stretch wikimedia [13:38:53] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:39:03] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:39:13] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:39:13] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:39:13] PROBLEM - cxserver endpoints health on scb2003 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:39:14] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [13:39:23] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [13:39:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:39:32] 10Operations, 10DBA: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222312 (10Urbanecm) Confirming from my country (Czech Republic). @addshore told there are some DB errors. [13:39:33] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:39:34] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [13:39:42] something tells me it is not the database [13:39:46] I assume you know that Commons ded? 
[13:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] 10Operations, 10DBA: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222298 (10Xaosflux) 503 errors loading various projects: Request from 12.15.146.254 via cp1055 cp1055, Varnish XID 593143055 Error: 503, Backend fetch failed at Tue, 22 May 2018 13:39:08 GMT [13:39:53] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:40:13] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:40:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:14] 10Operations, 10DBA: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222316 (10Paladox) Im from the United Kingdom. [13:40:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:27] started at 13:31 [13:40:32] 10Operations, 10DBA: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222298 (10Marostegui) You've got some examples of the DB errors? 
[13:40:37] (03CR) 10jenkins-bot: Use uploaded logo in wgLogo for liwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434494 (https://phabricator.wikimedia.org/T193680) (owner: 10Urbanecm) [13:40:38] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2685 bytes in 0.007 second response time [13:40:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:39] were you deploying something? [13:40:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:40:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:40:55] 10Operations, 10DBA: Wikipedia and meta wiki are loading very slowly - https://phabricator.wikimedia.org/T195293#4222319 (10Urbanecm) >>! In T195293#4222314, @Xaosflux wrote: > 503 errors loading various projects: > > Request from <> via cp1055 cp1055, Varnish XID 593143055 > Error: 503, Backend fetc... [13:41:01] I have deployed a few things prior, but nothing that should remotely cause this [13:41:01] what's up? [13:41:03] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [13:41:04] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:41:09] 10Operations, 10DBA: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222320 (10Xaosflux) [13:41:13] hmm [13:41:13] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:41:14] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:41:14] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [13:41:16] The last things were [13:41:20] 13:25 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable $wgUseRCPatrol on azwiki T194389 (duration: 01m 20s) [13:41:20] T194389: Enable $wgUseRCPatrol on azwiki - https://phabricator.wikimedia.org/T194389 [13:41:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:41:24] 13:21 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert: Temp rate limit for arwiki due to mass vandalism T192668 (duration: 01m 18s) [13:41:25] T192668: Mass vandalism in ar.wikipedia (throttle edits using wgRateLimits) - https://phabricator.wikimedia.org/T192668 [13:41:43] PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language 
wiki.) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [13:41:46] high 5xx [13:41:58] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17400 bytes in 0.068 second response time [13:41:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:42:12] it started at 13:36 [13:42:23] PROBLEM - cxserver endpoints health on scb1003 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:42:23] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:42:25] I'd say revert if there was a deployment at that time [13:42:30] At 13:31 I see connection errors [13:42:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:42:48] 13:31 looks like the start to me [13:42:53] RECOVERY - cxserver endpoints health on scb2002 is OK: All endpoints are healthy [13:42:54] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [13:43:07] thats between "moritzm: upgrading API servers in codfw to HHVM 3,18.5+dfsg-1+wmf8+deb9u1" and "elukey: upload druid 0.11 debs to jessie|stretch wikimedia" nothing mw related [13:43:18] I don't think it is databases, my bet is on network layer [13:43:23] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:43:33] RECOVERY - cxserver endpoints health on scb1003 is OK: All endpoints are healthy [13:43:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:43:39] queries have dropped massively on db1094 for instance [13:44:00] after getting peaks of connections [13:44:06] can we revert whatever was deployed just to be safe? [13:44:18] we have no database alerts [13:44:30] marostegui: I will revert https://gerrit.wikimedia.org/r/#/c/433440/ [13:44:30] Request from 2606:a000:1017:83fa:ac94:bf7d:6314:f087 via cp1066 cp1066, Varnish XID 159285987 [13:44:30] Error: 503, Backend fetch failed at Tue, 22 May 2018 13:43:42 GMT [13:44:33] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [13:44:35] (03PS1) 10Addshore: Revert "Enable $wgUseRCPatrol on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434497 [13:44:39] (03PS2) 10Addshore: Revert "Enable $wgUseRCPatrol on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434497 [13:44:40] I can't submit edits. 
[13:44:45] (03CR) 10Addshore: [V: 032 C: 032] Revert "Enable $wgUseRCPatrol on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434497 (owner: 10Addshore) [13:44:48] Cyberpower678 hi, see https://phabricator.wikimedia.org/T195293 [13:44:53] addshore: thanks [13:44:54] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [13:45:13] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:45:17] 500 are high, but 5xx are higher [13:45:17] >(Cannot access the database: Cannot access the database: Unknown error (10.64.48.153)) [13:45:24] why is cxserver suffering that much ... [13:45:28] that is db1094 yes [13:45:32] https://tppr.me/8YG7y [13:45:32] 10Operations, 10DBA: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222298 (10stjn) Same in Russian Wikipedia (from Russia): ``` Request from […] via cp1054 cp1054, Varnish XID 382637319 Error: 503, Backend fetch fai... 
[13:45:32] but it can be a consecuence [13:45:33] RECOVERY - cxserver endpoints health on scb2006 is OK: All endpoints are healthy [13:45:38] syncing [13:45:53] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [13:45:54] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [13:46:09] the fatals seem to be at a pretty constant rate [13:46:15] (03CR) 10jenkins-bot: Revert "Enable $wgUseRCPatrol on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434497 (owner: 10Addshore) [13:46:25] it is all related to s7 [13:46:31] the 3 databases with most errors are s7 [13:46:33] {"name":"cxserver","hostname":"scb2001","pid":122,"level":50,"levelPath":"error","msg":"MT processing error: Error: Source text too long: en-fa\n at Yandex.translate [13:46:44] that's also being logged into cxserver [13:46:45] Did someone sneeze while operating the servers. [13:46:47] : [13:46:49] :p [13:46:54] <_joe_> marostegui: and ofc that's making connections pile up on servers [13:46:55] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Revert Enable $wgUseRCPatrol on azwiki (duration: 01m 19s) [13:46:59] <_joe_> which is what I'm seeing [13:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:13] RECOVERY - cxserver endpoints health on scb2003 is OK: All endpoints are healthy [13:47:13] 10Operations, 10DBA, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222361 (10zhuyifei1999) [13:47:18] all the connection errors I am seeing are on s7 [13:47:20] <_joe_> as I was saying in another channel [13:47:23] revert of that patch done, but nothing looks to have been changed [13:47:23] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [13:47:31] <_joe_> all appservers have a ton of piled up connections [13:47:34] 
10Operations, 10DBA, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222298 (10fdans) @Xaosflux be aware that editing the comment doesn't erase the IP info. It's still visible in "View edit h... [13:47:43] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [13:47:44] <_joe_> all waiting for threads that are trying to connect to db1094 [13:47:48] <_joe_> as far as I can see [13:47:48] addshore, will we continue with the SWAT or should I postpone? [13:47:53] <_joe_> marostegui: let's depool it? [13:47:54] please wait [13:47:56] <_joe_> Urbanecm: postpone [13:47:58] on the swat [13:48:01] Urbanecm: postpone [13:48:03] Urbanecm: we won't touch anything just yet [13:48:23] db1094 has 7k connections [13:48:24] RECOVERY - cxserver endpoints health on scb2004 is OK: All endpoints are healthy [13:48:25] _joe_: we have the 3 servers that server main traffic on s7 overloaded, depooling them will make things worse probably [13:48:39] Ok, will do. Thanks for deploy and investigation on the problem! [13:48:43] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [13:48:51] s7 has ar wiki? [13:48:53] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [13:48:55] 10Operations, 10DBA, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222298 (10Xaosflux) I'm aware - it's no big deal. [13:49:07] addshore: no [13:49:16] <_joe_> marostegui: db1079.eqiad.wmnet is another one, right? [13:49:16] addshore: actually yes [13:49:18] it does [13:49:23] ill revert https://gerrit.wikimedia.org/r/#/c/433987/ ? 
[13:49:27] related to arwiki [13:49:27] this is not databases, this is code or network [13:49:28] <_joe_> addshore: please do [13:49:32] (03PS1) 10Addshore: Revert "Revert "Temp rate limit for arwiki due to mass vandalism"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434498 [13:49:36] _joe_: db1094, db1079 and db1086 are the ones getting overloaded [13:49:37] (03PS2) 10Addshore: Revert "Revert "Temp rate limit for arwiki due to mass vandalism"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434498 [13:49:38] <_joe_> jynus: it's the code, pretty clearly [13:49:38] all in s7 [13:49:40] s7 is only problematic because all sections [13:49:40] (03CR) 10Addshore: [V: 032 C: 032] Revert "Revert "Temp rate limit for arwiki due to mass vandalism"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434498 (owner: 10Addshore) [13:49:43] ping it [13:49:55] <_joe_> addshore: heh [13:49:56] see: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=now-3h&to=now [13:50:02] did anything change in MessageCollection::loadReviewInfo ? 
[13:50:03] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [13:50:11] I see select on db1094 that take long time [13:50:14] syncing [13:50:23] volans: not that I have touched [13:50:30] <_joe_> jynus: that's a consequence of all threads on all appservers waiting for a response from db servers in s7 [13:50:46] <_joe_> volans: all selects on those 3 servers are taking a ton of time [13:51:08] 800 errors/s on s7 [13:51:21] <_joe_> and a ton of read rows too [13:51:22] 80 seconds of latency [13:51:29] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Revert Revert Temp rate limit for arwiki due to mass vandalism (duration: 01m 18s) [13:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] right, that's all of swat reverted [13:51:46] <_joe_> it might take some time for everything to recover [13:51:56] <_joe_> one solution would be to aggressively restart appservers [13:52:02] connections are decreasing on db1094 [13:52:31] if the source is gone, I can p kill to speedup the recovery [13:52:32] <_joe_> yes, queues on the appservers are vanishing [13:52:36] 500s going down [13:52:38] <_joe_> we're out of the woods I think [13:52:39] 400s not yet [13:52:43] (03CR) 10jenkins-bot: Revert "Revert "Temp rate limit for arwiki due to mass vandalism"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434498 (owner: 10Addshore) [13:52:44] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:52:53] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:52:53] <_joe_> outage over, I guess [13:52:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:53:03] so... arwiki ? [13:53:04] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:53:05] <_joe_> addshore: thanks for spotting the change that caused it [13:53:08] db1094 still has 5k connections [13:53:10] <_joe_> akosiaris: yes [13:53:20] addshore: yes, nice catch [13:53:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:53:26] response time to manual wp page views seems back to normal, yes [13:53:28] <_joe_> marostegui: kill them [13:53:30] which revert was it? [13:53:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:53:42] jynus: https://gerrit.wikimedia.org/r/#/c/433987/ [13:53:43] jynus: arwiki rate limit was removed, the revert added it back again [13:53:46] <_joe_> jynus: Revert "Revert "Temp rate limit for arwiki due to mass vandalism"" [13:53:54] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:53:54] I will give it a look [13:54:15] kowiki (s7) seems fine now, I had db error until arwiki was reverted :P [13:54:17] <_joe_> I think killing connections to db1094 et al will take us back to total health [13:54:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:54:25] _joe_: yeah, going to kill connections there [13:54:33] <_joe_> the appservers still have a ludicrous load right now [13:54:39] interesting, *goes to look at arwiki* [13:54:40] <_joe_> in terms of occupied threads [13:54:42] 4xx are still high [13:54:53] <_joe_> volans: those are due to some abuser I guess? [13:55:01] _joe_: yes [13:55:06] ah right, saw now the longer term trend [13:55:22] https://phabricator.wikimedia.org/T192668 [13:55:38] (security but I think almost all of you can see it?) [13:55:53] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:56:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:56:25] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=17&fullscreen&orgId=1&from=now-1h&to=now and https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=19&fullscreen&orgId=1&from=now-1h&to=now show you what happened from an application server perspective [13:56:58] <_joe_> since some requests were super slow due to database overload, the appservers were not able to process more requests in parallel [13:57:07] <_joe_> and started queueing the incoming requests [13:57:21] <_joe_> all processing slots got filled up [13:57:48] so the arwiki rate limit was removed since a global limit was supposed to exist. so the global limit didn't work? [13:57:49] * addshore still has some changes on tin to clean up in one way or another [13:58:16] I think some people asked to remove global limits [13:58:18] The issue with arwiki vandalism before though wasn't causing 500s [13:58:27] or was it.... [13:58:34] jynus: https://phabricator.wikimedia.org/T194864 just for Commons [13:58:35] not sure how it ended up [13:58:39] <_joe_> addshore: this was a database overload caused by the vandalism [13:58:42] jynus, if you mean the global limit 90 edits/minute, that wasn't changed [13:58:46] ok [13:58:46] <_joe_> that caused the 500s [13:59:01] <_joe_> Urbanecm: 90 edits/minute per editor, I guess? [13:59:06] _joe_, sure [13:59:10] 10Operations, 10DBA, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222372 (10Paladox) p:05Unbreak!>03High [13:59:15] I am not so sure about the real issue [13:59:19] _joe_: I don't see a large amount of vandalism, so no writes? but it still caused 500s? 
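[editor's sketch] The slot exhaustion _joe_ describes at 13:56:58 through 13:57:21 (slow requests hold processing slots until none are free and new requests queue) can be approximated with Little's law: average occupancy = arrival rate x time each request holds a slot. The snippet below is only a rough illustration under stated assumptions, not a model of the real appserver fleet. The function names, the fleet size of 100 servers, and the traffic figures are hypothetical and are not measurements from this incident.

```python
from typing import Optional


def steady_state_occupancy(arrival_rate_per_s: float, service_time_s: float) -> float:
    """Little's law: average number of slots held by requests in flight."""
    return arrival_rate_per_s * service_time_s


def seconds_until_saturated(
    total_slots: int, slow_rate_per_s: float, slow_service_time_s: float
) -> Optional[float]:
    """Rough time for slow requests alone to occupy every slot.

    Assumes all slots start free of slow requests and ignores other traffic.
    Returns None when the slow traffic can never fill the pool.
    """
    if steady_state_occupancy(slow_rate_per_s, slow_service_time_s) < total_slots:
        return None
    # Until the first slow request finishes, occupancy grows by
    # slow_rate_per_s slots every second.
    return total_slots / slow_rate_per_s


# Hypothetical, illustrative figures only.
HYPOTHETICAL_SERVERS = 100
SLOTS_PER_SERVER = 200
total = HYPOTHETICAL_SERVERS * SLOTS_PER_SERVER

print(steady_state_occupancy(1000, 60))          # slots the slow traffic would want
print(seconds_until_saturated(total, 1000, 60))  # time until every slot is taken
print(seconds_until_saturated(total, 1, 60))     # light slow traffic never saturates
```

Because the slots are shared, a pool filled this way delays requests for every wiki, not only the one whose database section is slow.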
[13:59:30] unless there is an issue with the other rate limiting [13:59:51] and when we removed the arwiki specific one, and the global one got used, things exploded? [14:00:05] addshore: My dear minions, it's time we take the moon! Just kidding. Time for Wikibase dispatch lock manager change deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1400). [14:00:13] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [14:00:33] Connections are still being created [14:00:42] it is not yet back to normal values [14:00:46] (on db1094) [14:00:47] <_joe_> addshore: I don't know, but the correspondence between the patch deployment and the actual problem is quite clear [14:00:59] The log spam level for mw still seems to be about the same level (before the revert) [14:01:12] _joe_, you mean, the arwiki vandalism rate limit removal? [14:01:37] <_joe_> Urbanecm: yes [14:02:04] <_joe_> when addshore deployed the revert, issues started resolving right away [14:02:12] But why... [14:02:23] <_joe_> I'm not sure, we need to figure this out [14:02:23] Why was enwiki affected by arwiki patch? 
[14:02:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:02:25] that's what to figure out [14:02:33] <_joe_> Urbanecm: that's pretty easy to explain [14:02:47] <_joe_> we allow ~ 200 requests per appserver to run in parallel [14:03:00] enwiki is on s1, arwiki on s7 [14:03:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:03:06] <_joe_> let me finish :) [14:03:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:03:16] Sure, I just was finishing my own wondering :D [14:03:23] <_joe_> if we have each request to arwiki take 1 minute to be processed [14:03:30] <_joe_> and we get 1000 per second [14:03:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:03:54] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:03:56] <_joe_> in the turn of one minute, we will have all available slots dedicated to serving very slow requests to arwiki [14:04:05] <_joe_> those slots are shared by all wikis [14:04:14] <_joe_> the effect is that requests pile up in a queue [14:04:15] 
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1086&var-port=9104&from=1526995449165&to=1526997638474 [14:04:25] maybe it was the Revert Enable $wgUseRCPatrol on azwiki ? [14:04:29] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=19&fullscreen&orgId=1&from=now-1h&to=now [14:04:46] don't know, both matches the times [14:05:05] <_joe_> marostegui: possible, but then why I saw all the connections hanging on s7? [14:05:12] <_joe_> is azwiki on s7 as well? [14:05:16] no [14:05:17] on s3 [14:05:19] <_joe_> if that's the case, that's possible [14:05:25] <_joe_> else, it seems a bit strange [14:05:34] both reverts really matches the timing as they were done within 1 minute or so [14:05:39] arwiki's more probable. [14:05:48] And RCPatrol shouldn't actually change anything [14:05:59] do we have any idea of the requests that caused the pileup, which selects they were? [14:06:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:06:23] marostegui: indeed, that graph seems to show things dropping off after the first revert [14:06:28] Just change how mediawiki behaves when saving an edit [14:06:33] (03PS1) 10Ottomata: Redirect pivot -> turnilo [puppet] - 10https://gerrit.wikimedia.org/r/434499 (https://phabricator.wikimedia.org/T194427) [14:07:00] (03CR) 10jerkins-bot: [V: 04-1] Redirect pivot -> turnilo [puppet] - 10https://gerrit.wikimedia.org/r/434499 (https://phabricator.wikimedia.org/T194427) (owner: 10Ottomata) [14:07:15] _joe_: as does your hhvm one [14:07:41] <_joe_> from what I see I'm not convinced the reverts changed something [14:07:43] marostegui: there was 5 mins between the logged time of the reverts [14:07:50] !log 
upgrading druid on druid100[123] to 0.11 [14:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:14] 10Operations, 10DBA, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222383 (10Marostegui) Things have recovered - the DBs errors, so far, look like a consequence and... [14:09:18] <_joe_> ah wait [14:09:25] <_joe_> the graph I showed you was for the api [14:09:30] <_joe_> the appservers one is different [14:09:33] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=19&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All [14:10:38] mediawiki exceptions seem to have finally gone in the last 5 mins [14:10:51] back to normal levels [14:11:17] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11267/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/434499 (https://phabricator.wikimedia.org/T194427) (owner: 10Ottomata) [14:11:22] time for me to cleanup tin [14:11:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:11:44] Urbanecm: what was the status of the second folloup for the logos so that I can see if we want to continue or revert them? [14:12:01] addshore, https://gerrit.wikimedia.org/r/#/c/434495/ is to be merged [14:12:08] I'm just uploading another static file [14:12:17] IMHO it's safe to deploy it. 
[14:12:49] I havn't seen a logo change cause terrible problems before, but I'll wait for the room consensus :P [14:12:53] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: upgrade to Druid 0.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/432582 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [14:12:57] (03PS4) 10Elukey: role::druid::analytics::worker: upgrade to Druid 0.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/432582 (https://phabricator.wikimedia.org/T193712) [14:13:18] (03PS2) 10Ottomata: Redirect pivot -> turnilo [puppet] - 10https://gerrit.wikimedia.org/r/434499 (https://phabricator.wikimedia.org/T194427) [14:14:34] addshore, I haven't seen a rate limit change from 60 to 90 per minute per user causing terrible problems... [14:15:40] Urbanecm: I'm speculating but I think somehow the rate that mediawiki fell back to might operate in a different way, but that's to be investigated later :) [14:17:11] (03PS2) 10Addshore: Use correct logo-size for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434495 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [14:17:14] PROBLEM - DPKG on mw2262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:17:22] Urbanecm: so that is all that is needed for the wikimania2018 logo? 
[14:17:35] addshore, I'm pretty sure that yes [14:17:44] PROBLEM - HHVM rendering on mw2262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [14:17:52] (03CR) 10Addshore: [C: 032] Use correct logo-size for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434495 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [14:18:03] PROBLEM - Nginx local proxy to apache on mw2262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.162 second response time [14:18:13] PROBLEM - Apache HTTP on mw2262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [14:19:04] PROBLEM - Check systemd state on mw2262 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:19:13] PROBLEM - HHVM processes on mw2262 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [14:19:38] (03Merged) 10jenkins-bot: Use correct logo-size for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434495 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [14:19:43] PROBLEM - puppet last run on mw2261 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [14:20:19] (03CR) 10jenkins-bot: Use correct logo-size for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434495 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [14:20:33] Urbanecm: right, the logo changes are on mwdebug1002 again [14:20:37] please check [14:20:48] addshore, working, please deploy to the whole universe [14:21:01] both wikimania and the wikibooks one? 
:) [14:21:09] Just wikimania [14:21:21] I just checked wikibooks as well [14:21:32] So please deploy both patches [14:21:40] ack [14:24:13] RECOVERY - HHVM rendering on mw2262 is OK: HTTP OK: HTTP/1.1 200 OK - 73126 bytes in 1.126 second response time [14:24:13] RECOVERY - Check systemd state on mw2262 is OK: OK - running: The system is fully operational [14:24:23] RECOVERY - HHVM processes on mw2262 is OK: PROCS OK: 6 processes with command name hhvm [14:24:24] RECOVERY - Nginx local proxy to apache on mw2262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.208 second response time [14:24:31] !log addshore@tin Synchronized static/images/project-logos/wikimania2018wiki.png: SWAT: [[gerrit:432549|Change logo for wikimania2018wiki]] T194340 (duration: 01m 19s) [14:24:33] RECOVERY - Apache HTTP on mw2262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.169 second response time [14:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:36] T194340: Change Wikimania2018 site logo - https://phabricator.wikimedia.org/T194340 [14:24:37] !log that last also included https://gerrit.wikimedia.org/r/#/c/434495/ - Use correct logo-size for wikimania2018wiki [14:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:54] RECOVERY - DPKG on mw2262 is OK: All packages OK [14:26:47] !log addshore@tin Synchronized static/images/project-logos/liwikibooks.png: SWAT: T193680 [[gerrit:432551|Change liwikibooks logo]] (duration: 01m 18s) [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:51] T193680: Logo change request li.wikibooks - https://phabricator.wikimedia.org/T193680 [14:26:51] 10Operations, 10Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4222424 (10Vgutierrez) @MaxBioHazard I'm still seeing (at least) one bot related to you using AES128-SHA, one using the account [[ 
https://ru.wikipedia.org/wiki/%D0%A3%D1%87%D0%B... [14:27:42] Urbanecm: syncing the last thing now [14:27:50] addshore, ack [14:27:59] I wont do the yiwikisource changes now though, please reschedule them! [14:28:48] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T193680 [[gerrit:434494|Use uploaded logo in wgLogo for liwikibooks]] (duration: 01m 16s) [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:41] addshore, already rescheduled [14:29:49] COOL [14:29:52] !log SWAT done [14:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:58] And sorry for eating EU SWAT and 1/4 of your window :D [14:30:23] hehe [14:30:28] * addshore goes to grab a drink before continuing [14:30:46] Seeing as the site somewhat recovered: I asked this in #wikimedia-tech and am repeating it here – if we have to purge 350K pages in Russian Wikipedia ASAP because of a bad edit that screwed up all infoboxes in the project, is it better to do via a server script or a bot? 
[14:31:36] Basically, all/major parts of readers see this in thousands upon thousands of articles: https://cdn.discordapp.com/attachments/334077033032187906/448480418216411146/unknown.png [14:36:18] (03PS4) 10Addshore: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 [14:36:34] (03CR) 10Addshore: [C: 032] Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 (owner: 10Addshore) [14:38:18] (03Merged) 10jenkins-bot: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 (owner: 10Addshore) [14:40:41] (03CR) 10jenkins-bot: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 (owner: 10Addshore) [14:44:57] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4222471 (10Gehel) From conversation with @Eevans: > I would consider altering compaction settings to make it more aggressive about performing tombstone compactions. > > See https://docs.datasta... [14:45:21] 10Operations, 10Discovery, 10Maps: Track more detailed disk usage - https://phabricator.wikimedia.org/T194997#4215213 (10RobH) Are these the 'maps[12]XXX' servers specifically? If so, the entire fleet needs an upgrade by how much? Basically wondering if you had already listed off each system (somewhere), t... 
[14:45:54] RECOVERY - puppet last run on mw2261 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:47:18] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:430921|Wikidata dispatch, set defaults for dispatchChanges settings]] (duration: 01m 19s) [14:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:35] apergos: If you feel like it I would be ready for https://gerrit.wikimedia.org/r/#/c/430923/ now :) [14:51:01] (03PS1) 10Elukey: role::druid::analytics::worker: correct hdfs extension [puppet] - 10https://gerrit.wikimedia.org/r/434503 (https://phabricator.wikimedia.org/T193712) [14:51:18] I'm looking, though also fussing about the earlier connections pileup [14:51:40] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: correct hdfs extension [puppet] - 10https://gerrit.wikimedia.org/r/434503 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [14:52:05] apergos: understandable :) [14:52:27] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4214254 (10RobH) Are these the 'maps[12]XXX' servers specifically? If so, the entire fleet needs an upgrade by how much? Basically wondering if you had already listed off each system (somewher... 
[14:52:46] 10Operations, 10Discovery, 10Maps, 10hardware-requests: Increase storage on maps* servers - https://phabricator.wikimedia.org/T195285#4222503 (10Gehel) [14:53:55] (03PS9) 10Ema: mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) (owner: 10Krinkle) [14:54:44] 10Operations, 10ops-codfw, 10fundraising-tech-ops: Interface errors on pfw3a-codfw:xe-0/0/17 - https://phabricator.wikimedia.org/T195216#4222516 (10Papaul) a:05Papaul>03ayounsi optic replaced [14:55:11] (03CR) 10Ema: [C: 032] mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) (owner: 10Krinkle) [14:55:23] PROBLEM - Host db2064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:56:15] 10Operations, 10Discovery, 10Maps, 10hardware-requests: Increase storage on maps* servers - https://phabricator.wikimedia.org/T195285#4222526 (10RobH) Are these the 'maps[12]XXX' servers specifically? If so, the entire fleet needs upgrade? (I don't want to assume; ) This would make adding SSDs easier, j... [14:57:32] addshore: it looks ok, I would whine a bit about dispatchMaxTime in config/Wikibase.default.php being PHP_MAX_INT, that seems like a recipe for disaster one day [14:58:35] (03PS1) 10Pmiazga: Add fonts-noto to support priting non-latin characters [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) [14:58:39] hmm, I could add a note there, or potentially pass it around as a string [14:59:00] (03CR) 10Vgutierrez: Avoid Deferred.cancel() induced CancelledErrors (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433364 (owner: 10Mark Bergsma) [14:59:13] oh wait, you mean literally PHP_INT_MAX rather than the value [14:59:33] We can change that :) [14:59:56] yes I mean really you want things to run forever? 
prolly not :-) [15:00:15] (for smallish values of forever) [15:00:53] RECOVERY - Host db2064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.66 ms [15:00:53] (03CR) 10Vgutierrez: [C: 031] Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/430338 (owner: 10Mark Bergsma) [15:02:17] (03CR) 10Eevans: [C: 031] cassandra: cassandra-tools always make sense to be available [puppet] - 10https://gerrit.wikimedia.org/r/434462 (owner: 10Gehel) [15:03:09] regarding the earlier outage, after digging down through the logs I see a fair amount of "Memcached error for key "{memcached-key}" on server "{memcached-server}": CONNECTION FAILURE" and "Using cached lag value for {db_server} due to active transaction", not sure if that is normal or not [15:04:28] (03PS1) 10Arturo Borrero Gonzalez: reprepro: add missing UDebComponents: stanza for mono-project repos [puppet] - 10https://gerrit.wikimedia.org/r/434507 (https://phabricator.wikimedia.org/T194665) [15:04:34] nope, that looks kind of regular, hmpf [15:05:17] (03PS2) 10Arturo Borrero Gonzalez: reprepro: add missing UDebComponents: stanza for mono-project repos [puppet] - 10https://gerrit.wikimedia.org/r/434507 (https://phabricator.wikimedia.org/T194665) [15:05:48] (03PS1) 10Urbanecm: New protection level on the Hungarian Wikipedia - trusted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434508 (https://phabricator.wikimedia.org/T194568) [15:06:02] (03CR) 10Arturo Borrero Gonzalez: [C: 032] reprepro: add missing UDebComponents: stanza for mono-project repos [puppet] - 10https://gerrit.wikimedia.org/r/434507 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [15:07:01] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:43] PROBLEM - Druid coordinator on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator [15:12:13] PROBLEM - Check systemd state on druid1002 is
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:12:13] PROBLEM - Druid coordinator on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator [15:12:23] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:13:24] RECOVERY - Check systemd state on druid1002 is OK: OK - running: The system is fully operational [15:13:33] RECOVERY - Druid coordinator on druid1002 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator [15:13:43] PROBLEM - Host db2064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:20] apergos: I made https://gerrit.wikimedia.org/r/#/c/434510/ in Wikibase to switch the default to 1 hour [15:14:53] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational [15:14:56] thumbs up from me [15:15:23] do you want my +1 on it or nah? [15:15:29] druid is me sorry [15:15:33] RECOVERY - Druid coordinator on druid1003 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator [15:15:41] apergos: interestingly the default has been the max php int since the start of the script it would appear :D I523a9f79422a6da538038e4634c249412b623fe6 [15:15:44] apergos: a +1 would be great [15:15:52] thanks, elukey, that was the next thing I was going to have to hunt down [15:16:03] addshore: I figured it had been that way for a long time [15:17:30] forgot to +1 with the comment. oops.
[15:19:46] !log ppchelko@tin Started deploy [cpjobqueue/deploy@4312549]: Increase the concurrency for low traffic topics [15:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:27] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@4312549]: Increase the concurrency for low traffic topics (duration: 00m 41s) [15:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:41] 10Operations, 10MediaWiki-extensions-Translate, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222630 (10jcrespo) The issue seems to point to {F18473907} The query... [15:22:10] (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [15:22:56] (03CR) 10Volans: "Thanks to have addressed my comments. They look good." (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [15:23:18] (03PS1) 10Elukey: profile::prometheus::alerts: fix druid alert [puppet] - 10https://gerrit.wikimedia.org/r/434512 [15:24:18] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: fix druid alert [puppet] - 10https://gerrit.wikimedia.org/r/434512 (owner: 10Elukey) [15:24:29] jouncebot: now [15:24:30] For the next 0 hour(s) and 35 minute(s): Wikibase dispatch lock manager change (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1400) [15:24:53] RECOVERY - Host db2064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.04 ms [15:24:58] apergos: I can actually move the puppet change to puppet swat probably [15:25:31] works for me if it works for you [15:25:53] yup!
[15:29:53] (03CR) 10Vgutierrez: [C: 04-1] "see inline comments" (036 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 (owner: 10Mark Bergsma) [15:31:09] 10Operations, 10Discovery, 10Maps: Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997#4222640 (10Gehel) [15:33:50] 10Operations, 10ops-codfw, 10fundraising-tech-ops: Interface errors on pfw3a-codfw:xe-0/0/17 - https://phabricator.wikimedia.org/T195216#4222651 (10ayounsi) 05Open>03Resolved node0 back to being master. [15:34:15] (03CR) 10Mark Bergsma: Extend unit testing of RunCommand (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 (owner: 10Mark Bergsma) [15:35:44] (03PS1) 10Ottomata: Install stat1004 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/434517 (https://phabricator.wikimedia.org/T192640) [15:36:02] (03PS2) 10Ottomata: Install stat1004 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/434517 (https://phabricator.wikimedia.org/T192640) [15:36:05] (03CR) 10Ottomata: [V: 032 C: 032] Install stat1004 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/434517 (https://phabricator.wikimedia.org/T192640) (owner: 10Ottomata) [15:37:33] RECOVERY - Host elastic2020 is UP: PING WARNING - Packet loss = 50%, RTA = 36.84 ms [15:37:49] (03CR) 10Vgutierrez: [C: 031] Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 (owner: 10Mark Bergsma) [15:41:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4222666 (10Papaul) a:05Papaul>03Marostegui @Marostegui using the power button on the server to power the server doesn't work. Draining the power from the server didn't help as well The serv... [15:43:13] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4222671 (10Marostegui) Can we try to swap its PSU with another server from the ones we've decommissioned? 
Are those compatibles? [15:46:35] (03CR) 10Vgutierrez: Add full unit test coverage of IdleConnection (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433341 (owner: 10Mark Bergsma) [15:46:46] (03CR) 10Elukey: [C: 032] Add the community extension for Parquet [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/433131 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [15:47:51] (03CR) 10Mark Bergsma: Add full unit test coverage of IdleConnection (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433341 (owner: 10Mark Bergsma) [15:49:10] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4222681 (10Marostegui) From my chat with @Papaul - We have no compatible PSUs from the servers that were decommissioned (they are different vendors) - Changing the power socket/cable didn't ha... [15:49:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4222682 (10Papaul) @Marostegui no there are not [15:53:53] (03PS1) 10Elukey: Remove mysql-metadata-storage-0.10.0.jar [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/434520 (https://phabricator.wikimedia.org/T193712) [15:54:23] (03CR) 10Elukey: [C: 032] Remove mysql-metadata-storage-0.10.0.jar [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/434520 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [15:59:03] (03CR) 10Vgutierrez: [C: 031] Handle HTTP status 302 and 303 as well as 301 [debs/pybal] - 10https://gerrit.wikimedia.org/r/430393 (https://phabricator.wikimedia.org/T102393) (owner: 10Mark Bergsma) [16:00:04] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1600). [16:00:04] addshore: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:01:01] 10Operations, 10Discovery, 10Maps: Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997#4222719 (10Gehel) Note that Cassandra usage is already tracked [[ https://graphite.wikimedia.org/render/?width=800&height=600&target=cassandra.maps1001.org.apache.cassandra.metrics.St... [16:01:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4222724 (10Papaul) @Marostegui let me know if this racking proposal works for you db2094 row A rack A6 db2095 row C rack C6 [16:01:48] o/ [16:02:01] * addshore is here for puppet swat if anyone feels like attacking my patch [16:02:36] (03PS1) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [16:03:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4222726 (10Marostegui) So, looks like this server is lost for good. We have no other similar servers decommissioned we cannot replace spares pieces. Our DCOps suggestion is to basically decommi... [16:03:41] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4222728 (10jcrespo) @Papaul, that will work, only requirement is hosts being on separate rows. 
[16:04:28] (03CR) 10jerkins-bot: [V: 04-1] profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [16:05:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4222733 (10Papaul) @jcrespo Thanks [16:06:36] ACKNOWLEDGEMENT - HP RAID on elastic2020 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195306 [16:06:42] 10Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T195306#4222734 (10ops-monitoring-bot) [16:08:02] (03CR) 10Vgutierrez: Cleanup monitor shutdown handler (invoking stop) after run (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 (owner: 10Mark Bergsma) [16:08:57] addshore: o/ [16:09:06] I am reading your patch [16:09:08] \o [16:09:11] elukey: thanks! [16:09:24] so IIUC basically you are removing defaults that are now saved in mw right? [16:09:33] yup [16:09:49] it should result in 0 change to dispatching :) [16:10:16] yep the values are consistent, I checked also the other change [16:10:21] (03PS3) 10Zoranzoki21: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) [16:10:26] (03PS4) 10Zoranzoki21: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) [16:10:40] so afaics it seems fine to me, but I'd also ask to other people that are more involved or that can be affected [16:11:07] (03CR) 10Zoranzoki21: "> See @A2093064'inline comments." 
(033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [16:11:21] for example, say that changing/tuning those parameters cause a db overload etc.. [16:11:31] people directly involved should be aware of this change [16:11:41] so what I'd do is [16:12:00] 1) add a comment on the crons in puppet explaining a bit where things are tuned [16:12:19] 2) add also Jaime/Manuel to the code change so they are aware [16:13:05] addshore: does it make sense? [16:13:11] yup, can do! [16:13:20] super [16:13:25] after that I think we can merge [16:13:29] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4222739 (10Marostegui) [16:13:38] (03CR) 10Addshore: [C: 04-1] "TODOS:" [puppet] - 10https://gerrit.wikimedia.org/r/430923 (owner: 10Addshore) [16:14:07] !log upgrading remaining video scalers to HHVM 3,18.5+dfsg-1+wmf8+deb9u1 [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:18] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434527 (https://phabricator.wikimedia.org/T195228) [16:19:05] (03PS5) 10Addshore: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 [16:19:53] elukey: all done :) [16:20:27] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434527 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:20:43] addshore: checking puppet compiler [16:21:28] (03CR) 10Alexandros Kosiaris: [C: 032] Add fonts-noto to support priting non-latin characters [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [16:21:31] (03PS2) 10Alexandros Kosiaris: Add fonts-noto to 
support priting non-latin characters [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [16:21:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add fonts-noto to support priting non-latin characters [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [16:22:22] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434527 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:24:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db2064 from config - T195228 (duration: 01m 19s) [16:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:38] T195228: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 [16:25:07] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4222766 (10Marostegui) [16:25:08] (03PS6) 10Elukey: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 (owner: 10Addshore) [16:25:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11268/" [puppet] - 10https://gerrit.wikimedia.org/r/430923 (owner: 10Addshore) [16:25:35] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4220093 (10Marostegui) [16:25:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2064 from config - T195228 (duration: 01m 18s) [16:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:33] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434527 
(https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:26:41] (03CR) 10Elukey: [C: 032] "Checked that all the args/values removed were in Ie42438911d51cf0ff598926fea9e299a5e5e61ed, everything looks good. Change looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/430923 (owner: 10Addshore) [16:27:53] (03PS1) 10Marostegui: s2.hosts: Remove db2064 [software] - 10https://gerrit.wikimedia.org/r/434530 (https://phabricator.wikimedia.org/T195228) [16:28:44] (03CR) 10Marostegui: [C: 032] s2.hosts: Remove db2064 [software] - 10https://gerrit.wikimedia.org/r/434530 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:29:12] addshore: deployed on terbium :) [16:29:47] (03Merged) 10jenkins-bot: s2.hosts: Remove db2064 [software] - 10https://gerrit.wikimedia.org/r/434530 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:30:55] (03PS1) 10Marostegui: mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) [16:31:30] elukey: thanks! [16:31:39] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:31:43] * addshore watches dispatching [16:31:49] (03PS2) 10Marostegui: mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) [16:32:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:33:16] 10Operations, 10Community-Liaisons, 10Developer-Relations, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683#4222808 (10Elitre) >>! In T155683#4202015, @Dzahn wrote: > Let us know if you are still interested in list renaming. > > A renamin...
[16:33:18] (03PS3) 10Marostegui: mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) [16:33:55] PROBLEM - Disk space on stat1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:34:14] this is probably stat1004 coming back from the reimage --^ [16:34:49] (03PS4) 10Marostegui: mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) [16:35:45] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4222815 (10Marostegui) [16:35:48] (03CR) 10Muehlenhoff: "Better use mediawiki::packages::fonts, this would cover all the cases for other fonts usually used on our application servers as well." [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [16:37:00] (03CR) 10Marostegui: [C: 032] mariadb: Set db2064 to spare [puppet] - 10https://gerrit.wikimedia.org/r/434531 (https://phabricator.wikimedia.org/T195228) (owner: 10Marostegui) [16:37:04] PROBLEM - MD RAID on stat1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:38:08] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4222836 (10Marostegui) a:05Marostegui>03RobH [16:38:30] marostegui: sadness =[ [16:38:35] :-( [16:38:51] https://www.youtube.com/watch?v=rY0WxgSXdEE [16:38:58] hahaha [16:39:04] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4220093 (10Marostegui) This system is now ready to be decommissioned :-( [16:39:23] _joe_: marostegui volans jynus etc. 
started an incident documentation page for the issue earlier FYI https://wikitech.wikimedia.org/wiki/Incident_documentation/20180522-MediaWiki [16:39:49] addshore: thanks! [16:40:06] +1 [16:40:53] Urbanecm: ^^ also may be of interest to you as they were your swat patches :) [16:41:34] PROBLEM - configured eth on stat1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:43:05] PROBLEM - dhclient process on stat1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:43:28] !log upload druid debs 0.11.0-3 to jessie-wikimedia [16:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:44] PROBLEM - puppet last run on stat1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:45:15] downtimed stat1004 [16:46:24] RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:46:25] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up [16:46:34] RECOVERY - Disk space on stat1004 is OK: DISK OK [16:46:40] elukey: disable notifications if you don't want the recovery spam ;) [16:46:54] RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient [16:49:12] volans: I like recovery spam! [16:49:19] it makes me feel good [16:49:22] lol [16:51:36] !log addshore@terbium:~$ mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki testwikidatawiki --reindexAndRemoveOk --indexIdentifier=now [16:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:32] !log restart zookeeper on druid100[1,3] to complete the openjdk-8 upgrade [16:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:10] <_joe_> addshore: <3 [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1700). [17:00:22] (03PS3) 10Dzahn: zuul: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) [17:00:26] (03CR) 10Dzahn: zuul: base::service_unit -> systemd::service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:02:12] (03PS2) 10Dzahn: snapshot/dumps-monitor: replace base:service_unit [puppet] - 10https://gerrit.wikimedia.org/r/434431 (https://phabricator.wikimedia.org/T194724) [17:02:29] (03CR) 10Paladox: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:03:36] !log Deploy schema change on s8 codfw primary master (db2045), this will generate lag on codfw - T194270 [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:41] T194270: Drop 'tmp1' index from wb_terms table in production - https://phabricator.wikimedia.org/T194270 [17:05:12] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:09:40] (03CR) 10Dzahn: [C: 032] "hosts affected: snapshot1007 ( role(dumps::generation::worker::dumper_misc)" [puppet] - 10https://gerrit.wikimedia.org/r/434431 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:11:04] (03PS1) 10Elukey: profile::prometheus::alerts: disable druid realtime alert [puppet] - 10https://gerrit.wikimedia.org/r/434533 (https://phabricator.wikimedia.org/T193712) [17:11:33] (03CR) 10Dzahn: [C: 032] "noop" [puppet] - 10https://gerrit.wikimedia.org/r/434431 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:11:50] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: disable druid realtime alert [puppet] - 10https://gerrit.wikimedia.org/r/434533 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) 
[17:12:51] (03PS2) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [17:14:12] (03CR) 10jerkins-bot: [V: 04-1] profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [17:16:03] (03PS1) 10Dzahn: keyholder: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434535 (https://phabricator.wikimedia.org/T194724) [17:17:00] _joe_: ^ are you for deleting upstart files as well? [17:17:14] i wasnt 100% sure if i want to delete the templates, but i guess [17:17:26] so [17:18:02] (03PS1) 10Arturo Borrero Gonzalez: reprepro: use right syntax for grep-dctrl [puppet] - 10https://gerrit.wikimedia.org/r/434536 (https://phabricator.wikimedia.org/T194665) [17:18:37] (03PS2) 10Arturo Borrero Gonzalez: reprepro: use right syntax for grep-dctrl [puppet] - 10https://gerrit.wikimedia.org/r/434536 (https://phabricator.wikimedia.org/T194665) [17:19:52] (03CR) 10Arturo Borrero Gonzalez: [C: 032] reprepro: use right syntax for grep-dctrl [puppet] - 10https://gerrit.wikimedia.org/r/434536 (https://phabricator.wikimedia.org/T194665) (owner: 10Arturo Borrero Gonzalez) [17:30:01] (03PS1) 10Dzahn: k8s: remove trusty/upstart support [puppet] - 10https://gerrit.wikimedia.org/r/434537 (https://phabricator.wikimedia.org/T194724) [17:33:20] (03PS1) 10Dzahn: jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) [17:33:54] (03CR) 10jerkins-bot: [V: 04-1] jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:35:12] (03PS1) 10Dzahn: sentry: base::service_unit -> systemd::service [puppet] -
10https://gerrit.wikimedia.org/r/434539 (https://phabricator.wikimedia.org/T194724) [17:37:23] <_joe_> mutante: yes, I'm all for deleting upstart files; it's a dead technology no one uses anymore [17:37:34] <_joe_> trusty is going EOS in less than one year [17:37:40] _joe_: ok :) sounds good [17:37:46] <_joe_> but in the case of k8s, you can't [17:37:55] <_joe_> I think toolsforge has trusty k8s nodes [17:38:04] <_joe_> but check with chasemp and andrewbogott [17:38:04] i was about to ask.. still trusty? .. ok [17:38:08] alright [17:39:11] with jenkins it uses the "refresh" parameter, we dont have that anymore. replacing with "restart" [17:39:23] it explicitly sets it to false for jenkins [17:40:26] (03PS2) 10Dzahn: jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) [17:40:27] <_joe_> so, the default behaviour changed between base::service_unit and systemd::service [17:40:42] <_joe_> in the former, by default the service would be restarted [17:40:50] <_joe_> in the latter, by default it's not restarted [17:41:28] <_joe_> for services that don't need to be restarted in a controlled way, or that do subscribe to other resources, I'd advise to declare restart => true [17:42:20] (03PS2) 10Volans: Icinga: add check and retry intervals for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/434455 (https://phabricator.wikimedia.org/T173050) [17:43:22] ah, this makes sense! 
yes, then i can just remove that [17:44:08] (03PS1) 10Krinkle: build: Update codesniffer and fix violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434540 [17:44:25] (03PS3) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [17:44:30] (03PS3) 10Dzahn: jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) [17:45:49] (03CR) 10jerkins-bot: [V: 04-1] profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [17:46:05] (03CR) 10Krinkle: [C: 032] build: Update codesniffer and fix violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434540 (owner: 10Krinkle) [17:47:52] (03Merged) 10jenkins-bot: build: Update codesniffer and fix violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434540 (owner: 10Krinkle) [17:48:12] i'll check where i should add a restart => true that dont have it now [17:49:18] (03PS4) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [17:49:42] (03CR) 10jenkins-bot: build: Update codesniffer and fix violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434540 (owner: 10Krinkle) [17:51:19] (03CR) 10Volans: [C: 032] Icinga: add check and retry intervals for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/434455 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [17:52:19] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4223034 (10Pnorman) Just to note, the immediate disk space usage issues will get better when reimaging as part of the new style 
setup, because it will completely reset Cassandra. If we were usi... [17:53:59] (03PS2) 10Volans: Icinga: reduce checks frequency for SMART and EDAC [puppet] - 10https://gerrit.wikimedia.org/r/434456 (https://phabricator.wikimedia.org/T173050) [17:56:49] (03PS1) 10Dzahn: dumps/snapshot: add 'restart => true' to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434542 (https://phabricator.wikimedia.org/T194724) [17:59:02] (03PS2) 10Dzahn: dumps/snapshot: add 'restart => true' to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434542 (https://phabricator.wikimedia.org/T194724) [17:59:43] (03CR) 10Volans: [C: 032] "@akosiaris: I agree that 30m might even be too often. I've used the one of the IPMI check. I'll merge as is for now, we can increase it to" [puppet] - 10https://gerrit.wikimedia.org/r/434456 (https://phabricator.wikimedia.org/T173050) (owner: 10Volans) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1800) [18:06:02] volans: just one drawback with long check times.. when hosts are reinstalled there is more room for icinga spam until it recovers, heh [18:06:23] i manually clicked "schedule next service check" though [18:09:04] mutante: sure, but for that we'll need anyway to find a proper solution, although it's tricky [18:09:33] but if you see the task and grafana einsteinium was quite loaded by this check and 1m check of those 2 things is actually useless [18:09:42] (03CR) 10Pmiazga: "Muehlenhoff: that sounds interesting - could you share an example how to do that? this is my first commit to operations/puppet repo" [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [18:10:02] volans: yes absolutely, dont get me wrong, your change is definitely right. 
i saw the ticket [18:13:21] (03CR) 10Mobrovac: [C: 031] "> Better use mediawiki::packages::fonts, this would cover all the" [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [18:17:34] (03PS1) 10Mobrovac: Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) [18:18:11] (03CR) 10jerkins-bot: [V: 04-1] Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [18:20:23] (03PS2) 10Mobrovac: Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) [18:20:52] (03CR) 10Mobrovac: [C: 031] "Done in I6595b55c3d45e99fd7683442b45cb728796e02ac" [puppet] - 10https://gerrit.wikimedia.org/r/434506 (https://phabricator.wikimedia.org/T186748) (owner: 10Pmiazga) [18:20:59] (03CR) 10jerkins-bot: [V: 04-1] Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [18:30:52] (03PS3) 10Dzahn: dumps/snapshot: add 'restart => true' to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434542 (https://phabricator.wikimedia.org/T194724) [18:32:51] (03PS3) 10Mobrovac: Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) [18:32:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.136 second response time [18:42:56] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1973 bytes in 0.126 second response time [18:44:41] !log tilerator deploying a4a3fc7b2694a469cc26a6d0cc440d4d46fa9485 [18:44:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:21] !log pnorman@tin Started deploy [tilerator/deploy@a4a3fc7]: (no justification provided) [18:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:55] (03CR) 10Dzahn: [C: 032] dumps/snapshot: add 'restart => true' to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434542 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:50:06] mutante: do you know how to edit ssh fingerprint wikitech pages? [18:50:15] https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/stat1004.eqiad.wmnet&action=history [18:50:18] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1004.eqiad.wmnet [18:50:21] yargh [18:50:27] oh yeah that one [18:50:28] i can't edit it [18:50:31] but i did just reinstlal the node today [18:51:07] PROBLEM - tileratorui on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6535: Connection refused [18:51:37] PROBLEM - tilerator on maps-test2001 is CRITICAL: connect to address 10.192.0.128 and port 6534: Connection refused [18:52:24] ^ mostly expected, we're testing a deployment with pnorman [18:56:30] ottomata: it's because quiddity protected it. i guess i must be a wikitech admin because it didnt keep me from it [18:57:58] hm mutante would you mind just editing it real quick for me with new output from stat1004? [18:58:40] ottomata: or i can make you an admin.. or a "content admin" after checking if that does it [18:58:46] i dont mind. you can pick what you prefer [18:59:00] yes, i'm doing the edit [18:59:39] either! [19:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T1900). 
[19:03:24] ottomata: https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/stat1004.eqiad.wmnet&action=history [19:03:27] ottomata: Dzahn (talk | contribs | block) changed group membership for Ottomata from cloud administrator, confirmed user and shell user to cloud administrator, confirmed user, shell user and administrator (needs to be able to edit fingerprint pages) [19:03:28] !log Today's train: Promoting group2 wikis to 1.32.0-wmf.4 followed by group0 to 1.32.0-wmf.5 [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:52] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4223186 (10pmiazga) @mobrovac @Amire80 - Introducing NotoSans font set fixed many languages (I checked ~25 languages including Hindi, Urdu, A... [19:10:58] (03PS1) 1020after4: group2 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434550 [19:11:00] (03CR) 1020after4: [C: 032] group2 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434550 (owner: 1020after4) [19:12:18] (03Merged) 10jenkins-bot: group2 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434550 (owner: 1020after4) [19:12:33] (03CR) 10jenkins-bot: group2 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434550 (owner: 1020after4) [19:14:28] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group2 wikis to 1.32.0-wmf.4 refs T191050 [19:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:32] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [19:17:43] !log twentyafterfour@tin Synchronized php-1.32.0-wmf.4/extensions/ContentTranslation/: sync https://gerrit.wikimedia.org/r/#/c/434529/ to fix T194810 (duration: 01m 
02s) [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:48] T194810: Notice: Undefined index: CX_CATEGORY_METADATA in /srv/mediawiki/php-1.32.0-wmf.3/extensions/ContentTranslation/includes/CorporaLookup.php on line 95 - https://phabricator.wikimedia.org/T194810 [19:19:30] thank you mutante! [19:21:24] I wasn't seeing this until today: Notice: Array to string conversion in /srv/mediawiki/php-1.32.0-wmf.4/extensions/Scribunto/includes/engines/LuaCommon/UstringLibrary.php on line 732 [19:21:41] something got deployed to scribunto recently? [19:22:18] looks like no..hmm [19:23:48] also seeing this which is unfamiliar: Translation search server unavailable: Result window is too large, from + size must be less than or equal to: [10000] but was [11000]. [19:32:28] (03CR) 10Pmiazga: [C: 031] Proton: Require the standard MW fonts [puppet] - 10https://gerrit.wikimedia.org/r/434545 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [19:34:39] !log pnorman@tin Finished deploy [tilerator/deploy@a4a3fc7]: (no justification provided) (duration: 46m 19s) [19:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:33] !log pnorman@tin Started deploy [tilerator/deploy@18faaa6]: (no justification provided) [19:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:58] !log pnorman@tin Finished deploy [tilerator/deploy@18faaa6]: (no justification provided) (duration: 00m 25s) [19:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:39] RECOVERY - tileratorui on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.095 second response time [19:36:59] RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.094 second response time [19:38:45] !log pnorman@tin Started deploy [tilerator/deploy@18faaa6]: (no justification provided) [19:38:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:13] !log pnorman@tin Finished deploy [tilerator/deploy@18faaa6]: (no justification provided) (duration: 00m 29s) [19:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:17] 10Operations, 10RESTBase-API, 10Traffic, 10Services (doing): Normalise the Accept-Language header for REST API requests - https://phabricator.wikimedia.org/T195327#4223311 (10mobrovac) p:05Triage>03High [19:41:38] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.093 second response time [19:41:39] 10Operations, 10RESTBase-API, 10Traffic, 10Services (doing): Normalise the Accept-Language header for REST API requests - https://phabricator.wikimedia.org/T195327#4223325 (10mobrovac) [19:43:11] !log pnorman@tin Started deploy [tilerator/deploy@18faaa6]: Update tilerator, fix variable substitution [19:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:57] !log pnorman@tin Finished deploy [tilerator/deploy@18faaa6]: Update tilerator, fix variable substitution (duration: 02m 46s) [19:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:10] (03PS1) 10Mobrovac: VCL: Normalise the Accept-Language header for the REST API [puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) [19:51:39] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1963 bytes in 0.107 second response time [20:21:23] (03PS1) 10Addshore: testwikidata: Add Property NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434591 [20:23:08] (03PS1) 10Addshore: testwikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434592 (https://phabricator.wikimedia.org/T191458) 
[20:23:29] (03CR) 10Ladsgroup: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: 10ArielGlenn) [20:24:56] (03PS1) 10Addshore: wikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434593 (https://phabricator.wikimedia.org/T191457) [20:25:57] (03PS2) 10Addshore: testwikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434592 (https://phabricator.wikimedia.org/T191458) [20:25:57] (03PS2) 10Addshore: wikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434593 (https://phabricator.wikimedia.org/T191457) [20:26:09] (03PS3) 10Addshore: wikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434593 (https://phabricator.wikimedia.org/T191457) [20:29:44] (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434596 [20:29:46] (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434596 (owner: 1020after4) [20:31:28] (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434596 (owner: 1020after4) [20:33:17] !log twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.5 refs T191051 [20:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:22] T191051: 1.32.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T191051 [20:40:51] addshore, thanks for the incident doc link. 
Watchlisted [20:40:56] Looking forward for futher details [20:41:06] *further [20:42:32] (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434596 (owner: 1020after4) [20:45:20] (03CR) 10Paladox: [C: 031] "bump :)" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [20:47:55] addshore, just as I'd like to have everything ok. When I can try to schedule the problematic patches again? After the issue will be resolved? After somebody with enough authority will ping me? Sometime else? [20:48:03] I'm just not familiar with the process after such issues [20:48:17] Urbanecm: I think we need to try to look into what caused the issue a bit more first [20:49:01] That's true definitely. My question is how I will know it's ok to try it again :) [20:49:36] I just don't know what the "a bit" means :). [20:57:21] (03PS1) 10Urbanecm: Add HD logos to static for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434598 (https://phabricator.wikimedia.org/T194340) [20:57:23] (03PS1) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) [20:57:25] (03PS1) 10Urbanecm: Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) [20:59:05] (03CR) 10jerkins-bot: [V: 04-1] Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [21:03:03] (03PS2) 10Urbanecm: Add HD logos to static for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434598 (https://phabricator.wikimedia.org/T194340) [21:03:53] (03PS2) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 
(https://phabricator.wikimedia.org/T194340) [21:04:58] (03PS3) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) [21:06:43] (03PS2) 10Urbanecm: Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) [21:07:08] (03PS3) 10Urbanecm: Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) [21:07:20] twentyafterfour: train went all okay? [21:09:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.094 second response time [21:13:12] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2833418486" --threads=10 --lang en --quiet' returned non-zero exit status 255 (duration: 39m 54s) [21:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:06] (03CR) 10DCausse: [C: 031] wikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434593 (https://phabricator.wikimedia.org/T191457) (owner: 10Addshore) [21:14:23] (03PS3) 10Urbanecm: Add HD logos to static for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434598 (https://phabricator.wikimedia.org/T194340) [21:14:40] hehe, thats a no [21:14:42] (03Abandoned) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [21:14:49] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.105 second 
response time [21:15:38] (03PS4) 10Urbanecm: Change logo assets for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434598 (https://phabricator.wikimedia.org/T194340) [21:16:00] (03PS4) 10Urbanecm: Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) [21:17:20] (03PS5) 10Urbanecm: Change logo assets for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434598 (https://phabricator.wikimedia.org/T194340) [21:18:15] (03PS5) 10Urbanecm: Enable HD logos for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434600 (https://phabricator.wikimedia.org/T194340) [21:18:30] (03Restored) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [21:18:39] (03PS4) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) [21:18:58] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [21:19:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [21:19:25] (03Abandoned) 10Urbanecm: Change 1x logo for wikimania2018wiki with freshly generated one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434599 (https://phabricator.wikimedia.org/T194340) (owner: 10Urbanecm) [21:38:54] !log twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.5 refs T191051 [21:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:58] T191051: 1.32.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T191051 [21:40:00] (03Draft1) 10Paladox: 
Gerrit: Fix log4j rotating files [puppet] - 10https://gerrit.wikimedia.org/r/434605 [21:40:07] no_justification ^^ [21:40:08] (03PS2) 10Paladox: Gerrit: Fix log4j rotating files [puppet] - 10https://gerrit.wikimedia.org/r/434605 [21:41:15] 10Operations, 10MediaWiki-extensions-Translate, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222298 (10alanajjar) Many users in Arabic Wikipedia suffer from this pr... [21:51:50] (03PS2) 10Krinkle: Drop $wgTitle usages from robots.txt and extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428843 (owner: 10Chad) [21:56:48] (03CR) 10Krinkle: [C: 031] "Verified on mwdebug1001. Good to go :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428843 (owner: 10Chad) [22:00:20] (03PS3) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) [22:00:27] addshore: train went ok yes, it's still going though [22:01:10] I had to fix the CongressLookup extension again because I forgot to fix it the right way last week [22:07:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1981 bytes in 0.145 second response time [22:09:26] (03CR) 10Smalyshev: [C: 031] wikidata: Add Lexeme NS to wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434593 (https://phabricator.wikimedia.org/T191457) (owner: 10Addshore) [22:12:15] (03CR) 10Herron: [C: 04-2] logstash: add tcp tls input for syslogs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [22:12:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.104 second response time 
[22:20:44] twentyafterfour: hehe [22:25:21] paladox: on cobalt now. in gerrit log dir [22:25:26] ok thanks. [22:25:34] there are .gz files for replication and sshd_log [22:25:44] which ones specifically [22:25:53] error log [22:25:56] error_log [22:26:29] apparently log4j does not compress files without the log4j-extras package. [22:26:30] yes, we have error_log.$date.gz from 05-13 to 05-21 [22:26:47] ok thanks, i guess it must be gerrit that compresses them [22:26:54] error_log.2018-05-20.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT) [22:28:02] i was searching around, and all users say you have to use the log4j-extras package to gain compression. [22:29:52] (03CR) 10Paladox: "This problem only happens if you don't use the default gerrit log path which makes me think gerrit is doing the compressing." [puppet] - 10https://gerrit.wikimedia.org/r/434605 (owner: 10Paladox) [22:31:56] paladox: in the log4j config file: [22:31:56] 4 [22:32:01] yep [22:33:15] paladox: here's your answer [22:33:16] modules/gerrit/manifests/crons.pp: command => 'find /var/lib/gerrit2/review_site/logs/ -name "*.gz" -mtime +7 -delete', [22:33:24] yep [22:33:36] so it's not gerrit, it's the puppetized cron [22:33:40] seems with chad's change (moving to /var/log) it's not compressing the file [22:33:53] wait.. link to change please [22:34:12] mutante https://gerrit.wikimedia.org/r/#/c/423794/ [22:34:34] looking around, log4j indeed does not support compressing [22:34:38] without log4j-extras [22:34:42] which is not in gerrit. [22:36:15] ok, so that's just a cron for cleaning up old logs but not one to compress logs [22:36:18] i see [22:36:33] yep [22:36:44] i think gerrit is compressing somehow [22:36:47] for some reason i expected another one that does the gzip command [22:36:52] didn't we have it [22:37:08] i wonder if that's what we need: one that compresses the files?
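The open question above is what compresses the rotated logs, since the puppetized cron only deletes `*.gz` files. As a rough sketch of what a separate compressing step could look like, assuming rotated files are named with a date suffix such as `error_log.2018-05-20` (as in the listing quoted above): the function name `compress_rotated_logs` and the filename pattern are illustrative assumptions, not anything from Gerrit or the puppet repo.

```python
import gzip
import re
import shutil
from pathlib import Path

# Assumption: rotated logs end in a date suffix (error_log.2018-05-20),
# while the live log (error_log) and already-compressed files (*.gz) do not.
ROTATED = re.compile(r".+\.\d{4}-\d{2}-\d{2}$")


def compress_rotated_logs(log_dir):
    """Gzip every date-suffixed log in log_dir and return the new .gz paths."""
    compressed = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or not ROTATED.match(path.name):
            continue  # skip the live log, directories, and existing .gz files
        target = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # drop the uncompressed original once the copy is written
        compressed.append(target)
    return compressed
```

Run daily against the log directory, something like this would leave `error_log` untouched and turn each dated file into a `.gz` that the existing cleanup cron can then match.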
[22:37:39] maybe, but what is doing it now [22:37:54] I haven't figured that out :) [22:38:12] looks at more crontabs [22:38:20] it seems likely it's gerrit: after moving it out of logs/ it does not compress, putting it back in logs/ it compresses. [22:38:53] as you say it sounds like gerrit does it from the comment " # Gerrit rotates their own logs, but doesn't clean them out" [22:38:58] PROBLEM - Check systemd state on maps-test2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:39:09] PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused [22:39:29] rotate would mean both, gzip and delete old ones .. to me [22:39:39] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused [22:39:50] ah, i think log4j2 supports that. [22:39:53] is going to ignore that based on the "test" in that host name [22:40:08] oh nvm [22:40:12] log4j supports that mutante [22:40:13] https://stackoverflow.com/questions/1050256/how-can-i-get-log4j-to-delete-old-rotating-log-files [22:40:31] we use DailyRollingFileAppender [22:40:35] RollingFileAppender does this. You just need to set maxBackupIndex to the highest value for the backup file. [22:40:38] hah, indeed [22:40:40] so just need to migrate to RollingFileAppender [22:41:01] could save the extra cron, yes, nice. but doesn't explain the gzip issue [22:41:24] yeh [22:42:04] on your test instance, do you need to restart gerrit service to tell it about the new log dir [22:42:11] after you applied the change [22:42:16] yeh [22:42:24] but you did, and it didn't fix it?
[22:42:29] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.096 second response time [22:42:29] log4j1 does not support reloading configuration [22:42:37] it logs to the new location [22:42:40] just no compression [22:42:44] ok [22:42:57] so the cron that deletes the files and keeps 7 does not work as there won't be any files in there with .gz [22:43:07] log4j2 supports reloading though [22:48:33] i can confirm what you said, it seems that you would need log4j extras and i see the examples.. but our config doesn't have any of that [22:49:03] and his change just edits the file names in that config .. so .. eh.. [22:49:23] double check if the permissions are right for the new dir? [22:49:36] 10Operations, 10MediaWiki-extensions-Translate, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4223734 (10Addshore) With the comment above mentioning metawiki and Priv...
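The observation above, that the cleanup cron stops doing anything once the logs are no longer compressed, follows from its `-name "*.gz"` filter. The following Python sketch approximates the quoted `find ... -name "*.gz" -mtime +7 -delete` command to make that visible. It is an illustration only: `prune_old_gz` is a hypothetical name, and the age test assumes `-mtime +7` means "age in whole 24-hour periods is greater than 7".

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def prune_old_gz(log_dir, days=7, now=None):
    """Delete *.gz files under log_dir older than `days` whole days.

    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    deleted = []
    for path in sorted(Path(log_dir).rglob("*.gz")):
        if not path.is_file():
            continue
        # Age in whole days, fractional part discarded (assumed find semantics).
        age_days = int((now - path.stat().st_mtime) // SECONDS_PER_DAY)
        if age_days > days:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

An uncompressed `error_log.2018-05-01` never matches the `*.gz` pattern however old it is, so with compression broken the directory only grows.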
[22:49:55] http://2min2code.com/articles/log4j_intro/rolling_archiving_file_per_day_xml [22:51:29] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.095 second response time [22:51:49] RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational [22:54:56] !log twentyafterfour@tin Finished scap: testwikis wikis to 1.32.0-wmf.5 refs T191051 (duration: 76m 01s) [22:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:00] T191051: 1.32.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T191051 [22:55:20] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434611 [22:55:22] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434611 (owner: 1020after4) [22:56:55] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434611 (owner: 1020after4) [22:58:53] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.5 refs T191051 [22:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:49] !log train for group0 1.32.0-wmf.5 completed. Tune in tomorrow for more excitement! [22:59:52] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434611 (owner: 1020after4) [22:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180522T2300). 
[23:00:04] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:18] wow I finished the train with mere seconds to spare [23:00:27] :D [23:00:39] * addshore runs away before swat [23:01:08] twentyafterfour: gues we need to wait a bit before deploying? [23:04:58] here I am [23:05:40] btw my patch will do nothing until weekend dump, so no interaction with current train [23:08:22] (03CR) 10Dzahn: "we tried to solve the mystery why this change apparently breaks gzip'ping of the log files (and after reverting it immediately works again" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [23:10:03] so, is swat happening? [23:12:22] (03PS3) 10Paladox: Gerrit: Fix log4j rotating files [puppet] - 10https://gerrit.wikimedia.org/r/434605 [23:13:35] (03PS4) 10Paladox: Gerrit: Fix log4j rotating files [puppet] - 10https://gerrit.wikimedia.org/r/434605 [23:14:43] (03CR) 10Paladox: [C: 031] "This https://gerrit.wikimedia.org/r/#/c/434605/ should provided a work around solution as we carn't install log4j-extras without adding it" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [23:19:18] ok, looks like it's not happening :( [23:21:12] SMalyshev: I can deploy [23:21:36] MaxSem: ok, thanks, please do [23:23:16] SMalyshev: I don't see any code in this repo that uses this dblist... [23:23:33] MaxSem: it's used by category rdf dumps [23:23:44] it's in puppet [23:24:53] (03PS3) 10MaxSem: Add to the list all wikis except for private ones. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) (owner: 10Smalyshev) [23:26:03] (03CR) 10MaxSem: [C: 032] Add to the list all wikis except for private ones. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) (owner: 10Smalyshev) [23:27:20] (03Merged) 10jenkins-bot: Add to the list all wikis except for private ones. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) (owner: 10Smalyshev) [23:27:32] MaxSem: thanks [23:29:36] (03CR) 10jenkins-bot: Add to the list all wikis except for private ones. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) (owner: 10Smalyshev) [23:29:52] !log maxsem@tin Synchronized dblists/categories-rdf.dblist: https://gerrit.wikimedia.org/r/#/c/433740/ (duration: 01m 17s) [23:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:12] (03CR) 10Dzahn: [C: 032] icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) (owner: 10Herron)