[00:22:46] 06Operations, 10Traffic: Sort out vcl_deliver vs vcl_synth mess with v4 VCL - https://phabricator.wikimedia.org/T135696#2307560 (10BBlack) [00:39:04] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: puppet fail [01:01:33] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: puppet fail [01:08:04] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:13:08] preparing to upgrade phabricator, expect momentary downtime [01:15:02] !log Phabricator deployment T134443 starting momentarily. Downtime should be minimal but there will be a short interruption while the service restarts. [01:15:03] T134443: Next Phabricator Update - 2016-05-18 - https://phabricator.wikimedia.org/T134443 [01:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:16:51] yep, phab is down [01:18:47] I hate typos in topics [01:21:59] !log Phabricator upgrade completed and service restored. [01:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:22:25] Luke081515: All done [01:22:35] ok :) [01:28:23] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:53:03] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures [02:17:44] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [02:38:57] 06Operations, 10Traffic: Sort out vcl_deliver vs vcl_synth mess with v4 VCL - https://phabricator.wikimedia.org/T135696#2307697 (10BBlack) Actually it doesn't turn out to be as messy as it seems at first. It's mostly the analytics, debugging, etc headers in common code that need invocation in vcl_synth, but n... 
[02:45:10] (03PS1) 10BBlack: VCL: v4 deliver+synth refactoring [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) [02:47:01] (03CR) 10BBlack: [C: 031] config-geo: list all DCs in failover lists for completeness [dns] - 10https://gerrit.wikimedia.org/r/289433 (owner: 10Ema) [04:15:13] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 677 [04:30:13] RECOVERY - check_mysql on lutetium is OK: Uptime: 717798 Threads: 1 Questions: 11577875 Slow queries: 13731 Opens: 86332 Flush tables: 2 Open tables: 64 Queries per second avg: 16.129 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [06:11:08] !log completed rolling restart of Elasticsearch codfw for Java update (T135499) [06:11:09] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499 [06:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:24:59] 06Operations, 03Scap3: Scap3 calls the checker script before restarting the service, not able to restart a service if it's down. - https://phabricator.wikimedia.org/T135609#2307810 (10mobrovac) 05Open>03Resolved a:03mobrovac Thank you, @thcipriani, changing the stage makes it work correctly. 
[06:31:34] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:55] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:55] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:45] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:48:44] (03PS2) 10Muehlenhoff: Add fonts-smc (Malayalam) to image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/287181 (https://phabricator.wikimedia.org/T33950) [06:53:03] mobrovac: around? [06:53:34] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:44] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:13] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:30] kart_: yes [06:57:53] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:53] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:55] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 
10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2307853 (10KartikMistry) [07:02:58] mobrovac: https://gerrit.wikimedia.org/r/#/c/289421/ is OK? Will every update to cxserver/deploy also update node_modules [07:03:42] mobrovac: and what if we don't want node_modules to be update if time (like for small changes) [07:04:06] !log installed chromium security updates on osmium [07:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add fonts-smc (Malayalam) to image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/287181 (https://phabricator.wikimedia.org/T33950) (owner: 10Muehlenhoff) [07:10:11] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2307872 (10jcrespo) I suggested this request so their queries can be done on analytics s... [07:11:59] 06Operations, 10Wikimedia-General-or-Unknown, 07I18n, 13Patch-For-Review, 07Upstream: Update Malayalam fonts packages - https://phabricator.wikimedia.org/T33950#2307873 (10MoritzMuehlenhoff) 05Open>03Resolved The patch has been merged and enabled on the image scalers. 
[07:16:01] mobrovac: meanwhile, I saw citoid updates node_modules with every patch :) [07:17:30] !log performing schema change on s4 T130692 [07:17:31] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [07:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:20:04] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:29:06] kart_: if you use --force it will update node_modules every time, if not, only when package.json changes [07:29:53] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 678.88 seconds [07:30:53] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:32:53] <_joe_> mobrovac: you mean that powerful information can't be retrieved reading the effing manual? [07:33:43] no no, it's in there _joe_ [07:33:53] but you know, you have to find the link, and read it [07:33:58] <_joe_> oh so maybe reading the manual is what people need to do. 
[07:34:03] oh, click the link too [07:40:50] you can ignore s4, for some reason toku cannot do that alter online, but dbstores are not in production [07:51:46] (03CR) 10Ladsgroup: ores: Use the new service::uwsgi define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [08:00:44] (03PS3) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [08:01:05] (03CR) 10jenkins-bot: [V: 04-1] keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [08:04:07] mobrovac: well, https://wikitech.wikimedia.org/wiki/Services/Deployment#Regular_Deployment is the link I have :) [08:04:46] (03PS1) 10Giuseppe Lavagetto: Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 [08:06:52] !log rolling restart of hhvm on mediawiki in eqiad to pick up expat security update [08:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:04] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.22 seconds [08:20:09] (03PS4) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [08:27:35] !log restarting apache on ytterbium (hosting gerrit.wikimedia.org) for security update [08:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:35] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, 05codfw-rollout: [RFC] Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2308014 (10jcrespo) [08:31:20] (03CR) 10Filippo Giunchedi: "LGTM, I'm still investigating what's up with periodic queueing/dropping of cassandra metrics in 
https://phabricator.wikimedia.org/T135385 " [puppet] - 10https://gerrit.wikimedia.org/r/289564 (owner: 10Alexandros Kosiaris) [08:31:24] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, 05codfw-rollout: [RFC] Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2308026 (10jcrespo) [08:32:55] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures [08:33:54] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, 05codfw-rollout: [RFC] Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2308030 (10jcrespo) Only adding #performance because it is (... [08:35:09] ^ _joe_, I've created this first to start thinking about orchestration of mediawiki, first with a small scope [08:35:42] !log gallium: purging old Linux kernel packages (~2.2Gbytes) [08:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:06] <_joe_> jynus: I'll take a look soon [08:46:44] 07Puppet, 10Phragile, 06TCB-Team, 07Composer: Puppet fail due to composer install on Phragile instance - https://phabricator.wikimedia.org/T133967#2308032 (10Tobi_WMDE_SW) p:05Triage>03High [08:47:01] (03PS2) 10Giuseppe Lavagetto: Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 [08:53:15] errors on neon [08:53:57] "Error: Could not find any hostgroup matching 'labs_eqiad'" [08:54:46] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2308059 (10elukey) Re-checked Ganglia and mem_report for mc1009 seems to have reached a stable state, but the memory allocated is around 44GB rather than 80+ li... 
[08:57:21] jynus: iirc there was a patch to add that ganglia cluster, likely missing the icinga part heh [09:00:04] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:00:38] (03PS3) 10Giuseppe Lavagetto: Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (https://phabricator.wikimedia.org/T132032) [09:08:51] !log restarting apache on silver (hosting wikitech) for security update [09:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:04] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 13Patch-For-Review: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2308097 (10Joe) The package is ready and works reasonably, I will wait before merging the patch for a final style check but lintian is green and I smoke-... [09:11:12] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 13Patch-For-Review: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2308098 (10Joe) 05Open>03Resolved [09:11:14] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2308099 (10Joe) [09:15:25] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 400 (expecting: 200) [09:15:34] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:33] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 67874 bytes in 0.151 second response time [09:22:02] !log restarting apache on neon (hosting icinga) for security update [09:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:20] !log updated cxserver to 4aaec58 [09:23:23] 06Operations, 
06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2308124 (10elukey) Snapshots: [[ https://phabricator.wikimedia.org/P3129 | mc1007_stats_1463649014 ]] [[ https://phabricator.wikimedia.org/P3130 | mc1007_stats... [09:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:24:32] 06Operations, 10DBA: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079#2308127 (10Volans) a:03Volans [09:25:48] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, and 2 others: Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2308132 (10Danny_B) [09:28:12] (03CR) 10Filippo Giunchedi: [C: 04-1] "some comments on the python part, not very familiar with the puppet part tho" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [09:30:13] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2033 [09:31:48] (03PS2) 10Giuseppe Lavagetto: Add arbcom-nl.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/289163 (https://phabricator.wikimedia.org/T135480) [09:33:48] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2308142 (10jcrespo) [09:33:50] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2308141 (10jcrespo) [09:34:41] (03CR) 10Giuseppe Lavagetto: [C: 032] Add arbcom-nl.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/289163 (https://phabricator.wikimedia.org/T135480) (owner: 10Giuseppe Lavagetto) [09:35:53] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - 
https://phabricator.wikimedia.org/T111992#2308146 (10jcrespo) [09:36:01] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2308149 (10Volans) a:05jcrespo>03Volans With the new coredb server of T135253 we can re-distribute the load, here my proposal for the assignement of servers:... [09:36:32] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2293311 (10Volans) [09:36:34] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2308153 (10Volans) [09:36:44] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2308155 (10Joe) 05Open>03Resolved [09:36:50] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2308156 (10jcrespo) [09:36:52] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2308157 (10jcrespo) [09:36:56] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Joe) The mobile site now works. 
[09:37:56] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2308160 (10jcrespo) [09:37:58] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#1622158 (10jcrespo) [09:38:34] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2308161 (10Joe) p:05Triage>03Normal [09:38:48] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2308162 (10Joe) p:05Triage>03Normal [09:39:24] !log Restarting Jenkins [09:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:40:13] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1285489 Threads: 1 Questions: 25070923 Slow queries: 7574 Opens: 944 Flush tables: 2 Open tables: 575 Queries per second avg: 19.503 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:40:15] forget that, long running jobs in progress [09:46:09] 06Operations, 10DBA: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2269287 (10jcrespo) a:03jcrespo Doing it now. 
[09:46:15] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2308172 (10Joe) p:05Triage>03Normal [09:46:48] !log restarting apache2 on strontium (will impose a few temporary puppet failures) [09:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:23] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2307853 (10Joe) We will need manager's approval for this request for Kartik. In the mea... [09:51:23] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: puppet fail [09:51:25] !log restarting apache2 on palladium (will impose a few temporary puppet failures) [09:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:33] restarting jenkins again [09:52:14] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail [09:52:25] PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: puppet fail [09:52:29] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2308184 (10Sjoerddebruin) Thanks! 
[09:52:45] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures [09:54:34] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 1 failures [09:54:43] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:44] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: Puppet has 8 failures [09:55:44] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Puppet has 7 failures [09:55:44] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: puppet fail [09:55:44] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: Puppet has 3 failures [09:55:45] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail [09:55:45] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: puppet fail [09:55:53] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Puppet has 4 failures [09:55:54] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail [09:55:54] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: puppet fail [09:55:54] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 8 failures [09:55:55] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Puppet has 28 failures [09:56:04] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [09:56:04] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Puppet has 23 failures [09:56:14] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: Puppet has 10 failures [09:56:18] (03PS1) 10Jcrespo: Depool db1033 (s7 old master) & db1029 (x1-slave) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289620 (https://phabricator.wikimedia.org/T112079) [09:56:24] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Puppet has 2 failures [09:56:24] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 8 failures [09:56:33] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet has 
3 failures [09:56:33] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Puppet has 3 failures [09:56:44] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Puppet has 29 failures [09:56:44] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet has 5 failures [09:56:44] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Puppet has 14 failures [09:56:44] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: puppet fail [09:56:44] PROBLEM - puppet last run on mw2085 is CRITICAL: CRITICAL: Puppet has 10 failures [09:56:44] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: puppet fail [09:56:45] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: puppet fail [09:56:45] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: puppet fail [09:56:45] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Puppet has 8 failures [09:56:46] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 2 failures [09:56:53] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 9 failures [09:56:54] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail [09:56:54] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Puppet has 7 failures [09:56:54] (03CR) 10Volans: [C: 031] Depool db1033 (s7 old master) & db1029 (x1-slave) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289620 (https://phabricator.wikimedia.org/T112079) (owner: 10Jcrespo) [09:56:54] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 8 failures [09:56:54] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 30 failures [09:56:54] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 54 failures [09:56:54] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: puppet fail [09:56:55] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 3 failures [09:56:55] PROBLEM - puppet 
last run on ms-fe1003 is CRITICAL: CRITICAL: puppet fail [09:56:56] PROBLEM - puppet last run on db2001 is CRITICAL: CRITICAL: Puppet has 8 failures [09:57:03] PROBLEM - puppet last run on db2004 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:04] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: puppet fail [09:57:05] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Puppet has 5 failures [09:57:05] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail [09:57:13] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 3 failures [09:57:13] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Puppet has 6 failures [09:57:14] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 25 failures [09:57:14] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 15 failures [09:57:23] moritzm: just a few? :-) [09:57:23] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: puppet fail [09:57:24] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 20 failures [09:57:24] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: puppet fail [09:57:34] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Puppet has 3 failures [09:57:34] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Puppet has 10 failures [09:57:34] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: Puppet has 2 failures [09:57:34] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 25 failures [09:57:35] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Puppet has 8 failures [09:57:35] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: puppet fail [09:57:35] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: puppet fail [09:57:35] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail [09:57:35] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet has 3 failures 
[09:57:36] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Puppet has 9 failures [09:57:36] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 2 failures [09:57:44] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 4 failures [09:57:44] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 14 failures [09:57:44] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 28 failures [09:57:53] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Puppet has 24 failures [09:57:54] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 38 failures [09:57:54] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 6 failures [09:57:54] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 8 failures [09:57:55] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:03] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet has 6 failures [09:58:03] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 4 failures [09:58:10] volans: out of 1200 it's still just a few :-) [09:58:13] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 5 failures [09:58:34] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 19 failures [09:58:44] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 5 failures [09:58:44] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:45] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 6 failures [09:58:45] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 12 failures [09:58:54] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 11 failures [09:58:54] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:03] PROBLEM - puppet last run on mw2106 is 
CRITICAL: CRITICAL: Puppet has 2 failures [09:59:04] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 7 failures [09:59:08] (03CR) 10Jcrespo: [C: 032] Depool db1033 (s7 old master) & db1029 (x1-slave) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289620 (https://phabricator.wikimedia.org/T112079) (owner: 10Jcrespo) [09:59:24] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 16 failures [09:59:25] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 15 failures [09:59:25] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 4 failures [09:59:33] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 16 failures [09:59:34] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 4 failures [09:59:34] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 11 failures [09:59:43] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 22 failures [09:59:43] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 9 failures [09:59:44] PROBLEM - puppet last run on mw2048 is CRITICAL: CRITICAL: Puppet has 7 failures [09:59:44] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: Puppet has 13 failures [09:59:54] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 25 failures [09:59:54] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 6 failures [09:59:54] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 5 failures [09:59:55] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 5 failures [09:59:55] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 10 failures [10:00:03] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: Puppet has 5 failures [10:00:04] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 4 failures [10:00:08] hasharyikes [10:00:12] wtf [10:00:15] PROBLEM - puppet 
last run on mw1198 is CRITICAL: CRITICAL: Puppet has 19 failures [10:00:51] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2308206 (10Gilles) [10:00:55] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 12 failures [10:01:24] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: Puppet has 13 failures [10:02:05] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:02:23] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:02:29] sync-masters taking a lot of time, problems in mira? [10:03:00] it worked, finally [10:03:09] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1033 (s7 old master) & db1029 (x1-slave) for maintenance (duration: 02m 05s) [10:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:38] also, there is one, very slow application server, I will find you one time soon [10:10:13] <_joe_> sjoerddebruin: that's just a puppetmaster failing briefly [10:10:45] ;) [10:11:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall liking this. a few inline comments." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [10:14:54] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:18:04] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:18:23] RECOVERY - puppet last run on sca2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:18:33] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:18:34] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:18:34] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:18:44] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:18:53] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:19:14] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:19:15] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:19:16] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:19:16] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:19:16] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:19:23] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:19:24] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is 
currently enabled, last run 35 seconds ago with 0 failures [10:19:25] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:19:34] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:19:34] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:19:34] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:19:34] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:19:44] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:19:45] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:20:14] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:20:14] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:20:14] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:20:15] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:20:29] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:20:29] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:20:30] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:30] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[10:20:30] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:20:30] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:20:31] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:31] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:31] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:20:33] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:37] \o/ [10:20:43] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:43] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:20:44] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:20:45] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:53] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:04] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:04] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:13] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:13] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:21:14] RECOVERY - puppet last run on wtp2009 is OK: OK: 
Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:21:14] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:21:24] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:21:25] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:21:25] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:21:25] RECOVERY - puppet last run on mw2048 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [10:21:34] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:21:34] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:21:34] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:34] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:43] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:21:43] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:43] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:21:44] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:21:44] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:21:44] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[10:21:44] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:21:54] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:21:54] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:21:54] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:03] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:03] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:24] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:33] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:34] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:22:34] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:43] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:22:43] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:44] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:44] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:53] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:54] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet 
is currently enabled, last run 41 seconds ago with 0 failures [10:22:54] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:54] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:54] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:54] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:55] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:55] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:04] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:23:04] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:13] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:14] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:23:14] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:15] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:23] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:23:24] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:33] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:33] RECOVERY - puppet 
last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:34] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:23:34] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:43] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:44] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:23:45] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:45] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:23:54] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:04] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:05] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:44] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:44] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:03] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2308272 (10Arrbee) This is an approved request for Kartik. Thanks. 
[10:30:09] (03PS1) 10Giuseppe Lavagetto: monitoring: add the two new clusters labs and labvirt [puppet] - 10https://gerrit.wikimedia.org/r/289624 [10:33:31] (03PS2) 10Giuseppe Lavagetto: monitoring: add the two new clusters labs and labvirt [puppet] - 10https://gerrit.wikimedia.org/r/289624 [10:33:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] monitoring: add the two new clusters labs and labvirt [puppet] - 10https://gerrit.wikimedia.org/r/289624 (owner: 10Giuseppe Lavagetto) [10:38:13] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-sections-lead returned the unexpected status 400 (expecting: 200) [10:40:13] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [10:41:36] !log Disable puppet on db1029 for reimaging T112079 [10:41:37] T112079: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079 [10:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:43:52] (03PS1) 10Volans: MariaDB: Reimage db1029 to Jessie and MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/289627 (https://phabricator.wikimedia.org/T112079) [10:44:04] (03CR) 10Muehlenhoff: [C: 04-1] "Two typos, but looks fine otherwise" (032 comments) [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (https://phabricator.wikimedia.org/T132032) (owner: 10Giuseppe Lavagetto) [10:45:34] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1033.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1033.eqiad.wmnet (111 Connection refused) [10:45:36] (03CR) 10Giuseppe Lavagetto: Add systemd support for the jessie build (032 comments) 
[debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (https://phabricator.wikimedia.org/T132032) (owner: 10Giuseppe Lavagetto) [10:46:09] s7 critical? [10:46:24] (03PS4) 10Giuseppe Lavagetto: Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 [10:46:33] ah, not schema-change related, now related to db1033 maintenance [10:47:57] _joe_: if you're interested in fixing this properly... :) [10:48:05] there is a patch of mine in the source tree [10:48:11] that probably needs partial reverting [10:48:21] so that we can run it in the foreground with type=simple [10:48:34] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [10:48:44] !log db1033 stop, backup and reimage T134555 [10:48:45] T134555: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555 [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:22] !log db1029 stop, backup and reimage T112079 [10:49:23] T112079: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079 [10:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:25] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [10:53:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:54:33] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[10:54:33] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [10:54:44] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 400 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Tes [10:56:10] (03CR) 10Muehlenhoff: [C: 04-1] "This also needs --with systemd in the debian/rules file to get the unit installed and enabled." (031 comment) [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (owner: 10Giuseppe Lavagetto) [10:57:04] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-remaining/{title} (retrieve remaining sections of en.wp main page via mobile-sections-remaining) is CRITICAL: Test retrieve remaining sections of en.wp main page via mobile-sections-remaining returned the unexpected status 400 (expecting: 200) [10:59:14] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [11:01:24] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [11:02:54] <_joe_> what's up with mobileapps? [11:04:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:05:08] (03CR) 10Jcrespo: [C: 031] "Looks good, one coredb less." [puppet] - 10https://gerrit.wikimedia.org/r/289627 (https://phabricator.wikimedia.org/T112079) (owner: 10Volans) [11:05:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[11:07:35] (03PS5) 10Giuseppe Lavagetto: Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 [11:09:30] !log dropped negative values from mc_get_hits_rate ganglia metrics for eqiad memcached hosts by running https://phabricator.wikimedia.org/P3138 [11:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:41] (03PS1) 10Jcrespo: Upgrade db1033 to jessie and mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/289631 (https://phabricator.wikimedia.org/T134555) [11:10:57] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (owner: 10Giuseppe Lavagetto) [11:12:08] (03CR) 10Jcrespo: "Last coredb with 5.5 (only misc is pending)." [puppet] - 10https://gerrit.wikimedia.org/r/289631 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [11:12:10] (03CR) 10Volans: [C: 031] "LGTM, another coredb less!" [puppet] - 10https://gerrit.wikimedia.org/r/289631 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [11:12:46] (03CR) 10Jcrespo: [C: 032] Upgrade db1033 to jessie and mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/289631 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [11:17:13] apergos: hi. did you get chance to test dump script for ContentTranslation? [11:18:22] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2308421 (10elukey) Also preliminary stats for mc1009 (the snapshot to be used with mc1010,mc1007 will be taken tomorrow since mc1009 has been running one day le... 
[11:20:15] (03PS2) 10Volans: MariaDB: Reimage db1029 to Jessie and MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/289627 (https://phabricator.wikimedia.org/T112079) [11:22:15] (03CR) 10Volans: [C: 032] MariaDB: Reimage db1029 to Jessie and MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/289627 (https://phabricator.wikimedia.org/T112079) (owner: 10Volans) [11:26:00] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [11:26:44] !log restarting salt-master on neodymium [11:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2308427 (10ori) Thanks for this. I am happy with the outcome. I do think we need to make an effort to be friendlier in access request ti... [11:36:14] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2308431 (10ori) For convenience, direct link to a Ganglia view showing get hits rate on the relevant hosts: https://ganglia.wikimedia.org/latest/graph_all_peri... 
[11:42:59] (03PS1) 10Muehlenhoff: Update server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/289633 [11:45:47] (03PS1) 10Filippo Giunchedi: svc: add graphite LVS addresses [dns] - 10https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451) [11:48:29] (03PS1) 10Filippo Giunchedi: lvs: add graphite service [puppet] - 10https://gerrit.wikimedia.org/r/289636 (https://phabricator.wikimedia.org/T85451) [11:48:31] (03PS1) 10Filippo Giunchedi: graphite: add realserver class [puppet] - 10https://gerrit.wikimedia.org/r/289637 (https://phabricator.wikimedia.org/T85451) [11:50:33] (03PS1) 10Muehlenhoff: Mark old esams caches as to-be-decommed [puppet] - 10https://gerrit.wikimedia.org/r/289638 [11:51:59] (03PS2) 10Muehlenhoff: Update server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/289633 [11:52:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/289633 (owner: 10Muehlenhoff) [11:53:41] (03PS1) 10Hashar: contint: bump pip 7.0.1 -> 8.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/289639 [12:00:24] (03CR) 10Hashar: "puppet managed to upgrade it." 
[puppet] - 10https://gerrit.wikimedia.org/r/289639 (owner: 10Hashar) [12:11:38] 07Puppet, 10Phragile, 06TCB-Team, 07Composer, 03TCB-Team-Sprint-2016-05-19: Puppet fail due to composer install on Phragile instance - https://phabricator.wikimedia.org/T133967#2308475 (10Tobi_WMDE_SW) [12:19:47] kart_: I've been setting things up for that the last 2 days [12:21:37] getting the script I use for other such jobs fixed up to handle this case too [12:25:37] (03CR) 10Alexandros Kosiaris: ores: Use the new service::uwsgi define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [12:26:11] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2308521 (10elukey) @Eevans I stole the 2.1.13 debs in your home dir on restbase2008 and downgraded aqs100[456... [12:28:27] !log restarted hue on analytics1027 for security upgrades [12:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:36:14] 06Operations, 10DBA, 10Phabricator, 10Phabricator-Upstream: Project icon files are missing - https://phabricator.wikimedia.org/T128160#2065709 (10Qgil) Asking in relation to T128160: do we know how big is the impact / the damage? There is no question that this is a bug, but I wonder whether it is relevant... [12:37:55] 06Operations, 10netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2308571 (10faidon) 05stalled>03Resolved And here's the formal RFO: {F4030747} [12:40:15] 06Operations, 10netops: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2308582 (10faidon) The case is making progress. The latest are: - We should reboot FPC 0 and see if that will stop the error messages. - If not, we can RMA. 
[12:41:41] (03CR) 10Ladsgroup: ores: Use the new service::uwsgi define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [12:44:51] 06Operations, 10DBA, 10Phabricator, 10Phabricator-Upstream: Project icon files are missing - https://phabricator.wikimedia.org/T128160#2308595 (10Danny_B) Assuming it's standalone bug - let it be, not worth to spend WMF resources on it. Assuming it's some more global issue with db - definitely look at it t... [12:45:00] !log restarted oozie on analytics1003 for security upgrades [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:09] 06Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#2308599 (10MoritzMuehlenhoff) [12:47:52] (03PS2) 10BBlack: Mark old esams caches as to-be-decommed [puppet] - 10https://gerrit.wikimedia.org/r/289638 (owner: 10Muehlenhoff) [12:49:26] (03CR) 10BBlack: [C: 032] Mark old esams caches as to-be-decommed [puppet] - 10https://gerrit.wikimedia.org/r/289638 (owner: 10Muehlenhoff) [12:52:30] (03PS1) 10BBlack: cp104[34] role spare for decom - T133614 [puppet] - 10https://gerrit.wikimedia.org/r/289644 [12:52:54] (03CR) 10BBlack: [C: 032 V: 032] cp104[34] role spare for decom - T133614 [puppet] - 10https://gerrit.wikimedia.org/r/289644 (owner: 10BBlack) [12:53:59] (03PS1) 10Huji: Add 'deletedhistory' right to "eliminator" user group on fa.wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289645 (https://phabricator.wikimedia.org/T135725) [12:54:35] !log bounce carbon-c-relay on graphite1001, run with debug version [12:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:40] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:58:10] (03CR) 10Urbanecm: [C: 031] "Seems good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/289645 (https://phabricator.wikimedia.org/T135725) (owner: 10Huji) [13:01:39] (03CR) 10Hashar: [C: 031] openstack: allow unprivileged users to access nova logs [puppet] - 10https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:02:30] (03CR) 10Hashar: [C: 031] openstack::nova::api: allow users to access api logs [puppet] - 10https://gerrit.wikimedia.org/r/289371 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:06:08] (03PS50) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [13:07:30] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: puppet fail [13:07:59] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2308776 (10elukey) @Cmjohnson: Hi Chris, any news about when the disk could be replaced? Thanks a lot! (Sorry for the ping) [13:08:19] (03CR) 10Hashar: [C: 04-1] "I have added that with I159f20df2a44dc39a52fe15211f24663be54c671 so we can maintain documentation for a wild range of scripts and have the" [puppet] - 10https://gerrit.wikimedia.org/r/289351 (owner: 10Dzahn) [13:12:50] why does https://www.mediawiki.org/static/current/extensions/ point to wmf.1 when mediawiki.org is at wmf.2 and this expected? [13:16:22] Nikerabbit: it is switched later [13:16:29] i.e. tonight with group2 iirc [13:17:49] hashar: can we configure ULS so that it loads fonts from the correct version? 
[13:18:07] right now we are serving 404s for woff2 versions [13:22:40] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [13:26:13] (03PS1) 10Nikerabbit: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 [13:26:16] (03PS1) 10Jforrester: Drop already-enabled VisualEditorNewAccountEnableProportion wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 [13:30:14] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2308897 (10elukey) I followed @fgiunchedi's advice and had a chat with @ema about this. His code updates initramfs only after the first time that puppet runs, meanwhi... [13:30:49] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:31:29] !log reboot labstore1003 kernel upgrade [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:03] (03CR) 10Dereckson: [C: 04-1] "Small style issue." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289645 (https://phabricator.wikimedia.org/T135725) (owner: 10Huji) [13:39:01] (03PS3) 10Andrew Bogott: openstack: allow unprivileged users to access nova logs [puppet] - 10https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:41:32] (03CR) 10Andrew Bogott: [C: 032] openstack: allow unprivileged users to access nova logs [puppet] - 10https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:42:43] (03PS3) 10Andrew Bogott: openstack::nova::api: allow users to access api logs [puppet] - 10https://gerrit.wikimedia.org/r/289371 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:43:09] 07Blocked-on-Operations, 06Operations, 06Services, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#2308959 (10fgiunchedi) the next series of patches uses LVS instead to perform load-balancing among graphite machines, clients would... [13:43:48] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [13:44:27] is that fr, ganeti? [13:44:39] grr. [13:44:51] betelgeuse is fr, apparently it didn't survive a reboot [13:44:55] <_joe_> andrewbogott: are you merging it? [13:45:08] 06Operations, 10Beta-Cluster-Infrastructure, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2308967 (10hashar) Well / is reported as full by the filesystem: ``` # df -h / Filesystem Size Used Avail Use% Mounted on /dev/vda3 19G 18G... [13:45:26] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2308970 (10mark) a:05mark>03None These are 2011 hosts - definitely trash. [13:45:47] _joe_: trying :) did you doit already? 
[13:46:15] (03CR) 10Andrew Bogott: [C: 032] openstack::nova::api: allow users to access api logs [puppet] - 10https://gerrit.wikimedia.org/r/289371 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:46:27] 06Operations, 10ops-esams, 06DC-Ops, 10Traffic, 10hardware-requests: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#2308975 (10mark) Yes, I'll remove this in the near future - likely before we refresh the other (newer) low range cp* hosts. [13:47:02] (03PS3) 10Andrew Bogott: admin: add the releng team to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:49:01] (03CR) 10Andrew Bogott: [C: 032] admin: add the releng team to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [13:50:29] hashar: want to see if your login on labnet1002 works now? [13:50:43] 06Operations, 10hardware-requests: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679#2308988 (10mark) a:05mark>03RobH >>! In T120679#1888235, @RobH wrote: > Mark, > > I'd like to get your blanket approval for the decommission of a large number of o... [13:52:45] andrewbogott: labnet1002 works (labnet1001 does not but that is probably expected) [13:52:57] hashar: I just haven't done the puppet run on 1001 yet [13:53:09] but anyway 1001 is the spare at the moment, 1002 has the good stuff on it [13:53:22] I can tail logs!!!! [13:53:35] cool [13:54:25] we will want to add legoktm and jzerebecki as well. 
They are not in RelEng but are definitely active on CI front [13:55:19] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [13:55:35] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2309017 (10hashar) I can log in labnet1002 and tail files un... [13:55:42] not sure whether it needs another access request or if we can just be bold :D [13:56:14] <_joe_> the latter ffs [13:56:17] <_joe_> the latter [13:58:33] !log disable puppet on maps-test200{1,2,3,4} for enabling cassandra metrics collection selectively [13:58:34] (03CR) 10Hashar: [C: 04-1] "Cherry picked on CI puppet master and deployed on CI slaves." [puppet] - 10https://gerrit.wikimedia.org/r/289639 (owner: 10Hashar) [13:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:54] (03PS2) 10Alexandros Kosiaris: maps: Specify cassandra graphite host [puppet] - 10https://gerrit.wikimedia.org/r/289564 [13:59:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Specify cassandra graphite host [puppet] - 10https://gerrit.wikimedia.org/r/289564 (owner: 10Alexandros Kosiaris) [14:00:41] (03PS1) 10Hashar: admin: add jzerebecki/legoktm to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289655 (https://phabricator.wikimedia.org/T133992) [14:00:47] andrewbogott: _joe_ ^^^:) [14:02:01] (03PS1) 10Huji: Only bureaucrats should be able to grant the "eliminator" right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289656 (https://phabricator.wikimedia.org/T135736) [14:02:10] !log enabled and ran puppet on maps-test2001 [14:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:28] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - 
https://phabricator.wikimedia.org/T135700#2309076 (10Joe) [14:03:36] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (10Joe) p:05Triage>03Normal [14:03:56] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2309081 (10hashar) I have added a lame note on wikitech Node... [14:04:05] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (10Joe) p:05Normal>03Low [14:04:42] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2307710 (10Joe) a:03Joe [14:07:13] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2309097 (10Joe) a:03Joe [14:09:22] (03PS1) 10Giuseppe Lavagetto: admin: add jzerebecki and legoktm to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289657 (https://phabricator.wikimedia.org/T133992) [14:09:33] (03PS1) 10Volans: Repool db1029 (x1) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289658 (https://phabricator.wikimedia.org/T112079) [14:10:07] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2309108 (10hashar) I have kept this one open in case #traffic want to investigate whether it can be a problem in production. For beta, the resta... 
[14:11:11] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 07Elasticsearch: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499#2300880 (10Gehel) Full restart completed for both eqiad and codfw. [14:15:11] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: add jzerebecki and legoktm to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289657 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [14:16:27] 06Operations, 10Ops-Access-Requests: Requesting access to wmf ldap group for Nschaaf - https://phabricator.wikimedia.org/T135738#2309140 (10schana) [14:17:01] (03CR) 10Volans: [C: 032] Repool db1029 (x1) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289658 (https://phabricator.wikimedia.org/T112079) (owner: 10Volans) [14:17:38] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2309155 (10Joe) 05Open>03Resolved [14:17:39] (03Merged) 10jenkins-bot: Repool db1029 (x1) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289658 (https://phabricator.wikimedia.org/T112079) (owner: 10Volans) [14:18:57] !log enable puppet on maps-test200{2,3,4}. 
[14:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:16] <_joe_> sorry hashar I didn't see your commit :/ [14:19:18] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf ldap group for Nschaaf - https://phabricator.wikimedia.org/T135738#2309175 (10Krenair) [14:20:14] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1029 (x1) with low weight - T112079 (duration: 00m 40s) [14:20:15] T112079: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079 [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:41] (03Abandoned) 10Giuseppe Lavagetto: admin: add jzerebecki/legoktm to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289655 (https://phabricator.wikimedia.org/T133992) (owner: 10Hashar) [14:28:59] 06Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#2309190 (10Joe) p:05Triage>03Normal [14:30:30] 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2309194 (10Joe) p:05Triage>03Normal a:03Joe [14:31:54] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, and 2 others: Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2309201 (10Joe) p:05Triage>03Normal [14:32:41] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, and 2 others: Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2308014 (10Joe) I think the problem is, more in general: how can we pas... 
[14:42:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf ldap group for Nschaaf - https://phabricator.wikimedia.org/T135738#2309213 (10Joe) p:05Triage>03Normal a:03Joe [14:44:12] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2309216 (10Eevans) >>! In T95253#2308521, @elukey wrote: > @Eevans I stole the 2.1.13 debs in your home dir o... [14:46:07] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2309238 (10Eevans) >>! In T95253#2309216, @Eevans wrote: > @elukey Cool! Remember to upgrade the existing ma... [14:48:29] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2309255 (10hashar) Thank you everyone, very helpful! [14:50:03] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:50:24] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, and 2 others: Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2309256 (10jcrespo) > I would suggest we give it a try first This is t... [14:51:04] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2309257 (10Gehel) base osm import completed. Postgres slaves are still catching up. This was done with much trial and error (I'm still... 
[14:51:10] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf ldap group for Nschaaf - https://phabricator.wikimedia.org/T135738#2309258 (10Joe) 05Open>03Resolved [14:54:59] (03CR) 10Andrew Bogott: [C: 031] "This looks fine to me -- has it been tested in labtest?" [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [14:55:36] (03CR) 10Andrew Bogott: "Um... ok, obviously you answer that question in the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [14:55:41] (03PS6) 10Andrew Bogott: Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T1500). [15:00:04] Krenair: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:18] (03CR) 10Andrew Bogott: [C: 032] Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:01:14] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2309342 (10Yurik) I think you did not import custom functions as described in https://github.com/kartotherian/osm-bright.tm2source [15:03:39] I'm around to SWAT. 
Ping me when you're around Krenair [15:03:44] thcipriani, ping [15:03:48] :) [15:07:04] (03PS2) 10BBlack: VCL: v4 deliver+synth refactoring [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) [15:11:37] (03PS1) 10BBlack: Revert "cache_misc: remove all CL-sensitive stream/pass logic" [puppet] - 10https://gerrit.wikimedia.org/r/289669 [15:11:43] (03PS2) 10BBlack: Revert "cache_misc: remove all CL-sensitive stream/pass logic" [puppet] - 10https://gerrit.wikimedia.org/r/289669 [15:12:56] (03CR) 10Yuvipanda: [C: 04-1] "IMO this is a feature rather than a bug. I don't think this script should fail silently, since that indicates anomalous behavior somewhere" [puppet] - 10https://gerrit.wikimedia.org/r/286263 (https://phabricator.wikimedia.org/T133946) (owner: 10Alex Monk) [15:15:02] (03CR) 10BBlack: [C: 032] Revert "cache_misc: remove all CL-sensitive stream/pass logic" [puppet] - 10https://gerrit.wikimedia.org/r/289669 (owner: 10BBlack) [15:16:00] !log Restarted zuul-merger daemons on both gallium and scandium : file descriptors leaked [15:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:36] (03CR) 10Alex Monk: "Okay. Andrew, do you agree? If so let's abandon this and close the task." [puppet] - 10https://gerrit.wikimedia.org/r/286263 (https://phabricator.wikimedia.org/T133946) (owner: 10Alex Monk) [15:17:32] (03CR) 10Andrew Bogott: "yeah, sounds reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/286263 (https://phabricator.wikimedia.org/T133946) (owner: 10Alex Monk) [15:18:01] (03Abandoned) 10Alex Monk: labs IP aliasing: Print error and continue when not able to get instances for a project [puppet] - 10https://gerrit.wikimedia.org/r/286263 (https://phabricator.wikimedia.org/T133946) (owner: 10Alex Monk) [15:19:22] hmmm. Is there some reason we have to bump the VisualEditor extension submodule manually? 
I have a vague memory of this, and the update to wmf.2 didn't come down :\ [15:20:04] Yes [15:20:26] https://wikitech.wikimedia.org/wiki/How_to_deploy_code/Core_submodule_update [15:21:40] kk, I'll make the patch. [15:21:45] thanks [15:25:34] (03PS1) 10Alexandros Kosiaris: git-review: Disabling rebasing by default [puppet] - 10https://gerrit.wikimedia.org/r/289672 [15:25:38] 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2309453 (10Joe) The upstream cassandra repo is now at 2.1.14, is that still ok @Eevans ? [15:26:50] <_joe_> urandom: is 2.1.14 ok as well? [15:27:36] _joe_: we need 2.1.13 [15:27:43] <_joe_> sigh [15:27:45] is that automatically updated? [15:27:57] <_joe_> who did upload the 2.2 packages by hand? [15:27:59] <_joe_> yes [15:28:03] (03PS2) 10Ema: config-geo: list all DCs in failover lists for completeness [dns] - 10https://gerrit.wikimedia.org/r/289433 [15:28:04] 2.2? [15:28:13] * urandom is confused [15:29:09] <_joe_> you said we had 2.2.6 and we need 2.1.12 [15:29:14] <_joe_> *13 [15:29:27] the repo contains 2.1.14, and we need 2.1.13 [15:29:29] <_joe_> now, our reprepro is set up to track cassandra 2.1 upstream [15:29:47] <_joe_> urandom: oh I see [15:29:47] yeah, that's not ideal [15:30:42] (03CR) 10Ema: [C: 032 V: 032] config-geo: list all DCs in failover lists for completeness [dns] - 10https://gerrit.wikimedia.org/r/289433 (owner: 10Ema) [15:31:13] 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2309469 (10Joe) 05Open>03Resolved [15:31:49] <_joe_> urandom: {{done}} [15:31:57] _joe_: thanks! 
[15:38:31] (03PS1) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [15:39:27] (03PS2) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [15:41:57] !log thcipriani@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: SWAT: [[gerrit:289587|Debug log strange-looking ETags being sent to RB]] (duration: 00m 44s) [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:14] 06Operations, 07Puppet, 10cassandra: Pin Cassandra package version in puppet - https://phabricator.wikimedia.org/T135749#2309489 (10Joe) [15:42:15] ^ Krenair sync'd for wmf.2 [15:42:32] thanks [15:42:50] zuul still cranking along for wmf.1 [15:43:38] PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [15:46:21] ottomata: --^ [15:46:37] leader election I guess? :) [15:46:45] still good but worth to check [15:46:59] hm! [15:47:25] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=34&fullscreen [15:47:50] ja lookin [15:48:18] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/extensions/VisualEditor/ApiVisualEditor.php: SWAT: [[gerrit:289586|Debug log strange-looking ETags being sent to RB]] (duration: 00m 29s) [15:48:25] ^ Krenair sync'd for wmf.1 [15:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:26] pretty weird.....that doesn't usually happen [15:48:41] looks like it isn't coming down either [15:49:14] thcipriani, already got log entries, thanks! [15:49:37] Krenair: awesome. 
thanks :) [15:51:23] 06Operations, 06Discovery, 10Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2309522 (10Gehel) [15:51:57] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [15:54:49] ottomata: from what I can see in netstat there aren't such many connections, meanwhile /proc/net/ip_conntrack shows a lot of connections in TIME_WAIT with mw hosts [15:54:59] it might just be a temp thing [15:55:38] (03PS1) 10Cmjohnson: Adding dns entriesf or db1079-94. [dns] - 10https://gerrit.wikimedia.org/r/289681 [15:55:41] (not sure if I checked the right proc file ) [16:00:04] godog moritzm _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T1600). [16:00:04] urandom: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:22] already merged! [16:01:04] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2309559 (10Cmjohnson) @elukey sorry I haven't looked into this yet....I did request a new disk from Dell Congratulations: Work Order SR929923300 was successfully submitted. [16:02:08] (03PS3) 10BBlack: VCL: v4 deliver+synth refactoring [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) [16:07:45] (03PS4) 10BBlack: VCL: v4 deliver+synth refactoring [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) [16:08:32] (03CR) 10Ema: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) (owner: 10BBlack) [16:08:49] (03PS2) 10Cmjohnson: Adding dns entriesf or db1079-94. 
[dns] - 10https://gerrit.wikimedia.org/r/289681 [16:09:39] (03CR) 10BBlack: [C: 032] VCL: v4 deliver+synth refactoring [puppet] - 10https://gerrit.wikimedia.org/r/289588 (https://phabricator.wikimedia.org/T135696) (owner: 10BBlack) [16:10:15] (03PS1) 10Giuseppe Lavagetto: cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) [16:10:50] <_joe_> urandom, mobrovac ^^ [16:11:05] thanks _joe_! [16:11:05] <_joe_> not merging it now though, it's beer'o clock [16:11:15] (03CR) 10Cmjohnson: [C: 032] Adding dns entriesf or db1079-94. [dns] - 10https://gerrit.wikimedia.org/r/289681 (owner: 10Cmjohnson) [16:11:45] <_joe_> elukey: oh right you're part of the equation too [16:11:49] <_joe_> cassandra-slaves [16:11:55] hahahha [16:12:00] <_joe_> let's add that group to admin [16:12:04] yeah and the issue has bitten me today [16:12:04] that was fast _joe_! [16:13:38] _joe_: what will this do exactly? [16:14:07] _joe_: does that control the version that is installed if the package needs installing? 
[16:14:30] <_joe_> urandom: it just checks the package version [16:14:36] <_joe_> if it's different than the choice [16:14:42] <_joe_> it tries to install it [16:15:06] <_joe_> so whenever you want to test a new version you can still do it with one puppet commit [16:15:11] so if it's different, Puppet will error out [16:15:25] <_joe_> if it's different and the good one is not available, yes [16:15:38] <_joe_> so say you have a server that is on 2.1.12 [16:15:43] <_joe_> you want 2.1.13 [16:16:00] <_joe_> if 2.1.13 is in the repo, it will be upgraded [16:16:32] <_joe_> say cassandra is not installed,it will install 2.1.13 or die trying [16:17:05] _joe_: gotcha [16:17:26] so TL;DR, if the wrong thing happens, it's because someone put the wrong thing in Puppet [16:17:34] <_joe_> yes [16:17:49] the surface area for human error is reduced to cassandra::version in puppet [16:17:49] <_joe_> which is harder to do than reprepro update forgetting --restrict [16:17:55] <_joe_> yeah [16:18:00] cool [16:18:07] in the rolling upgrade case what would happen? if puppet runs a downgrade will happen? [16:18:11] yeah, that's definitely better than the status quo [16:19:00] godog: i guess you'd have to update puppet to prevent that from happening? [16:19:35] ah yeah, on a per-machine basis in that case [16:19:54] or perhaps for more than one, if puppet is disabled, and you are reenabling as you go [16:20:58] yeah [16:21:42] heh, just noticed that _joe_ addressed this in the commit message, as well [16:21:50] * urandom reads the fm [16:21:56] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2309642 (10Cmjohnson) @volans @jcrespo Slight change in racking for row B db1085 and db1086 will be in B3 now. 
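The pinning behaviour _joe_ describes above (16:14 to 16:17) can be sketched roughly as follows. This is only an illustration in Python, not the Puppet change under review: `resolve_action`, its arguments and the returned strings are hypothetical names, and only the 2.1.12 / 2.1.13 / 2.1.14 version numbers come from the discussion.

```python
def resolve_action(installed, pinned, available):
    """Decide what a run would do for a version-pinned package (sketch).

    installed: currently installed version, or None if not installed
    pinned:    the version requested in configuration (cassandra::version
               in the discussion; here just a plain string)
    available: set of versions the package repository offers
    """
    if installed == pinned:
        return "nothing to do"
    if pinned not in available:
        # "install 2.1.13 or die trying": the run errors out instead of
        # silently accepting whatever version the repo happens to carry.
        raise RuntimeError(f"pinned version {pinned} not available")
    if installed is None:
        return f"install {pinned}"
    # Covers upgrades and downgrades alike, which is why godog asks about
    # the rolling-upgrade case: the pin has to be updated first.
    return f"change {installed} -> {pinned}"
```

Under these assumptions, a host on 2.1.12 with a pin of 2.1.13 gets changed as long as 2.1.13 is in the repo, and the run fails if the repo only carries 2.1.14.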
[16:22:25] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [16:26:44] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2309648 (10Cmjohnson) [16:27:49] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail [16:29:24] !log upgrading cassandra from 2.1.12 to 2.1.13 on aqs1001.eqiad.mwnet [16:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:30] (03PS1) 10Eevans: Upgrade xenon to Cassandra 2.2 [puppet] - 10https://gerrit.wikimedia.org/r/289685 (https://phabricator.wikimedia.org/T126629) [16:31:28] !log Disabling puppet on xenon.eqiad.wmnet in preparation for Cassandra upgrade : T126629 [16:31:29] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:47] !log Set runtime value for max_allowed_packet, innodb_buffer_pool_dump_at_shutdown, innodb_buffer_pool_load_at_startup to their configured values for s1-s7, es1-es3, x1 T133333 [16:31:48] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [16:31:49] (03PS1) 10Jcrespo: Repool db1033 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289686 (https://phabricator.wikimedia.org/T134555) [16:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:15] twentyafterfour, chasemp: So we have one user who creates a lot of accounts in Phabricator to edit-war. And I'm wondering if we can somehow block the IP - see https://phabricator.wikimedia.org/people/logs/query/Ejdov.M1Ld1x/#R [16:32:18] * andre__ has to leave soon [16:32:29] (03CR) 10Jcrespo: [C: 04-1] "Wait for slave sync + some buffer pool warmup." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/289686 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [16:32:38] I've disabled all accounts so far... [16:33:56] * MatmaRex muses about anti-vandalism features [16:36:04] PROBLEM - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused [16:36:18] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2309691 (10Gilles) [16:36:27] checkin aqs1001, probably bootstrapping [16:37:40] should be good now [16:37:47] (03PS2) 10Jcrespo: Repool db1033 after maint. with low weight; increase db1029 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289686 (https://phabricator.wikimedia.org/T134555) [16:37:48] (come on icinga give me some relief) [16:38:05] RECOVERY - cassandra CQL 10.64.0.123:9042 on aqs1001 is OK: TCP OK - 0.002 second response time on port 9042 [16:38:10] thank youuuuu [16:38:24] urandom: ---^ aqs1001 upgraded to 2.1.13 [16:38:35] (03CR) 10Krinkle: [C: 04-1] "Thanks, I was going to file a task for this since ULS is one of the last users of /static/*/ directories that point to MediaWiki source co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (owner: 10Nikerabbit) [16:38:49] elukey: \o/ [16:38:53] (03CR) 10Jcrespo: [C: 04-2] "db1033 still not ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289686 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [16:41:31] (03CR) 10Nikerabbit: "We append the font version in the url. Is that not sufficient or does static.php only work with file hashes?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (owner: 10Nikerabbit) [16:45:08] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2309727 (10Gehel) Custom function imported, no more errors in postgres logs. 
But tilerator, tileratorui and karthotherian are still not... [16:46:17] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2309734 (10Gilles) TODO: fix libthumbor's package description (copypasta) [16:52:34] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:54:38] (03CR) 10Krinkle: "It has to be a file hash so that it can be verified. Among other thigns this enables the fundamental part of multiversion which is to be a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (owner: 10Nikerabbit) [17:00:05] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T1700). Please do the needful. [17:04:59] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail [17:05:56] (03PS1) 10Ema: Allow ganglia to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/289696 [17:07:28] (03CR) 10jenkins-bot: [V: 04-1] Allow ganglia to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/289696 (owner: 10Ema) [17:09:51] (03CR) 10Jcrespo: [C: 032] Repool db1033 after maint. 
with low weight; increase db1029 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289686 (https://phabricator.wikimedia.org/T134555) (owner: 10Jcrespo) [17:10:49] andre__afk: we can block the ip, short term it would probably do what you want [17:11:13] (03PS2) 10Ema: Allow ganglia to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/289696 [17:13:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1033 after maintenance with low weight; increase db1029 weight (duration: 00m 29s) [17:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:08] (03Abandoned) 10Ema: Allow ganglia user to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/281918 (owner: 10Ema) [17:22:39] (03CR) 10Nikerabbit: "Okay I'll file a task for that. Is there code we can reuse to produce the hash?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (owner: 10Nikerabbit) [17:24:34] 06Operations, 10ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2309860 (10Ottomata) Hm. Ok today I ran an election to bring kafka1013 back as a leader. Immediately after, `nf_conntrack_count` almost doubled, mostly on TIME_WAITS from MW app servers. https://grafana-admin.wik... 
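For reference, the "nf_conntrack is N % full" figure in the kafka1013 alert above is presumably the ratio of the kernel's current conntrack entry count to its configured maximum. A minimal sketch, assuming the standard `/proc/sys/net/netfilter` files; the function name `conntrack_percent` and the `base` parameter are hypothetical, and this is not the actual Icinga check:

```python
from pathlib import Path


def conntrack_percent(base="/proc/sys/net/netfilter"):
    """Return how full the conntrack table is, as a percentage.

    base is a parameter only so the function can be pointed at a test
    directory; on a real host the default path applies.
    """
    count = int(Path(base, "nf_conntrack_count").read_text())
    maximum = int(Path(base, "nf_conntrack_max").read_text())
    return 100.0 * count / maximum
```

Raising `nf_conntrack_max` lowers this percentage without changing the number of tracked connections, so it is a stopgap and does not address why the count grew.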
[17:24:40] RECOVERY - Check size of conntrack table on kafka1013 is OK: OK: nf_conntrack is 46 % full [17:25:07] !log execute sysctl -w net.netfilter.nf_conntrack_max=512000 on kafka1013 as temporary measure (investigating why conntrack count is higher after leader election) - T135557 [17:25:08] T135557: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557 [17:30:39] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:33:02] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.88 seconds [17:34:14] is doing the alter table, not in production [17:37:35] hey ops, we're about to deploy parsoid, in our usual window. any problem with that? [17:39:11] !log starting Parsoid deploy (of 67816adf) [17:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:22] why lag? [17:40:53] urandom: I guess that https://gerrit.wikimedia.org/r/#/c/289683/1/modules/cassandra/manifests/init.pp needs me to upgrade first right? Or at least to set hiera [17:41:02] it has been running for 6669 seconds, I wonder why lag now [17:41:32] and lag going down [17:41:46] strange [17:42:08] yes, I was taking a look at the innodb status [17:42:49] disk? some kind of replication incompatibility? 
[17:43:14] I hope that doesn't happen on eqiad [17:43:21] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [17:44:33] elukey: once merged, it will cause a puppet error on any machine not running either 2.1.13 or 2.2.6 (depending on whether cassandra::target_version is 2.1 or 2.2 respectively), unless you've overridden cassandra::version [17:44:51] I didn't see any query stuck on the replication, the replica thread was almost all the time in Waiting for master to send event [17:45:20] !log synced code; restarted Parsoid on wtp1001.eqiad.wmnet as a canary [17:45:57] elukey: come to think of it, we should make sure that situation applies everywhere before merging that (maps is still on 2.1.12 for example, i think) [17:46:01] as per I/O I have some iostat, that shows that was writing and sometimes doing a big sync jynus [17:46:55] and the alter finished if you didn't interrupt it, so it's something done at the end [17:47:43] urandom: all right adding a comment in the CR [17:48:58] (03CR) 10Elukey: "AQS is still half on 2.1.12 (I am completing the upgrade tomorrow) and maps too (as urandom mentioned on IRC). 
So it might be wise not to " [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [17:49:26] hey ops, git-deploy seems to be having trouble with the checkout stage [17:50:04] i seem to be stuck at "21/44 minions completed checkout" [17:50:17] any ideas [17:51:09] 06Operations, 10ops-eqiad, 10Analytics-Cluster: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2309973 (10Ottomata) [17:52:07] "checkout status: 50" [17:53:05] oh, there are dirty repos :( [17:54:53] (03PS1) 10ArielGlenn: remove cron job on snapshots that generates list of media upload dirs [puppet] - 10https://gerrit.wikimedia.org/r/289700 [18:00:01] (03PS2) 10Dzahn: mv files/misc/scripts/Makefile to scap module [puppet] - 10https://gerrit.wikimedia.org/r/289351 [18:00:11] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:12] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:40] cscott: probably releng would be best to help, thcipriani or twentyafterfour^ [18:01:16] chasemp: i figured it out, the depos were dirty with some sketchy patch applied during the last outage. [18:01:24] ah [18:01:39] chasemp: i'm just going to revert the attempted deploy until subbu/arlo are back online and can tell me if the dirty hack is still needed [18:02:20] (03PS2) 10Dzahn: logging: move files/misc/demux.py to modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289353 [18:02:31] yeah, I'm not a ton of help with Trebuchet stuff. I can tell you that it failed running git submodule update --recursive --init remotely if it returned 50 [18:02:41] (03PS1) 10Ottomata: Make dumps.wikimedia.org access logs readable on stat1002 dest, also only rsync *.gz files [puppet] - 10https://gerrit.wikimedia.org/r/289702 (https://phabricator.wikimedia.org/T134776) [18:02:49] cscott, here now ... 
ori had applied it during the restbase/changeprop/parsoid outage. [18:02:53] chasemp, csoit can be reverted. [18:03:31] chasemp, cscott so it can be reverted. [18:03:35] (03PS2) 10Dzahn: mv files/misc/udp2log.init into modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289354 [18:03:40] but, requires a root to do it. i or you cannot do it. [18:03:50] (03PS2) 10ArielGlenn: remove cron job on snapshots that generates list of media upload dirs [puppet] - 10https://gerrit.wikimedia.org/r/289700 [18:04:08] subbu: ok. [18:04:26] ori (or some other op): can you revert the parsoid patch? [18:04:43] is _joe_ the one on duty? [18:05:35] !log git-deploy of Parsoid failed with "21/44 minions completed checkout" due to dirty repos, root had applied patch during the restbase/changeprop/parsoid outage. [18:05:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:52] cscott, have you reverted the deploy? [18:06:11] otherwise, half the nodes will have the new checkout. [18:06:15] not yet, the repos are inconsistent right now. i was about to revert it when you popped up. [18:06:20] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:06:34] i was going to quickly see if ori or whoever could revert the patch to let us complete the deploy [18:06:38] ok. [18:07:13] i'm guessting the 21 minions who failed checkout are the eqiad minions, and it's just codfw which has the new code checked out [18:08:10] thcipriani, twentyafterfour: do either of you have root on the parsoid cluster? [18:08:12] don't assume that unless you know for sure. 
[18:08:25] (03CR) 10Ottomata: [C: 031] mv files/misc/udp2log.init into modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289354 (owner: 10Dzahn) [18:08:29] cscott: no I don't, sorry :( [18:08:30] (03CR) 10Eevans: [C: 04-1] "Actually, I am +1 on the idea here, but before merging we should either a) upgrade any machines still running 2.1.12 (maps-test*, at least" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [18:08:30] there are 24 eqiad nodes and 16 codfw nodes. [18:08:37] cscott, just revert for now. [18:09:05] subbu: yeah, i think that's best. let's treat "revert the dirty repos" as a separate deploy which needs to happen before we can move forward with our usual deploys again. [18:09:14] k [18:09:21] (03CR) 10ArielGlenn: [C: 032] remove cron job on snapshots that generates list of media upload dirs [puppet] - 10https://gerrit.wikimedia.org/r/289700 (owner: 10ArielGlenn) [18:09:28] !log starting to revert Parsoid deploy due to unresolved dirty repos [18:09:33] (03PS2) 10Ottomata: Make dumps.wikimedia.org access logs readable on stat1002 dest, also only rsync *.gz files [puppet] - 10https://gerrit.wikimedia.org/r/289702 (https://phabricator.wikimedia.org/T134776) [18:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:48] (03CR) 10Eevans: "Or... what elukey said. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [18:10:13] (03CR) 10Ottomata: [C: 032 V: 032] Make dumps.wikimedia.org access logs readable on stat1002 dest, also only rsync *.gz files [puppet] - 10https://gerrit.wikimedia.org/r/289702 (https://phabricator.wikimedia.org/T134776) (owner: 10Ottomata) [18:13:44] !log parsoid deploy reverted to parsoid/deploy-sync-20160504-200410 tag (b0d015fa); 21 repos still dirty [18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:09] cscott, still dirty? [18:14:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:14:29] cscott, subbu: I'm not working today, but I can take a look. What do you need me to do exactly? [18:14:49] ori, you had applied a live hack on some nodes during the outage. [18:14:57] subbu: the checkout succeeded, but the patch is still present [18:15:16] can you give me one node where this is the case as an example? [18:15:18] ori can you do a git reset? [18:15:35] wtp1001.eqiad [18:15:42] looking [18:15:43] cscott@wtp1001:~$ cd /srv/deployment/parsoid/deploy/src/ [18:15:43] cscott@wtp1001:/srv/deployment/parsoid/deploy/src$ git diff [18:16:33] i suspect the intent was to apply this patch to wtp*.eqiad.wmnet, but only 21 hosts appear to actually have the patch in place [18:17:07] (03PS1) 10ArielGlenn: remove leftover conf file from media upload dirs cron job on snapshots [puppet] - 10https://gerrit.wikimedia.org/r/289711 [18:17:36] apergos: great. Thanks! [18:17:48] wtp1002 is missing the patch. [18:17:48] i ran: salt 'wtp*' cmd.run "sed -i -e '/106801025/d' /srv/deployment/parsoid/deploy/src/lib/api/routes.js" [18:17:58] they should be all good now [18:18:02] let's check [18:18:13] got another example host ? 
[18:18:16] wtp1001 seems to be clean [18:18:25] wtp1003 had it [18:18:47] the easiest way might be for me to restart the parsoid deploy i was attempting, and see if it fails again on any hosts during the checkout phase [18:19:00] subbu@earth:~$ for wtp in `ssh ssastry@bast1001.wikimedia.org cat /etc/dsh/group/parsoid` ; do echo $wtp ; ssh ssastry@$wtp 'cd /srv/deployment/parsoid/deploy/src && git diff' ; done [18:19:06] shows no diffs on eqiad nodes. [18:19:19] so, looks like that worked. [18:19:21] (subbu it was actually 23 hosts, not 21 -- i was misreading 21/44 successful. wtp1002 was the oddball.) [18:19:38] and codfw is clean as well. [18:19:43] so shall i try to actually complete the parsoid deploy? [18:20:18] sure ... the other part is whether perms have changed on the repo that wll prevent a checkout. [18:20:36] let me check on wtp1001 [18:20:40] i looked at the perms, they are already all owned by root [18:20:47] ok. [18:20:57] so, go ahead with the deploy then. [18:21:30] (and fwiw, wtp1002 was still running b0d015 -- so even though it had the new code checked out, it didn't spontaneously restart while i was figuring stuff out) [18:21:48] ...which is a shame, since i was going to go ahead and use wtp1002 as my canary if it had been running the new code all along. [18:21:50] anyway... [18:22:19] ori, thanks! 
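For reference, the cleanup ori ran with salt is a sed delete: `sed -i -e '/106801025/d'` removes every line of routes.js that matches 106801025 and leaves the rest untouched. A minimal Python sketch of the same operation follows. The file content used here is invented for illustration; the actual live hack is not shown in the log.

```python
import re


def delete_matching_lines(text: str, pattern: str) -> str:
    """Drop every line that matches `pattern`, like sed -e '/pattern/d'."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not re.search(pattern, line)
    ]
    return "".join(kept)


# Hypothetical file content: the real routes.js hack is not in the log.
patched = (
    "function handler(req, res, next) {\n"
    "    if (req.id === 106801025) { return res.end(); }  // live hack\n"
    "    next();\n"
    "}\n"
)
cleaned = delete_matching_lines(patched, "106801025")
```

Running the delete a second time is a no-op, which is presumably why applying it to all of `wtp*` was safe even though only some hosts carried the patch.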
[18:22:25] !log to cleanup Parsoid repos ori ran: salt 'wtp*' cmd.run "sed -i -e '/106801025/d' /srv/deployment/parsoid/deploy/src/lib/api/routes.js" [18:22:37] cscott: thanks [18:22:50] o/ [18:22:54] (03PS2) 10Ottomata: Upgrade xenon to Cassandra 2.2 [puppet] - 10https://gerrit.wikimedia.org/r/289685 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:23:14] !log re-attempting parsoid deploy of 67816adf [18:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:34] (03CR) 10Ottomata: [C: 032 V: 032] Upgrade xenon to Cassandra 2.2 [puppet] - 10https://gerrit.wikimedia.org/r/289685 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:24:05] (03Abandoned) 10ArielGlenn: remove leftover conf file from media upload dirs cron job on snapshots [puppet] - 10https://gerrit.wikimedia.org/r/289711 (owner: 10ArielGlenn) [18:25:17] !log Stopping Cassandra on xenon.eqiad.wmnet : T126629 [18:25:18] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:31] 06Operations, 10DBA, 13Patch-For-Review: Investigate/decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#2310087 (10Dzahn) After we talked on IRC i am using db2007 to test upgrading RT (T119112) which involves a schema change. It's in Icinga as a host, but mariadb/mysql was already removed and th... 
[18:26:08] !log synced; restarted Parsoid on wtp1001.eqiad as a canary [18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:20] !log Upgrading Cassandra on xenon.eqiad.wmnet to 2.2.6 : T126629 [18:26:21] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:36] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:27:29] !log Enabling puppet on xenon.eqiad.wmnet and forcing run : T126629 [18:27:30] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:53] subbu: https://logstash.wikimedia.org/#dashboard/temp/AVTKSKqBDxp7yus2hz9_ looks clean on the canary [18:31:27] lgtm [18:31:31] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [18:32:35] (03PS1) 10Dzahn: planet/hiera: move cluster setting role, not hosts [puppet] - 10https://gerrit.wikimedia.org/r/289712 [18:33:59] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:34:52] restarting the rest of the parsoid nodes [18:35:29] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.002 second response time on port 9042 [18:35:40] bblack: hola. do you have a few minutes to talk about the https://phabricator.wikimedia.org/T127883 (lazy loading)? 
[18:36:08] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [18:38:24] !log updated Parsoid to version 67816adf (T100681, T130638) [18:38:26] T130638: Add data-mw as a separate JSON blob in the pagebundle output of Parsoid's API - https://phabricator.wikimedia.org/T130638 [18:38:26] T100681: Deprecate and remove Parsoid's "v1" API (and also the "v2" API while we're at it) - https://phabricator.wikimedia.org/T100681 [18:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:13] (03CR) 10Krinkle: "Yes, I'd recommend using OutputPage::transformFilePath() or OutputPage::transformResourcePath()." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (owner: 10Nikerabbit) [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T1900). [19:00:16] lies, it's not me [19:02:14] that's me [19:15:23] nuria_: yes [19:15:44] nuria_: it's not really my thing, though. I'm just the guy who has to estimate and/or implement whatever's asked for in varnish :) [19:15:59] bblack: nah, you got the power. doesn't it seem a little heavy to enable lazy loading for 50% of the whole user base [19:16:07] bblack: all at once? [19:16:45] nuria_: I as understand it (and I could be wrong), this was supposed to be experimented on some limited wikis first, before the 50% test. [19:17:24] nuria_: (and as a counter-point from the pro-lazyload side: technically, we do sometimes try things for half or all the userbase at once, if it's not deemed a risky experiment. I don't know how risky this really is or not) [19:17:58] bblack: ok, maybe we have to ask how risky it is 1st. 
I have noted that on wiki [19:18:08] bblack: sorry on phab [19:18:12] nuria_: ok, thanks :) [19:18:43] bblack: the other thing i was thinking about is that we might have experiments running into each other (hovercards and lazy come to mind) [19:19:42] nuria_: yeah, I definitely feel for that one. at many levels, we've been running experiments concurrently the past several months (and really, we need "normal" space between them to establish new patterns to compare, too)... [19:20:00] bblack: so i think we will benefit from having a way to centralize these things on varnish, we do not have to get to that now but i think a dedicated weblab.erb that describes experiments (analytics.erb works too) [19:20:13] nuria_: it's kind of been unavoidable (for many of the past cases I'm thinking about) because the rate of "things we want to change" and our ability to test them quickly without stepping on each other just isn't working out [19:20:21] bblack: i can write something up on wikitech and that leads me to my 3rd concern [19:20:32] bblack: for an even split i do not think ips will work [19:21:08] *would work [19:21:34] bblack: cause we know for a fact that our ip ranges are not equally distributted [19:21:36] nuria_: re: generic infrastructure for a/b (or (a/b)/c) testing, yes, it's sorely needed, but nobody has the time and it's kind of complicated. it comes up just about every time [19:22:16] bblack: we will own that, if with lower priority that edit data for now [19:22:28] nuria_: the IP split was discussed in detail (but maybe not all in that ticket? there were a few). It would be split on the final digit of the IP, not at the per-network level. Tilman ran some stats on the final digits, they're pretty close to even fro this purpose. [19:22:36] so it looks like wmf.2 is unblocked, I'm going to deploy to group2 [19:22:48] bblack: ah ok, very well. 
[19:23:33] nuria_: links re IPs from: https://phabricator.wikimedia.org/T127883#2288277 [19:23:36] bblack: i think to get things on the right track we can: [19:24:15] bblack: if tilman looked at it i am sure he is right, and that makes sense, cause full range will not be equally distribbuted [19:24:30] (03PS1) 1020after4: all wikis to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289719 [19:24:52] (03CR) 1020after4: [C: 032] all wikis to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289719 (owner: 1020after4) [19:25:40] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289719 (owner: 1020after4) [19:26:21] nuria_: what we've talked about to be more-precise about bucketing "users" (since, for the general case, we have no formal notion of a user anyways if they're not logged in...) is using IPs as a good-enough proxy for users for test-binning [19:26:54] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.2 [19:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:06] nuria_: the problem is doing the last-digit thing is coarse and imprecise. We only get ~10% increments, there could be correlations we don't know about (that for an arcane reason, last digit 5 has way more mobile than desktop, etc..) [19:27:35] bblack: could we random sample and drive stickyness with a cookie? [19:28:06] nuria_: so for an immediate slightly-better step, we've talked about hashing IPs. If we had support in varnish to md5(IP), it would get past ipv4-vs-ipv6, and let us split up the hash in fine-grained amounts (0.5% in bin A, 37.2% in bin B, etc)... [19:28:33] bblack: like if (1 in 10.000) { if( 1 in 2) {set cookie} ="Weblab: lazy=A" else {set cookie="Weblab:lazy=B"} [19:29:10] if you mean randomly per-request, I think that wouldn't be stable. it would grow over time. 
[19:29:32] (and/or it would shift the same user in and back out of the test randomly, too) [19:29:42] bblack: unless we persist that in weblab cookie [19:30:38] bblack: so we only do the cookie setting if not set .. but i do not understand why will the random "grow" over time [19:30:38] well yeah, but given your example.... everyone who has hit the 1/10.000 chance on a random request would be 50/50 split into bins, but the percentage of clients that have made it past the 1/10.000 gate would continuously grow over time until it was the whole userbase eventually. [19:30:39] (03PS1) 10Eevans: Upgrade remaining RESTBase staging nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/289722 (https://phabricator.wikimedia.org/T126629) [19:31:31] bblack: ay, i do not understand why "will grow over time" if cookie has a clear expiration date [19:31:38] (if the 1/10.000 is on raw requests, without regard to "users" somehow, which IP is an approximation for) [19:32:02] !log Disabling puppet in RESTBase staging : T126629 [19:32:03] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [19:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:34] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#2310273 (10egalvezwmf) http://i.imgur.com/ChzUb.jpg :) I was trying different things - got it to work. Thanks for the info! [19:33:11] bblack: ay... [19:33:47] bblack: why is it different to use requests vs ips? (ahhh... because of img and js requests?) [19:34:00] not just that. but yes, that makes it worse [19:34:05] even if one pageview == one request [19:34:44] Over the course of a given time window (say one hour), we'd set a cookie for 1/10K reqs during that hour. We'd set it many times for many random requests, some from the same client, some different. 
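bblack's "it would grow over time" point can be checked with a little arithmetic. If every request independently has a 1/10000 chance of being handed a sticky cookie, a client that has made n requests holds the cookie with probability 1 - (1 - p)^n, which tends to 1 as n grows. A rough sketch, assuming independent requests and a cookie that never expires (both simplifications; nuria_'s expiry date would cap the growth but not remove it):

```python
def fraction_tagged(p: float, requests_per_client: int) -> float:
    """Probability that a client has been given the cookie at least once
    after `requests_per_client` requests, each tagged with probability p."""
    return 1.0 - (1.0 - p) ** requests_per_client


P = 1 / 10_000  # the 1-in-10.000 gate from the discussion

for n in (1, 100, 10_000, 100_000):
    print(f"{n:>7} requests per client -> {fraction_tagged(P, n):.4%} tagged")
```

The per-request rate stays fixed, but the share of clients past the gate depends on how many requests each client has made, so it is not a stable percentage of users.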
[19:35:03] bblack: so it would work better if client does the 1 in 10.000 and on top of that varnish does the bucketing then [19:35:14] (03PS5) 10Aaron Schulz: Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) [19:35:36] maybe in the first few minutes, it actually would set a cookie for 1/10K "users" that hit us during that minute. But with repeat traffic from users, over a period sufficiently long (days, weeks?) it would touch every user, unless they've never browsed us before and it's their first hit ever. [19:35:39] bblack: and in cleint says " i want to be counted in weblab" and server answers with a cookie like " set Cookie Weblab: lazy=A" [19:35:54] bblack: or " set Cookie Weblab: lazy=B" [19:36:13] bblack: but yes, that is more complicated than what we wnat to do now [19:36:15] *want [19:36:20] we've talked about client-side sampling before, but there are a few caveats [19:36:31] bblack: like? [19:36:35] 1) It relies on client-side JS, and some things we want to test, we don't want to only test on JS-capable clients [19:36:53] not necessarily [19:37:39] 2) The client has to make the split-call. If we want to put 14% of users in the test-bucket, the client does the random-number generation and checks the 14% threshold itself, and uses localstorage/cookie to persist the decision [19:38:04] how does a non-JS client do that? [19:38:50] bblack: it depends how much control you have of headers on php end , you can calculate that on php end and add it to headers [19:39:07] bblack: but right, in our case we cache phpo heavily so it no work [19:39:11] without unique-user tracking somehow, or client code running with client-side localstorage/cookies, our only proxy for binning users is IPs, IMHO. 
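The hashed-IP binning bblack floated earlier (md5 of the client IP, split into fine-grained bins such as 0.5% in A and 37.2% in B) might look roughly like the sketch below. This is Python rather than VCL, and the bucket count, the function names, and the "control" label are assumptions made for illustration, not anything that exists in the varnish config.

```python
import hashlib
import ipaddress

BUCKETS = 1000  # assumed granularity: each bucket is 0.1% of hashed IPs


def ip_bucket(ip: str) -> int:
    """Map an IPv4 or IPv6 address to a stable bucket in [0, BUCKETS)."""
    packed = ipaddress.ip_address(ip).packed  # 4 or 16 raw bytes
    digest = hashlib.md5(packed).digest()     # hashing, not security
    return int.from_bytes(digest[:8], "big") % BUCKETS


def ip_bin(ip: str) -> str:
    """Assign the example split from the discussion: 0.5% A, 37.2% B."""
    bucket = ip_bucket(ip)
    if bucket < 5:            # buckets 0-4   -> 0.5%
        return "A"
    if bucket < 5 + 372:      # buckets 5-376 -> 37.2%
        return "B"
    return "control"          # everything else is outside the test
```

The same address always lands in the same bin with no cookie or JS involved, and the bin sizes track the share of IPs rather than the share of requests. The trade-off is the one already raised in the channel: a client that changes IP can move in or out of a bin.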
[19:39:36] I think uniques is a no-go, and client code isn't always there (and it may be important that tests fairly sample non-JS clients), that leaves IPs [19:40:38] bblack: while js is not so big of a concern (is about 5% of user base) i think ips is teh best choice yes [19:40:40] once we're splitting on IPs, even using cookies is a little imprecise (but works for short terms) [19:41:05] because users will fall in and out of the random IP sample as their DSL switches client IPs or their mobile phone moves networks [19:41:18] bblack: taht is why we need a cookie regardless [19:41:22] *that [19:41:32] it's why we *don't* need that cookie, IMHO :) [19:42:00] bblack: true if we do not want to track changes across different sessions [19:42:12] if our starting point is to randomly-select 5% of all IPs (by hashing for randomness), and set a cookie in them (let's say 2.5% each A and B) [19:42:34] bblack: and fater not worry about IP, yes [19:42:35] over time we'll end up hitting more than 5% of devices with the cookies, which stick. because they all move IPs slowly over time [19:42:36] *after [19:42:52] bblack: ah yes [19:42:54] ay ay [19:43:14] (unless we also un-set the cookie when we set it from an inappropriate IP. then we're at least consistently testing a given % of IPs, and clients move in and out of testing as they move networks) [19:43:25] s/when we set it/when we see it/ [19:43:48] but at that point... we really don't need a cookie to do the binning anyways. [19:45:10] the varnish code can make a choice on request reception based on IP, binning IPs with arbitrary precision and completely randomness. whoever's in the bin gets a special request header added to their requests that both (a) splits varnish caching and (b) is sent to the application to turn on the feature for those responses [19:45:26] (03PS1) 10Ottomata: Install heirloom-mailx on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/289724 [19:45:33] then there's no cookie, no JS, etc... 
the only real downside is clients that move IPs/Networks will fall in and out of testing over time. [19:45:49] (03CR) 10Ottomata: [C: 032 V: 032] Install heirloom-mailx on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/289724 (owner: 10Ottomata) [19:46:48] if we use a client cookie to keep the device's test-participation constant as it moves networks/IPs... there's no good way to keep the sample-size precise. clients that switch away from test-binned IPs keep the cookie. clients that switch to test-binned IPs get the cookie too [19:47:02] so you may start at 5% of IPs, and weeks later you've got 6 or 8% of clients, or whatever it is. [19:47:46] bblack: yes, if you assume you are not using js [19:47:51] right [19:47:55] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:48:14] if we have JS available, this gets easy. We just need to communicate JS to the browser and have it randomly-bin itself and tell us with a header or a cookie. [19:48:38] we could even automate that in a way that doesn't require changing the client-side JS for new tests [19:49:03] bblack: yes, it is looking more and more like we are going to have to do that [19:49:07] (if the JS itself isn't cached too long) [19:49:38] we could perhaps just have generic JS pre-bin to the precision we need, and not involve the client in specific test protocols/labels. [19:49:40] bblack: the random-bin set code we alredy have cause i added it to mediawiki a while back , we will just need to add teh weblab cookie [19:50:22] e.g. in static JS all clients get, all clients bin themselves into bins of 0.01%, which would be integer bin numbers 1-10000. They all send us a Cookie: header that says "bin=234" or whatever. 
[19:50:26] bblack: right the random identifier in mw unqiue to our user base so we can do that [19:50:28] and we make feature decisions on that [19:50:54] then we can decide in varnish that client bins 0-200 (2% of the userbase) get feature A turned on today [19:51:08] bblack: we would also need varnish to handle the cookie "weblab" or something that will store what features are turned on [19:51:13] it's like a unique, but it's a hashed-down unique. it only tells us which .01% of the userbase the user is persistently in. [19:51:38] we probably don't need that much precision, though, and too much precision makes it too close to unique-tracking [19:51:49] maybe 0.1% will do? [19:51:53] or even 1% :) [19:51:56] it'll only work for repeat views though, not first visits. [19:52:18] bblack: yes, but for experimentation is fine [19:52:18] yeah, but we can fix that up I think [19:52:27] i do not think we have to [19:52:30] nuria_: some experiments do care greatly about first-views... [19:53:26] (if nothing else, second views are different with how much of the site is 304'd or cached completely) [19:53:31] note that cookies and cache live for about 2 weeks typically. So after 2 weeks, a repeat view is a first view. And then there is device upgrades/switches which clear the state. [19:53:46] (client-side cache/cookies that is) [19:53:58] browsers clear it based on how often a user uses a site and the available space. [19:54:07] regardless of cookie expiration. [19:54:13] recently read an interesting paper about https://en.wikipedia.org/wiki/Canvas_fingerprinting [19:54:34] (03PS1) 10Dzahn: temp. setup to use db2007 for RT upgrade test [puppet] - 10https://gerrit.wikimedia.org/r/289725 (https://phabricator.wikimedia.org/T119112) [19:54:36] more durable than cookies ;) [19:54:38] the no-JS, no-cookies, varnish hashing IPs solution avoids all of these problems though. 
the only significant problem it raises is that individual clients will fall in and out of testing as they switch IPs [19:54:51] Krinkle: in this case how long the cookie will live is up to us [19:54:56] gwicke: yeah but fingerprinting is putting us back in the territory of uniques... [19:55:17] Krinkle: user cookies are not recycled after two weeks rehgardless of expiration time [19:55:24] nuria_: No, it isn't. Cookies do not live for longer than 2 weeks. Usually shorter (10-14 days), especially on mobile. [19:55:34] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:56:06] Krinkle: mmm...That is not what we have seen in the past sorry [19:56:17] bblack: I know, wasn't 100% serious about it -- but it would help with the "long-term experiment with reliable group association" bit [19:56:18] (03CR) 10jenkins-bot: [V: 04-1] temp. setup to use db2007 for RT upgrade test [puppet] - 10https://gerrit.wikimedia.org/r/289725 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [19:56:42] (but the no-JS/no-cookie solution gives you as an upside: works on first visit, works if JS/cookies disabled, and percentage of clients is stable over time) [19:56:48] bblack: you can hash that down to low precision, too [19:56:57] nuria_: There are exceptions, but I'm merely mentioning it as something to be aware of. Browsers do clear caches and cookies (always as a whole, not individual objects) based on available space and frequency of use. This has always been the case on the web. [19:56:59] bblack: i will write some of this ideas up and create ticket so we do not forget [19:57:04] so, basically the same uniqueness as any cookie approach, but more durable association [19:57:11] (03PS3) 10Dzahn: mv files/misc/udp2log.init into modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289354 [19:57:15] gwicke: true, we could do some very advanced fingerprinting, that will get us good precision, and hash down to 1K buckets or something. 
[19:57:18] It's done per origin (full domain name) [19:57:28] but for this purpose, cookies is fine. [19:57:38] (03PS2) 10Ottomata: Upgrade remaining RESTBase staging nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/289722 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:57:44] (03CR) 10Ottomata: [C: 032 V: 032] Upgrade remaining RESTBase staging nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/289722 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:57:46] Krinkle: sure but recycling time of two weeks seems much to short on my experience [19:57:50] (03CR) 10Dzahn: [C: 032] "the init.erb template is being used, but not this file without .erb" [puppet] - 10https://gerrit.wikimedia.org/r/289354 (owner: 10Dzahn) [19:58:06] (03PS4) 10Dzahn: mv files/misc/udp2log.init into modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289354 [19:58:24] nuria_: https://code.facebook.com/posts/964122680272229/web-performance-cache-efficiency-exercise/ [19:58:40] (03CR) 10Dzahn: [V: 032] mv files/misc/udp2log.init into modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289354 (owner: 10Dzahn) [19:59:15] !log Stopping Cassanra on cerium.eqiad.wmnet for Cassandra 2.2.6 upgrade : T126629 [19:59:16] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [19:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:28] Krinkle:we can calculate our nocookies percentages (precisely) and i am not sure they will match those [19:59:37] (03PS2) 10Dzahn: planet/hiera: move cluster setting role, not hosts [puppet] - 10https://gerrit.wikimedia.org/r/289712 [19:59:40] Krinkle: that is a different conversation [19:59:45] (03CR) 10Dzahn: [C: 032] planet/hiera: move cluster setting role, not hosts [puppet] - 10https://gerrit.wikimedia.org/r/289712 (owner: 10Dzahn) [20:00:02] Krinkle: but we have that data too. 
[20:00:10] For things like lazy loading, only working on repeat views doesn't seem acceptable. [20:00:36] bblack: The reason IP hashing is persued is that because computing rand() in Varnish is too slow? [20:00:43] but for something like lazy loading, hashed IPs would work perfect. we really don't care if users fall in and out of the bin too much. just that we sample X% of users [20:00:47] (03PS3) 10Dzahn: planet/hiera: cluster setting by role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/289712 [20:00:49] (03CR) 10Dzahn: [V: 032] planet/hiera: cluster setting by role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/289712 (owner: 10Dzahn) [20:00:55] !log Upgrading Cassandra on cerium.eqiad.wmnet, and forcing puppet run : T126629 [20:00:56] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:01:02] (03PS2) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:05] (03PS3) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:01:15] Krinkle: for lazy loading how brandon describes it it will work on 1st visit too [20:01:19] Krinkle: varnish has a rand function, but it has no notion of a pageview, only a request. [20:01:31] nuria_: I'm aware that that would work for IP hashing as there is no cookie. [20:01:37] Krinkle: for the ip/varnish bit [20:01:44] Krinkle: ah sorry! [20:01:46] bblack: Yeah, but we do that for GeoIP cookie as well. [20:01:52] bblack: We set it if absent [20:02:02] we can do the same with a rand bucket 1-1000 or something, right? [20:02:03] yeah but that doesn't get us a percentage split.... [20:02:08] Then we won't need JS. [20:02:14] no, we're circling back to the beginning of this thread :) [20:02:19] k [20:02:24] * Krinkle finds beginning. 
[20:02:25] if you set a cookie on 1/10000 requests, eventually all clients will get it [20:02:37] it's not a way to sample a reliable percentage [20:02:53] bblack: OK, I'm not saying the same as beginning, not a loop. [20:03:00] bblack: Don't sampel the cookie, sample the value. [20:03:01] jajaja [20:03:09] Always set the cookie, just like GeoIP [20:03:19] With a random value that expresses your range. [20:03:25] well now we've re-implemented randomized uniques :) [20:03:44] right. [20:03:46] (03CR) 10jenkins-bot: [V: 04-1] labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 (owner: 10Rush) [20:03:57] It's like GeoIP or WMF-Last-Access but with a random assigned value instead. [20:04:02] but combined with local use of the number, yes.... [20:04:34] In order to use it to flip features, we can program Varnish to translate that cookie into a specific cookie (or header) used by MediaWiki. [20:04:36] so let me recap what you're saying here to make sure I'm clear, because I think that does work (except for no-cookie clients of course, but that's probably smaller than no-JS): [20:04:43] Krinkle: BTW, in terms of users 10% of our them come w/o cookies [20:04:51] (03PS2) 10Dereckson: Adjust groups permissions on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289656 (https://phabricator.wikimedia.org/T135736) (owner: 10Huji) [20:05:16] bblack: that shouldn't be a concern [20:05:22] bblack: the nocookies users [20:05:36] (03PS9) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:05:38] bblack: we want nocookie to be a defacto opt out from everything [20:05:43] when varnish receives each request, it checks if there's a SampleBucket cookie. If there is one, it has the SampleBucket. 
If there isn't one, it (a) randomly generates a new SampleBucket (b) sends that on for caching/applayer purposes (c) sends a Set-Cookie in the response to persist it [20:05:54] (e.g. in varnish if cookie.rand = 1-200, set X-MW-Lazy: ON, or add cookie.lazyLoading=1 whatever for the backend response) [20:06:01] 06Operations, 06Discovery, 10Maps: Configure monitoring / alerting of Postgresql / redis cluster for maps - https://phabricator.wikimedia.org/T135647#2310375 (10Gehel) [20:06:14] bblack: Right, that works too. [20:06:15] probably no need to pollute cookie-space [20:06:31] Krinkle: rather "Set cookie weblab: lazy=A" [20:06:37] That means you'll fragment Varnish cache by 1000 instead of 2 for lazy-load, though. [20:06:46] (and any other feature). [20:06:58] And means we enable it in MW instead of Varnish [20:07:00] We have one cookie, it assigns a sample bucket (1000 possible values or whatever). And then per-experiment, we say we're assigning buckets 301-400 to experimentX, which turns on a feature header + cache split for them) [20:07:00] That part is nice. [20:07:15] Yeah, that sounds good. [20:07:20] varnish<->client has bucket cookie [20:07:21] we don't have to fragment on buckets. we can change fragmenting as we add and remove experiments [20:07:31] varnish<->backend doesn't have it and has the feature cookie or header instead. [20:07:44] right [20:07:53] anyway, I don't know what weblab is or what prompted the conversation :) [20:07:55] easier with just a header rather than a cookie [20:08:00] (03PS4) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:08:10] Krinkle: it's a thought experiment on how we stop having this conversation every time someone wants to test a feature [20:08:26] set up generic infrastructure for it, to make it easy [20:08:34] weblab=abtest [20:09:01] Krinkle: want to update https://phabricator.wikimedia.org/T135762 [20:09:03] ? 
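The SampleBucket scheme recapped above can be sketched as follows: reuse the client's bucket cookie if present, otherwise draw one and persist it with Set-Cookie, then translate bucket ranges into per-experiment feature headers for the backend. This is a Python illustration, not VCL. `SampleBucket`, `X-MW-Lazy`, and the 1-200 and 301-400 ranges come from the conversation; `X-ExperimentX`, the table layout, and the function names are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple

BUCKETS = 1000  # "1000 possible values or whatever"

# Hypothetical experiment table: feature header -> inclusive bucket range.
EXPERIMENTS = {
    "X-MW-Lazy": (1, 200),
    "X-ExperimentX": (301, 400),
}


def parse_bucket(raw: Optional[str]) -> Optional[int]:
    """Return the bucket number if the cookie value is valid, else None."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None
    return value if 1 <= value <= BUCKETS else None


def handle_request(
    cookies: Dict[str, str], rng: random.Random
) -> Tuple[Dict[str, str], Optional[str]]:
    """Return (feature headers for the backend, Set-Cookie value or None)."""
    bucket = parse_bucket(cookies.get("SampleBucket"))
    set_cookie = None
    if bucket is None:
        # (a) generate a bucket, (c) persist it in the response
        bucket = rng.randint(1, BUCKETS)
        set_cookie = f"SampleBucket={bucket}"
    # (b) only the feature headers travel to the cache/app layer
    headers = {
        name: "ON"
        for name, (low, high) in EXPERIMENTS.items()
        if low <= bucket <= high
    }
    return headers, set_cookie
```

Because only the feature headers reach the backend, the cache would split per active experiment rather than per bucket, which matches bblack's remark that fragmenting can change as experiments are added and removed.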
[20:09:15] Krinkle: cause i did not understand your suggestion fully [20:09:33] bblack: that ticket means we do not have to repeat this conversation [20:10:01] nuria_: I'm making a new ticket now [20:10:04] we can link it into that one [20:10:11] Krinkle: ah wait now I get it -> slow me [20:10:27] oh sorry, I thought you were linking the lazyload ticket [20:10:33] Krinkle: I just to read what you wrote 10 times [20:10:39] editing your ticket :) [20:11:12] * i just had to read what you wrote [20:11:16] (03PS5) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:11:18] bblack: ok [20:12:14] (03CR) 10Aaron Schulz: [C: 032] Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [20:13:02] (03Merged) 10jenkins-bot: Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [20:13:16] (03PS2) 10Dzahn: temp. setup to use db2007 for RT upgrade test [puppet] - 10https://gerrit.wikimedia.org/r/289725 (https://phabricator.wikimedia.org/T119112) [20:14:39] (03CR) 10Dereckson: [C: 031] "Looks good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/289656 (https://phabricator.wikimedia.org/T135736) (owner: 10Huji) [20:15:20] !log Stopping Cassandra on praseodymium.eqiad.wmnet : T126629 [20:15:20] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:15:22] (03PS2) 10Dereckson: Make SUL icons square and use global defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:29] (03CR) 10Dereckson: [C: 031] Make SUL icons square and use global defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:15:47] (03PS6) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:15:56] (03PS3) 10Dzahn: temp. setup to use db2007 for RT upgrade test [puppet] - 10https://gerrit.wikimedia.org/r/289725 (https://phabricator.wikimedia.org/T119112) [20:16:02] (03CR) 10Dzahn: [C: 032] temp. setup to use db2007 for RT upgrade test [puppet] - 10https://gerrit.wikimedia.org/r/289725 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [20:16:14] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Lowered to 5 (duration: 00m 29s) [20:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:31] !log Upgrading Cassandra to 2.2.6 on praseodymium.eqiad.wmnet : T126629 [20:16:32] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:16] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[20:17:56] (03PS3) 10Dereckson: Make SUL icons square and use global defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:18:11] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310391 (10BBlack) From irc conversation in wikimedia-operations w/ @Nuria, @Krinkle, and myself. This is the varnish-level pseudo-code proposed (ignore arbitrary names and con... [20:18:14] (03CR) 10Dereckson: [C: 031] "PS3: optipng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:18:26] nuria_: Krinkle: https://phabricator.wikimedia.org/T135762#2310391 [20:18:29] bblack, Krinkle super thanks for the long conversation [20:19:18] I missed an else-clause, comment edited now heh [20:19:58] !log praseodymium.eqiad.wmnet upgraded and online : T126629 [20:19:59] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:38] (03PS7) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [20:21:36] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:23:58] (03CR) 10Dereckson: "Next step is to add this patch to one of our deployment windows." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:25:35] csteipp: Hi do you mind if i add this test https://gerrit.wikimedia.org/r/#/c/288818/ to centralauth. [20:25:51] Currently centralauth is not being linited by anything for php. 
[20:25:56] jdlrobson ^^ [20:26:09] csteipp: you're welcome to join our conversation at #wikimedia-codereview ;) [20:26:13] thanks for checking paladox :) [20:26:21] Are we having CR office hours again? [20:26:38] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310419 (10Nuria) Looks great, one minor nit: rather than having distinct cookies per feature we can have one weblab cookie that contains all features and bucketing for those.... [20:26:45] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310422 (10BBlack) Other nits and notes: 1. Don't send the cookie to the applayer, just the feature header 2. Validate the cookie's value, clear+reset if invalid 3. Block the f... [20:26:56] bawolff yes [20:27:03] and your welcome jdlrobson [20:27:10] paladox: I'm fine with adding that, as long as nothing breaks. [20:27:17] bblack: added more info cc Krinkle let me know if it makes sense, super thanks again [20:27:37] csteipp: Ok, thanks it wont wont break anything it will only be used as testing purpose [20:27:45] Hmm, its just adding lint, that should be fine [20:27:51] Ok, thanks [20:28:12] !log Stopping Cassandra on restbase-test2001.codfw.wmnet : T126629 [20:28:13] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:20] !log Upgrading Cassandra to 2.2.6 on restbase-test2001.codfw.wmnet : T126629 [20:29:22] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:50] twentyafterfour, are you deploying? [20:30:05] Krenair: nope did phab go dark for you? [20:30:10] nuria_: what's the point of making a lazy=off bin? 
if we turn on lazy for 5%, the other 95% are lazy=off [20:30:27] bblack: but then you are running into another experiments [20:30:35] chasemp, no, I want to change the live MW code to debug a VE issue [20:30:44] nuria_: it's set by bin-number, which are all tracked in one place [20:30:49] bblack: so for further analysis you want users under same conditions [20:31:00] nuria_: bins 100-150 = 5% for lazy. bins 200-250 = 5% for other-experiment, etc... [20:31:18] nuria_: oh, you mean so you don't have to discount the other bins? [20:31:19] Krenair: I'm done deploying [20:31:23] thanks [20:31:36] (03CR) 10Dereckson: [C: 031] "This patch is no-op, as ca75365aa removed $wmgUseClusterSession." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) (owner: 10Sbisson) [20:31:39] bblack: at the time of analysis though i need to know this user is on "B" lazy group (i.e. control group for lazy) [20:31:43] !log restbase-test2001.codfw.wmnet upgraded and online : T126629 [20:31:45] nuria_: ok, makes sense, they have different start/end times, right... [20:31:45] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:50] bblack: exactly [20:32:19] nuria_: will every experiment have a control-split like this? just trying to think out how to structure the data/flags [20:32:19] bblack: cause otherwise it is not a control versus "enabled" group, so you need both bits of info to analyze data [20:32:24] bblack: yes [20:32:44] bblack: that bit is just for analysis, the app layer only cares about turning features on [20:32:53] bblack: we will pouplate all those to x-analytics [20:33:04] nuria_: well, I assume the same header we send to the app, we send to X-Analytics, yeah... 
[20:33:04] bblack: all lazy=A and lazy=B [20:33:12] bblack: right [20:34:17] nuria_: do we want explicit support for mixed experiments, or I guess you could term it as more subdivisions. [20:34:29] bblack: i am going to say no for now [20:34:48] bblack: if we get single experiments with this splits well we are GOLDEN for now [20:34:55] (03PS2) 10Dereckson: Enable experimental Video.js player on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 (owner: 10TheDJ) [20:35:01] e.g. allocate a 10% bin to this one effort, but within we have 2.5% divisions as lazy=A, B, C, D. A has lazyimages, B has lazyrefs, C has lazyimages+lazyrefs, D is control [20:35:03] (03CR) 10Dereckson: "Okay. Next step is to include it to a SWAT window on https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 (owner: 10TheDJ) [20:35:14] !log iridium try to block vandal by ip temp so puppet disable and edit of /etc/apache2/phabbanlist.conf [20:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:32] !log Stopping Cassandra on restbase-test2002.codfw.wmnet : T126629 [20:35:33] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:49] bblack: if we do well on A/B splits we can move on onto more complicated ones [20:35:56] bblack: analysis wise [20:36:15] !log Upgrading Cassandra to 2.2.6 on restbase-test2002.codfw.wmnet : T126629 [20:36:16] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:36:17] just trying to architect it better up-front so it's not a big effort later to switch [20:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:31] bblack: are you going to get to this instead of the Ips changes? 
(no pressure just asking) [20:36:32] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: slight update to the debug logging to try to find the callers (duration: 00m 31s) [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:57] nuria_: what about precision and #experiments in parallel, which determines how many total bins we need? [20:37:15] nuria_: I mean, I'd like to keep the bin-count small to avoid using it as a proxy to help ID users. [20:37:25] okay,
  • ApiVisualEditor.php line 615 calls ApiVisualEditor->storeInSerializationCache()
[20:37:41] Hi. [20:37:45] nuria_: is 1000 bins enough? you can subdivide on 0.1% with that? [20:37:49] ostriches: could you add a new week to https://wikitech.wikimedia.org/wiki/Deployments? [20:38:15] bblack: plenty for our user base at large [20:38:41] we could go 100 too, but then you're stuck with a minimum sub-set of 1% obviously [20:38:43] bblack: size of bin depends on effect of experiment [20:38:50] !log restbase-test2002.codfw.wmnet upgraded and online : T126629 [20:38:51] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:59] well there's only 100% total no matter what, so it's mostly about how small a bin we'll ever need [20:39:23] (how small a sub-bin, just for lazy=A) [20:40:05] bblack: but given our numbers (global userbase) i'd be comfortable with say tests that reach 1% [20:40:42] bblack: we can start with 10% to keep numbers coarse, that means a split of a 100 on user base. does that sound ok? [20:40:57] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.53 seconds [20:41:05] bblack: sorry, not clear enough [20:41:23] bblack: dividing our user base in 100 buckets is a good start i think. [20:41:40] which gives you at best 1% precision on bucketing, including the control/feature split [20:41:59] !log Stopping Cassandra on restbase-test2003.codfw.wmnet : T126629 [20:42:00] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:09] bblack: right [20:42:33] is the control/feature split always 50% within the assigned bucket?
[20:42:38] !log Upgrading Cassandra to 2.2.6 on restbase-test2003.codfw.wmnet : T126629 [20:42:39] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:58] bblack: for now let's say yes, [20:43:20] bblack: i will add leila to ticket in case she wants to comment with more detail [20:43:23] I guess worst-case, we can come up with a better scheme later on, and wipe the old cookies and start setting new ones with a new name [20:43:27] ok [20:43:35] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [20:44:18] nuria_: I'll re-work my example pseudo-code into something closer based on the above... [20:44:24] ^ ok, that has to do with a host i added to site.pp [20:44:49] !log restbase-test2003.codfw.wmnet upgraded and online : T126629 [20:44:50] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:03] bblack: sounds good [20:45:20] !log RESTBase staging Cassandra upgrade complete : T126629 [20:45:21] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:36] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [20:47:25] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 
(expecting: 200) [20:47:54] ^^^ looking [20:48:14] (03CR) 10Dereckson: "Change 289656 takes care of the botadmin issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289645 (https://phabricator.wikimedia.org/T135725) (owner: 10Huji) [20:48:25] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [20:49:19] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: update to the debug logging to try to find other params (duration: 00m 28s) [20:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:47] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [20:50:22] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: (no message) (duration: 00m 25s) [20:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:06] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [20:51:07] gwicke, we should always have a valid ETag if we got a page from RB, right? 
[20:51:08] (03PS1) 10Rush: labs bastions cgroup for /shared/bin/node [puppet] - 10https://gerrit.wikimedia.org/r/289734 (https://phabricator.wikimedia.org/T131541) [20:51:29] (03PS2) 10Rush: labs bastions cgroup for /shared/bin/node [puppet] - 10https://gerrit.wikimedia.org/r/289734 (https://phabricator.wikimedia.org/T131541) [20:51:43] Krenair: yes, I think so [20:52:22] (03PS1) 10Dzahn: requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) [20:52:29] (03CR) 10BryanDavis: [C: 031] labs bastions cgroup for /shared/bin/node [puppet] - 10https://gerrit.wikimedia.org/r/289734 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [20:53:37] gwicke, it's difficult to debug this without having a known client-side way of replicating the issue [20:53:42] (03PS2) 10Dzahn: requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) [20:53:56] (03CR) 10Rush: [C: 032] labs bastions cgroup for /shared/bin/node [puppet] - 10https://gerrit.wikimedia.org/r/289734 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [20:54:20] Krenair: -> #mediawiki-services? [20:56:50] (03PS3) 10Dzahn: requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) [20:56:52] (03CR) 10jenkins-bot: [V: 04-1] requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [20:57:06] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [20:58:17] ostriches, deploying? [20:58:35] (03CR) 10jenkins-bot: [V: 04-1] requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [20:59:58] Krenair, or is that you deploying? 
[21:00:32] I had a one line wfDebugLog call change deployed, will put it back to how it was [21:00:47] (03CR) 10Bartosz Dziewoński: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [21:00:53] Krenair, i'm about to sync tilerator (git deploy), shouldn't affect anything [21:01:06] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: (no message) (duration: 00m 27s) [21:01:06] (03PS4) 10Dzahn: requesttracker: use test db if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) [21:01:12] yeah but I needed to get rid of my change anyway [21:01:13] :) [21:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:01:36] Krenair, i don't think we will conflict with that depl :) [21:02:46] !log deployed tilerator service update https://gerrit.wikimedia.org/r/#/c/289736/ [21:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:16] (03CR) 10Dzahn: [C: 032] "no-op on magnesium http://puppet-compiler.wmflabs.org/2852/" [puppet] - 10https://gerrit.wikimedia.org/r/289735 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [21:05:22] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310635 (10BBlack) Better pseudo-code, after more conversation: Data structure (which we update as we add/remove experiments): ``` experiments => { # 100 total bins to use: 0-9... [21:10:19] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310677 (10BBlack) If the `setRequestHeader` bits didn't use else-if, you could have overlapping buckets with multiple features in play too, but this seems simpler for the momen... 
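[editor's note] The bucketing scheme discussed above (one bucket cookie between client and Varnish, 100 bins, a bin range assigned per experiment, a 50% feature/control split inside that range, and only a feature header sent on to the applayer) can be sketched roughly as follows. This is not the pseudo-code from T135762, which is truncated in this log. It is a minimal Python illustration, and the names `NUM_BINS`, `EXPERIMENTS`, `assign_bin`, and `feature_header`, as well as the bin ranges, are assumptions made up for the example.

```python
import random

# 100 bins gives 1% precision, as settled on in the conversation.
NUM_BINS = 100

# Hypothetical experiment table: name -> inclusive (first_bin, last_bin).
# The ranges are invented for illustration only.
EXPERIMENTS = {
    "lazy": (10, 19),
    "other": (20, 29),
}


def assign_bin(cookie_value):
    """Return (bin, needs_set_cookie).

    A missing or invalid cookie value is replaced by a fresh random bin,
    in which case the caller should send Set-Cookie to persist it.
    """
    try:
        bin_number = int(cookie_value)
    except (TypeError, ValueError):
        bin_number = -1
    if 0 <= bin_number < NUM_BINS:
        return bin_number, False
    return random.randrange(NUM_BINS), True


def feature_header(bin_number):
    """Map a bin to 'name=A' (feature on), 'name=B' (control), or None.

    The first half of each experiment's range is the feature group and
    the second half is the control group (a fixed 50% split).
    """
    for name, (first, last) in EXPERIMENTS.items():
        if first <= bin_number <= last:
            midpoint = first + (last - first + 1) // 2
            return f"{name}=A" if bin_number < midpoint else f"{name}=B"
    return None
```

In this sketch the value returned by `feature_header` is what would be forwarded to the applayer and to X-Analytics, while the bin number itself stays in the client cookie. Because each bin matches at most one range, a user is in at most one experiment, matching the "no mixed experiments for now" decision.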
[21:11:32] (03PS8) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [21:11:39] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310694 (10BBlack) Another complication that didn't come up in conversation earlier: what about domainnames for these cookies? They'll be getting binned independently for every... [21:12:45] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [21:14:44] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310716 (10BBlack) Another thought: with the above code, I've intentionally set it so that both sides of the experiment will initially share an empty cache split (they'll have t... [21:16:56] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [21:22:25] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.74 seconds [21:28:15] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: more debug logs (duration: 00m 25s) [21:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:48] (03PS2) 10Jforrester: Drop already-enabled VisualEditorNewAccountEnableProportion wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 [21:28:56] (03PS2) 10Jforrester: Follow-up 6dbf876: Move VisualEditor to secondary status on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) [21:29:09] (03CR) 10Jforrester: [C: 031] "Now ready to go in SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) (owner: 10Jforrester) [21:29:18] (03CR) 10Jforrester: [C: 031] "Ready to go whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 (owner: 10Jforrester) [21:33:11] jouncebot: next [21:33:12] In 1 hour(s) and 26 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T2300) [21:39:18] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [21:41:42] 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2310845 (10Jalexander) [21:44:57] (03PS1) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 [21:51:11] (03PS2) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 [21:55:15] 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2310969 (10tomasz) p:05Triage>03High [21:56:16] (03PS1) 10Jalexander: Add Account throttle exception for SF Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289777 (https://phabricator.wikimedia.org/T135777) [21:56:53] 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2310845 (10tomasz) Tentatively triaging as high as per suggestions on Phabricator priority assignment on MediaWiki wiki. 
[22:05:31] (03PS1) 10CSteipp: Redo local password enforcement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289780 (https://phabricator.wikimedia.org/T119736) [22:13:54] !log krenair@tin Synchronized php-1.28.0-wmf.2/extensions/VisualEditor/ApiVisualEditor.php: rv for now, have another idea for later (duration: 00m 40s) [22:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:46] (03PS3) 10TheDJ: Enable experimental Video.js player on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 [22:37:11] (03CR) 10Dereckson: [C: 031] Add Account throttle exception for SF Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289777 (https://phabricator.wikimedia.org/T135777) (owner: 10Jalexander) [22:40:08] MatmaRex: https://phabricator.wikimedia.org/T135773#2311079 [22:40:36] blergh [22:40:36] MatmaRex: this is for the reimplementation to come in OOUI or for now? [22:40:45] Dereckson: hm? [22:41:27] i don't understand what you're asking [22:41:44] Danny_B notes there is something wrong with the messages. That's for the OOUI version or also the former one? [22:41:55] Dereckson: i agree with thedj that the current OOUI version is… highly suboptimal. but i think it can be done well, and that we should do it (better) at some point [22:42:04] Dereckson: no, i think that was a problem with the OOUI version [22:42:07] k [22:42:30] and was undone in https://gerrit.wikimedia.org/r/#/c/289770/2/languages/i18n/en.json,cm [22:42:32] Dereckson: only ooui [22:43:07] eh, i didn't look at this change before. [22:43:29] i wonder how many translations were "updated" in the meantime [22:43:42] i know about cs ;-) [22:43:49] we can check twn [22:45:22] I've CR+2ed it, so Zuul will have launched the Jenkins jobs and merged it for SWAT. 
[22:47:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [22:47:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [22:52:37] MatmaRex: at least 28 langs [22:52:56] (based on last 5000 edits on twn) [22:53:13] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:54:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:58:28] 22:50:56 < grrrit-wm> (Merged) jenkins-bot: Revert "Convert Special:WhatLinksHere from XML form to OOUI form" [core] (wmf/1.28.0-wmf.2) - https://gerrit.wikimedia.org/r/289772 (https://phabricator.wikimedia.org/T135773) (owner: TheDJ) [22:58:46] Good. That's ready. [23:00:04] RoanKattouw ostriches Krenair awight Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160519T2300). Please do the needful. [23:00:04] MatmaRex James_F thedj Jamesofur: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] hmm, i think i'm actually going to move my swat to monday. should be safe, but i don't really want to do it before the weekend. [23:00:11] Dereckson: ^ [23:00:23] MatmaRex: k [23:00:43] Okay I can swat this evening and let's start with this thedj fix. [23:00:44] i'm trying to edit the Deployments page. but my internet connection is all busted today [23:00:46] thedj: ping? [23:01:10] i'm here [23:01:16] MatmaRex: note if something is wrong [23:01:28] MatmaRex: there is the friday to add an emergency fix... [23:01:39] * James_F is here, BTW. [23:01:45] Dereckson: but i'm taking friday off. 
:) [23:01:51] k [23:03:57] oh oh [23:04:22] Let's do the config changes first, https://gerrit.wikimedia.org/r/#/c/289772/1/languages/i18n/en.json will require a full scap. [23:05:00] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 (owner: 10TheDJ) [23:05:18] * Jamesofur is here ftr [23:05:47] (03Merged) 10jenkins-bot: Enable experimental Video.js player on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 (owner: 10TheDJ) [23:05:53] hi Jamesofur, noted [23:09:12] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable experimental Video.js player on test2wiki (1/2) (duration: 00m 30s) [23:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:58] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable experimental Video.js player on test2wiki (1/2) (duration: 00m 26s) [23:10:39] Dereckson: i can take responsibility for the whatlinkshere revert, if thedj doesn't reappear [23:10:50] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Enable experimental Video.js player on test2wiki (2/2) (duration: 00m 27s) [23:10:52] thedj: please test ^ [23:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:00] MatmaRex: 23:01:10 < thedj> i'm here [23:11:24] oh, sorry. i misread. 
:D [23:11:51] (03PS2) 10Dereckson: Add Account throttle exception for SF Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289777 (https://phabricator.wikimedia.org/T135777) (owner: 10Jalexander) [23:11:59] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289777 (https://phabricator.wikimedia.org/T135777) (owner: 10Jalexander) [23:12:44] (03Merged) 10jenkins-bot: Add Account throttle exception for SF Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289777 (https://phabricator.wikimedia.org/T135777) (owner: 10Jalexander) [23:12:44] Dereckson: checking.. [23:13:11] thedj: It dosent seem to be working https://test2.wikipedia.org/wiki/File:2012-07-18_Market_Street_-_San_Francisco.webm [23:13:32] !log dereckson@tin Synchronized wmf-config/throttle.php: Add Account throttle exception for SF Edit-a-thon (T135777) (duration: 00m 27s) [23:13:33] T135777: Add IP to account creation whitelist for Yerba Buena Center for the Arts editathon - https://phabricator.wikimedia.org/T135777 [23:13:37] Jamesofur: done ^ [23:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:47] paladox: i think it's due to the parser cache [23:13:55] which is sort of to be expected... [23:13:58] thedj: Oh, ok [23:14:03] either that, or... [23:14:13] try with ?debug=true ? [23:14:31] yeah debug=true makes it work [23:14:50] strange. either a RL setup issue, or cache busting is meh... [23:15:07] anyway, doesn't really matter. that's why we are enabling it for test2.wp. [23:15:15] Well [23:15:24] Dereckson: thanks, do you know if you deleted the memcache key for that? 
I can do it myself on terbium but don't want to if already done (says it should be for https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold just given how soon the event is) [23:15:26] there is a five minute delay when we enable stuff like Echo [23:16:02] there it is [23:16:07] fancy :) [23:16:34] thedj: Could we make the play button be in the center please [23:17:00] paladox: actually there were lots of requests for the old player to move it away from the center [23:17:07] Oh [23:17:10] OK [23:17:16] since that is usually what the camera focuses on [23:17:31] you don't want to plaster a play button on top of a face rly [23:17:38] Jamesofur: no key enwiki:acctcreate:ip:209.116.58.194 [23:17:59] Oh [23:18:49] thedj: I wonder how we can get the support to if you click on an unsupported source [23:18:56] you should be able to choose another [23:18:57] one [23:19:00] without refreshing [23:20:19] thedj: so it's fine? [23:20:24] Dereckson: yes. [23:20:34] paladox: let's not discuss that here during a swat [23:20:46] James_F: okay, you're next [23:21:25] Kk. [23:21:39] Dereckson: thanks [23:21:45] You're welcome. [23:23:01] (03PS3) 10Dereckson: Follow-up 6dbf876: Move VisualEditor to secondary status on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) (owner: 10Jforrester) [23:23:12] thedj: Ok [23:23:23] (03CR) 10Dereckson: [C: 032] "PS3: adding task id, rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) (owner: 10Jforrester) [23:24:07] 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2311162 (10Jalexander) 05Open>03Resolved a:03Dzahn done, thanks! 
[23:24:10] (03Merged) 10jenkins-bot: Follow-up 6dbf876: Move VisualEditor to secondary status on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) (owner: 10Jforrester) [23:25:00] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Move VisualEditor to secondary status on English Wikipedia (T132806) (duration: 00m 29s) [23:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:17] T132806: Enwiki is a wikitext-primary site, and the visual editor is the primary editor for new logged-in editors - https://phabricator.wikimedia.org/T132806 [23:25:24] James_F: ^ [23:26:19] One sec. [23:27:17] Dereckson: Yup, looks good. [23:27:37] k [23:27:41] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 (owner: 10Jforrester) [23:27:49] (03PS3) 10Dereckson: Drop already-enabled VisualEditorNewAccountEnableProportion wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 (owner: 10Jforrester) [23:27:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 (owner: 10Jforrester) [23:28:42] (03Merged) 10jenkins-bot: Drop already-enabled VisualEditorNewAccountEnableProportion wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289653 (owner: 10Jforrester) [23:29:26] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Drop already-enabled VisualEditorNewAccountEnableProportion wikis (duration: 00m 27s) [23:29:28] James_F: here you are ^ [23:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:37] Thank you [23:30:48] You're welcome. 
[23:33:03] !log dereckson@tin Started scap: Revert "Convert Special:WhatLinksHere from XML form to OOUI form" ([[Gerrit:289772]], T135773) [23:33:04] T135773: [Regression] Special:WhatLinksHere is unusable - https://phabricator.wikimedia.org/T135773 [23:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:26] thedj: in 20-40 minutes you'll be able to test that [23:35:03] wow, takes that long to scap ? [23:35:26] l10n builds are not fast [23:35:34] (03PS1) 10Dzahn: RT: loading mod_fcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 [23:36:55] around 13 minutes to rebuild LocalisationCache generally [23:37:30] (03CR) 1020after4: "responding to inline comments. I'll upload a new patch that addresses them shortly." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [23:37:32] you're lucky, we only need to rebuild it for wmf/1.28.0-wmf.2 the Thursday. [23:39:09] hehe [23:40:00] (03PS1) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 [23:42:58] thedj: cache rebuilt, we're sending it to the servers now [23:43:45] (it = the full 4 Gb of MediaWiki+extensions code) [23:51:27] Dereckson: confirmed ok [23:51:44] for? [23:51:49] en.wp [23:51:52] sync-apaches: 16% (ok: 67; fail: 0; left: 334) [23:52:23] Good news. But we've still servers to send the update. [23:53:09] your test worked because you were on one of the sixty servers already updated [23:53:12] i notice