[00:00:13] that's why I'm asking, I have the extension already [00:00:31] MaxSem, https://test.wikipedia.org/w/index.php?title=Special:ZeroRatedMobileAccess&zcmd=img-banner&zlang=en&X-CS=TEST [00:00:34] try that [00:00:59] hmm, works [00:02:31] MaxSem, you got an image? [00:02:35] if so, push to all [00:03:31] !log maxsem@tin Synchronized php-1.28.0-wmf.1/extensions/ZeroBanner: https://gerrit.wikimedia.org/r/#/c/288236/ (duration: 00m 26s) [00:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:02] !log maxsem@tin Synchronized php-1.27.0-wmf.23/extensions/ZeroBanner: https://gerrit.wikimedia.org/r/#/c/288236/ (duration: 00m 25s) [00:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:13] yurik, plz test [00:05:34] err, gotta register these in core depl branches [00:07:12] MaxSem, testing... [00:08:54] MaxSem, all's good, thx :) [00:09:15] MaxSem, you want to kill obsolete wmfconfigs as well? [00:10:11] not now [00:10:13] ok [00:10:19] actually i'm not sure if there is anything there [00:10:25] window is over, and we fixed what was needed :) [00:10:43] MaxSem, thanks!!! [00:11:05] checking logstash... [00:12:14] MaxSem, all's good, logstash is all clear now. Now it is Redis' turn to cause issues :) [00:13:29] RoanKattouw, redis is having many issues in logstash fatalmon, is it normal after scap? 
[00:15:03] yurik, Echo borked it [00:15:13] 57826 errors 0_0 [00:15:48] MaxSem, echo was deployed 15 min before that according to those little tag icons [01:22:33] Yeah it's a known bug in wmf1 but my Echo change seems to have spread it to wmf23 [01:23:44] Ooh I think I understand why [01:24:05] It must be where I use -1 as a "no timestamp present" value [01:24:10] And redis doesn't deal well with -1 [01:33:35] I think https://phabricator.wikimedia.org/T134923#2287485 is the fix we need [01:33:48] I'll put that up as a patch after dinner [01:34:04] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2287934 (10Dzahn) I tried a couple different combinations with Requires= , After=, Before= etc in both unit files but i haven't found a combination yet that res... [01:34:42] I thought about modifying Echo to stop writing -1 to the cache but that doesn't help because this error occurs when retrieving stuff from the cache [02:30:01] !log mwdeploy@tin scap sync-l10n completed (1.27.0-wmf.23) (duration: 09m 28s) [02:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:09] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 07m 23s) [02:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 12 03:05:39 UTC 2016 (duration 9m 30s) [03:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:47] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [03:11:48] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [03:28:15] (03CR) 10BBlack: [C: 031] acme-setup: only accept '^[-a-zA-Z0-9_]+$' as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [03:38:26] 
RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [03:38:26] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [03:54:26] !log catrope@tin Synchronized php-1.27.0-wmf.23/extensions/Flow/includes/Notifications/MentionPresentationModel.php: Fix Flow fatal (duration: 00m 26s) [03:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:37:14] !log catrope@tin Synchronized php-1.28.0-wmf.1/includes/objectcache/RedisBagOStuff.php: Fix unserialization of negative numbers (duration: 00m 31s) [04:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:42:32] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to preview or save pages - https://phabricator.wikimedia.org/T134869#2280304 (10MSJapan) I would just like to add that I've having the same issue, but it occurs when saving almost any edit, as well as trying to XfD in Twinkle, and I'm... [04:56:10] !log catrope@tin Synchronized php-1.27.0-wmf.23/includes/objectcache/RedisBagOStuff.php: Fix unserialization of negative numbers (duration: 00m 32s) [04:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:56:40] Looks like that fixed it [04:56:47] Watching the error stream stop in real time was cool [05:25:44] <_joe_> RoanKattouw: cool :) [05:31:31] 06Operations, 10OCG-General, 06Scrum-of-Scrums, 06Services, 07Technical-Debt: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2288227 (10Joe) >>! In T120079#2281953, @cscott wrote: > Ok, looks like the script is fixed (see above). Of course, the entries all... 
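Editor's note on the Redis errors discussed above: the log says only that Echo caches -1 as a "no timestamp present" value, that the error occurs when retrieving values from the cache, and that the deployed fix was "Fix unserialization of negative numbers" in RedisBagOStuff.php. The sketch below is one plausible way such a bug could arise, not the actual MediaWiki code. It assumes a cache wrapper that stores integers as bare strings and everything else as a tagged blob, and a reader that detects integers with a digits-only check. All names (`encode`, `naive_decode`, `fixed_decode`, the `blob:` prefix) are hypothetical.

```python
import json

BLOB_PREFIX = "blob:"  # hypothetical tag for non-integer values


def encode(value):
    """Store ints as bare strings, everything else as a tagged JSON blob."""
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    return BLOB_PREFIX + json.dumps(value)


def decode_blob(raw):
    """Decode a tagged blob; reject anything that lacks the tag."""
    if not raw.startswith(BLOB_PREFIX):
        raise ValueError(f"not a serialized blob: {raw!r}")
    return json.loads(raw[len(BLOB_PREFIX):])


def naive_decode(raw):
    """Assumed buggy reader: a digits-only test misses the '-' in '-1'."""
    if raw.isdigit():
        return int(raw)
    return decode_blob(raw)  # '-1' ends up here and fails


def is_plain_int(raw):
    """True only if raw is exactly what str() of some int would produce."""
    try:
        return str(int(raw)) == raw
    except ValueError:
        return False


def fixed_decode(raw):
    """Reader that round-trips the string, so negative ints are accepted."""
    if is_plain_int(raw):
        return int(raw)
    return decode_blob(raw)


if __name__ == "__main__":
    stored = encode(-1)
    print(fixed_decode(stored))
    try:
        naive_decode(stored)
    except ValueError as err:
        print("naive reader failed:", err)
```

Under these assumptions the write side is fine and only the read side fails, which matches the remark in the log that changing Echo to stop writing -1 would not help with values already in the cache.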
[05:53:59] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (and SSL) - https://phabricator.wikimedia.org/T134811#2288254 (10Joe) I would like to keep session management in a pretty confined environment - I would frankly avoid using the cassandra cluster we use for... [06:02:31] (03PS1) 10Ema: Add debian/patches/0005-handle-eof-http1.1.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/288350 (https://phabricator.wikimedia.org/T134989) [06:31:24] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:55] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:55] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] (03PS4) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [06:37:52] (03CR) 10Ema: [C: 032 V: 032] Add debian/patches/0005-handle-eof-http1.1.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/288350 (https://phabricator.wikimedia.org/T134989) (owner: 10Ema) [06:41:55] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp3007 is CRITICAL: Connection refused [06:41:55] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp3007 is CRITICAL: Connection refused [06:42:41] !log depooling cp3007 [06:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:44:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "I just mistakenly set -1 instead of +1" [puppet] - 10https://gerrit.wikimedia.org/r/283175 (owner: 10Muehlenhoff) [06:49:35] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp3007 is OK: HTTP OK: HTTP/1.1 
200 OK - 153 bytes in 0.167 second response time [06:49:35] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp3007 is OK: HTTP OK: HTTP/1.1 200 OK - 153 bytes in 0.167 second response time [06:50:06] !log repooling cp3007 (ran puppet after varnish upgrade) [06:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:23] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:23] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:54] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:54] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:15] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (and SSL) - https://phabricator.wikimedia.org/T134811#2288280 (10mobrovac) >>! In T134811#2288254, @Joe wrote: > I would like to keep session management in a pretty confined environment - I would frankly av... [07:04:46] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288293 (10elukey) Re-checked mc1009 stats today and it looks ok: evictions keep staying 0, hit get ratio is now 0.86 and current items stored are growing. Be... 
[07:05:38] !log running varnish 4.1.2-1wm3 in misc esams (T134989) [07:05:38] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [07:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:33] (03PS5) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [07:32:40] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2288343 (10Gehel) a:05Gehel>03None [07:41:55] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2288355 (10Gehel) [07:41:57] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2288354 (10Gehel) 05Open>03Resolved [07:47:03] !log upgrading misc codfw to 4.1.2-1wm3 and wiping caches (T134989) [07:47:04] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [07:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:48:33] !log testing schema change T73563 on db1040 [07:48:34] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [07:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:40] !log upgrading misc ulsfo to varnish 4.1.2-1wm3 and wiping caches (T134989) [07:51:41] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [07:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:52:15] ema is in serious business today! 
[07:52:43] and he had no coffee yet! [07:53:22] that's not acceptable! [07:53:37] let's fix that [07:55:32] PROBLEM - MariaDB Slave Lag: s4 on db1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.01 seconds [07:55:32] PROBLEM - MariaDB Slave Lag: s4 on db1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.01 seconds [07:57:18] it didn't work [07:58:57] (03PS6) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [08:02:44] RECOVERY - MariaDB Slave Lag: s4 on db1040 is OK: OK slave_sql_lag Replication lag: 1.20 seconds [08:02:44] RECOVERY - MariaDB Slave Lag: s4 on db1040 is OK: OK slave_sql_lag Replication lag: 1.20 seconds [08:14:04] (03PS3) 10Muehlenhoff: Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175 [08:14:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175 (owner: 10Muehlenhoff) [08:14:34] (03CR) 10Jcrespo: [C: 031] MariaDB: Change mariadb::core parameters [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:16:08] (03PS2) 10Jcrespo: Cleanup my.cnf by grouping options and enable skip-slave-start [puppet] - 10https://gerrit.wikimedia.org/r/286858 [08:19:56] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [08:23:21] jynus: go ahead and merge, I'll rebase mines [08:23:26] :) [08:24:17] volans, I won't [08:24:22] it is not tested [08:24:36] and es needs to be done too [08:24:41] (or remerged) [08:25:11] I thought was tested, ok I'll go with mines then [08:27:20] (03PS4) 10Volans: MariaDB: Change mariadb::core parameters [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) [08:28:03] RECOVERY - MariaDB Slave Lag: s6 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [08:28:03] RECOVERY - MariaDB 
Slave Lag: s6 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [08:32:34] (03CR) 10Volans: [C: 032] MariaDB: Change mariadb::core parameters [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:33:24] !log wiping misc caches once again (T134989) [08:33:25] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [08:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:09] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288419 (10elukey) Also found another problem ``` ExecStart=/usr/bin/memcached -p 11211 -u nobody -m 89088 -c 25000 -l 0.0.0.0 -n 5 -f 1.15 -D :-o slab_reassig... [08:48:28] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288421 (10Joe) @elukey the correct way to handle this is to not let base::service_unit handle automatic restarts of memcached, there is an option for it [08:53:03] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288423 (10elukey) @Joe: yep better, my idea was also completely not viable since the systemd unit would get changed anyhow: ``` [Service] ExecStart=/usr/bin/m... 
[08:55:25] (03CR) 10Filippo Giunchedi: [C: 031] ircd: remove exec permission bit on unit file [puppet] - 10https://gerrit.wikimedia.org/r/288333 (owner: 10Dzahn) [08:58:09] (03CR) 10Elukey: [C: 04-1] "We'd need to disable memcached auto-restart by Puppet first, see comments starting from https://phabricator.wikimedia.org/T129963#2288419" [puppet] - 10https://gerrit.wikimedia.org/r/288230 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [09:00:34] (03CR) 10Filippo Giunchedi: [C: 031] "one nit but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [09:04:38] (03PS7) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [09:20:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3211 [09:20:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3211 [09:20:49] we have 2 icinga bots today? 
:) [09:24:28] (03PS3) 10Volans: MariaDB: tune thread-pool to avoid Aborted_connects [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) [09:25:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1896 [09:25:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1896 [09:28:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6188125 keys - replication_delay is 612 [09:28:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6188125 keys - replication_delay is 612 [09:31:32] <_joe_> grr this check has to get better [09:31:41] <_joe_> I'll write up a ticket today [09:34:13] is it really the check? I mean, replication gets behind, right? [09:35:14] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 680389 Threads: 1 Questions: 15429574 Slow queries: 3944 Opens: 720 Flush tables: 2 Open tables: 572 Queries per second avg: 22.677 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:35:14] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 680389 Threads: 1 Questions: 15429574 Slow queries: 3944 Opens: 720 Flush tables: 2 Open tables: 572 Queries per second avg: 22.677 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:35:33] <_joe_> yes, but it does print the number of keys, which means we get too many notifications [09:36:31] I have the same issue- how to do in a bad state but that is not really actionable [09:37:08] what is the best model for that? 
[09:37:58] (03CR) 10Filippo Giunchedi: "generally LGTM, haven't tried to build it yet either" (0311 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) (owner: 10Giuseppe Lavagetto) [09:38:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6144964 keys - replication_delay is 0 [09:38:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6144964 keys - replication_delay is 0 [09:39:19] (03CR) 10Volans: [C: 032] "I will apply shortly the new values live to all masters" [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [09:45:56] (03CR) 10Filippo Giunchedi: [C: 031] Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [09:47:18] !log Apply at runtime thread_pool_max_threads=2000 for all coredb masters (Gerrit 287394, T133333) [09:47:19] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [09:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:38] !log Apply at runtime thread_pool_stall_limit=10 for all coredb masters (Gerrit 287394, T133333) [09:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:45] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [09:53:34] (03PS2) 10Volans: Avoid loading my.cnf twice [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285679 (https://phabricator.wikimedia.org/T133780) [10:00:04] godog: Dear anthropoid, the time has come. Please deploy Ops / Graphite (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1000). [10:00:04] godog: A patch you scheduled for Ops / Graphite is about to be deployed. Please be available during the process. 
[10:00:41] I am indeed available [10:02:30] (03PS3) 10Filippo Giunchedi: graphite: add cluster_servers for codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/287970 (https://phabricator.wikimedia.org/T134889) [10:02:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: add cluster_servers for codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/287970 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [10:02:52] (03PS7) 10Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) [10:03:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [10:06:41] !log reload carbon-c-relay on labmon1001, noop [10:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:08:56] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [10:08:56] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [10:09:11] yeah yeah [10:09:23] !log run puppet on graphite2001 to split cassandra metrics [10:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:11] (03CR) 10Volans: [C: 032] Avoid loading my.cnf twice [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285679 (https://phabricator.wikimedia.org/T133780) (owner: 10Volans) [10:10:56] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:56] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:12:52] (03PS1) 10Volans: MariaDB: update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/288361 (https://phabricator.wikimedia.org/T133780) [10:20:30] 
!log restarted oozie, hive-* daemons on analytics1003 for java upgrades [10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:20] (03PS1) 10Jcrespo: Change binlog format for parsercaches to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/288362 (https://phabricator.wikimedia.org/T133523) [10:21:29] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [10:21:29] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [10:23:36] (03PS2) 10Jcrespo: Change binlog format for parsercaches to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/288362 (https://phabricator.wikimedia.org/T133523) [10:24:25] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:25] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:01] that's me [10:25:48] ACKNOWLEDGEMENT - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Filippo Giunchedi adding second machine [10:25:48] ACKNOWLEDGEMENT - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Filippo Giunchedi adding second machine [10:31:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [10:31:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [10:37:15] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [10:37:15] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [10:38:42] 06Operations: Contain imagemagick on the image scalers with firejail - https://phabricator.wikimedia.org/T135111#2288587 
(10MoritzMuehlenhoff) [10:38:57] 06Operations: Contain imagemagick on the image scalers with firejail - https://phabricator.wikimedia.org/T135111#2288599 (10MoritzMuehlenhoff) p:05Triage>03High a:03MoritzMuehlenhoff [10:43:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:43:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:44:32] !log testing dual masters and r/w mode on parsercache nodes [10:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:45:58] I think I will keep heartbeat only on the active datacenter [10:47:12] now waiting for things to break horribly [10:47:26] (03PS1) 10Giuseppe Lavagetto: ocg: restart processing queues on ocg1003 [puppet] - 10https://gerrit.wikimedia.org/r/288366 [10:47:36] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:47:36] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:49:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:49:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:50:06] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] Filippo Giunchedi graphite split metrics [10:50:06] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] Filippo Giunchedi graphite split metrics [10:51:52] (03Abandoned) 10Giuseppe Lavagetto: ocg: restart processing queues on ocg1003 [puppet] - 10https://gerrit.wikimedia.org/r/288366 (owner: 
10Giuseppe Lavagetto) [10:55:19] 06Operations, 10DBA, 13Patch-For-Review, 07Performance, and 2 others: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2288621 (10aaron) >>! In T133523#2279579, @jcrespo wrote: > Writes generate errors at 1000-10000 (rate per second), I am concerned ab... [10:58:35] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.142 second response time [10:58:35] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.142 second response time [11:04:15] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2266613 (10Volans) @Cmjohnson did you replace the disk? I can see it rebuilding :) ``` volans@dbstore1001:~$ sudo megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2... [11:09:45] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#2288644 (10Joe) [11:18:21] (03CR) 10Jcrespo: [C: 032] Change binlog format for parsercaches to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/288362 (https://phabricator.wikimedia.org/T133523) (owner: 10Jcrespo) [11:18:51] (03PS1) 10Filippo Giunchedi: graphite: bump uwsgi processes [puppet] - 10https://gerrit.wikimedia.org/r/288368 (https://phabricator.wikimedia.org/T134889) [11:23:31] (03PS2) 10Filippo Giunchedi: graphite: bump uwsgi processes [puppet] - 10https://gerrit.wikimedia.org/r/288368 (https://phabricator.wikimedia.org/T134889) [11:23:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: bump uwsgi processes [puppet] - 10https://gerrit.wikimedia.org/r/288368 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [11:24:12] jynus: I'm merging your change too btw [11:24:36] jynus: ack? 
[11:27:25] godog: go ahead, that change is ok and it's already applied live [11:27:34] if it's https://gerrit.wikimedia.org/r/288362 [11:27:52] that you're referring to [11:30:20] yeah that, thanks volans ! [11:30:57] yw :) [11:34:52] 06Operations, 10DBA, 13Patch-For-Review, 07Performance, and 2 others: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2288724 (10jcrespo) I've just applied the STATEMENT/REPLACE option. This will basically avoid all issues of the passive datacenter be... [11:34:55] yes, sorry, I was on palladium [11:35:07] and got distracted as usual by a ticket [11:35:19] no worries [11:35:36] too much multi-tasking [11:41:50] RECOVERY - RAID on dbstore1001 is OK: OK: optimal, 1 logical, 2 physical [11:41:50] RECOVERY - RAID on dbstore1001 is OK: OK: optimal, 1 logical, 2 physical [11:43:06] (03PS1) 10BBlack: cache_misc: do not deliver expired cached objects [puppet] - 10https://gerrit.wikimedia.org/r/288372 [11:43:31] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: do not deliver expired cached objects [puppet] - 10https://gerrit.wikimedia.org/r/288372 (owner: 10BBlack) [11:44:51] (03PS1) 10Elukey: Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) [11:45:57] (03PS2) 10BBlack: X-Cache: support moving through hit before miss [puppet] - 10https://gerrit.wikimedia.org/r/288244 [11:46:26] (03CR) 10BBlack: [C: 032 V: 032] X-Cache: support moving through hit before miss [puppet] - 10https://gerrit.wikimedia.org/r/288244 (owner: 10BBlack) [11:46:47] (03PS2) 10Elukey: Add a new AQS testing environment to play with Cassandra settings before production. 
[puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) [11:53:13] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2288816 (10Jonas) Now I get content-length 0 for query.wikidata.org ``` jonkr@C134:~$ curl -v 'https://query.wikidata.org/'... [12:05:07] !log added jenkins 1.651.2 for precise-wikimedia to carbon [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:43] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#2288844 (10Joe) p:05Triage>03Normal [12:09:12] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [12:09:12] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [12:23:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 707 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6158200 keys - replication_delay is 707 [12:23:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 707 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6158200 keys - replication_delay is 707 [12:28:49] 06Operations: stats.wikimedia.org down - https://phabricator.wikimedia.org/T135121#2288867 (10Kghbln) [12:30:01] (03PS1) 10Mobrovac: Change-Prop: Limit the number of concurrent tasks [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) [12:30:22] (03PS1) 10Dereckson: UK EU edit-a-thon throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288377 (https://phabricator.wikimedia.org/T134902) [12:32:18] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [12:32:18] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet 
has 1 failures [12:32:51] !log etherpad-lite_1.6.0-1 uploaded on apt.wikimedia.org [12:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:02] (03CR) 10Dereckson: "@DCausse, Zuul enforces Depends-On: rule, so we can't merge accidentally these changes anymore." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [12:33:49] (03CR) 10Mobrovac: "PCC OKed it - https://puppet-compiler.wmflabs.org/2769/" [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [12:33:58] (03CR) 10DCausse: "thanks for the info :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [12:36:07] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:36:07] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:37:48] PROBLEM - DPKG on etherpad1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:37:48] PROBLEM - DPKG on etherpad1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:39:05] !log upgraded nodejs on etherpad1001 to nodejs 4.3 [12:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:39:31] moritzm: ^ [12:39:32] :-) [12:39:47] RECOVERY - DPKG on etherpad1001 is OK: All packages OK [12:39:47] RECOVERY - DPKG on etherpad1001 is OK: All packages OK [12:40:10] !log kill duplicate ircecho daemon on neon [12:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:20] I got the right one! [12:41:02] and I upgraded nodejs on etherpad1001 with just a sec of downtime.. 
I did not expect that [12:41:11] I was sure there would be some hiccup [12:41:51] nice :-) [12:45:08] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:53:01] <_joe_> akosiaris: did you also fix etherpad's db schema? [12:53:04] <_joe_> :P [12:56:04] 06Operations, 05codfw-rollout: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122#2288903 (10Joe) [12:56:17] _joe_: yes! all of it and then some :P [12:56:17] 06Operations, 05codfw-rollout: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122#2288915 (10Joe) p:05Triage>03Normal [12:58:31] (03PS1) 10Muehlenhoff: Install firejail von image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/288379 (https://phabricator.wikimedia.org/T135111) [12:59:19] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:59:37] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:53] <_joe_> akosiaris: ok next in line is european economy, then world hunger [13:00:27] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:00:38] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail [13:01:08] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:09] 06Operations: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124#2288935 (10Joe) [13:01:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:03:05] 06Operations, 05codfw-rollout: Install a second etcd cluster in codfw - https://phabricator.wikimedia.org/T135125#2288951 (10Joe) [13:07:41] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2288974 (10elukey) Current hit ratio for mc10XX hosts: .91238308555293170316 .91134372545957438854 .90685357675851006484 .91770057045075672736 .913717948920564... [13:08:45] !log disabling puppet and restarting MySQL on db2040 to test change 288361 (in scheduled downtime on icinga) T133780 [13:08:46] T133780: Multiple Puppet class make MySQL load /etc/my.cnf twice - https://phabricator.wikimedia.org/T133780 [13:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:09:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:09:38] * godog shakes fist at duplicate ircecho [13:09:57] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:09:57] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:10:15] (03PS1) 10Filippo Giunchedi: graphite: always sort lists [puppet] - 10https://gerrit.wikimedia.org/r/288382 (https://phabricator.wikimedia.org/T134889) [13:10:47] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:10:47] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:10:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6137755 keys - replication_delay is 0 [13:10:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6137755 keys - replication_delay is 0 [13:11:18] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:11:18] RECOVERY - citoid 
endpoints health on scb1002 is OK: All endpoints are healthy [13:11:19] (03PS1) 10Elukey: Raise the memcached chunk growth factor on mc1007 as part of a perf experiment. [puppet] - 10https://gerrit.wikimedia.org/r/288383 (https://phabricator.wikimedia.org/T129963) [13:13:33] (03PS2) 10Alexandros Kosiaris: Install firejail on image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/288379 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [13:13:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: always sort lists [puppet] - 10https://gerrit.wikimedia.org/r/288382 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [13:14:09] !log deploying GTID replication on all of es3 shard [13:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:22] jynus: \o/ [13:17:16] <_joe_> jynus: :)) [13:17:45] 06Operations, 05codfw-rollout: Turn on TLS for intra-cluster communications - https://phabricator.wikimedia.org/T135128#2289022 (10Joe) [13:19:07] 06Operations, 05codfw-rollout: Turn on etcd TLS for intra-cluster communications - https://phabricator.wikimedia.org/T135128#2289022 (10Joe) [13:19:22] (03PS1) 10Ema: Add debian/patches/0006-IMS-duplicate-headers-proper-handling.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/288387 (https://phabricator.wikimedia.org/T134989) [13:20:27] (03CR) 10BBlack: [C: 031] "Worth a shot!" 
[debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/288387 (https://phabricator.wikimedia.org/T134989) (owner: 10Ema) [13:21:06] 06Operations: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2289040 (10Joe) [13:21:21] (03CR) 10Ema: [C: 032 V: 032] Add debian/patches/0006-IMS-duplicate-headers-proper-handling.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/288387 (https://phabricator.wikimedia.org/T134989) (owner: 10Ema) [13:22:30] 06Operations, 05codfw-rollout: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122#2289055 (10Joe) [13:22:32] 06Operations: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2289040 (10Joe) [13:22:44] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2289058 (10elukey) I am going to raise the growth factor to 1.15 on mc1007 too to have a comparison after the weekend about this variable change with 1.4.21. Cu... [13:23:59] (03PS2) 10Elukey: Raise the memcached chunk growth factor on mc1007 as part of a perf experiment. [puppet] - 10https://gerrit.wikimedia.org/r/288383 (https://phabricator.wikimedia.org/T129963) [13:25:35] _joe_ I am checking --^ with the puppet compiler and if ok I'll merge, so we'll have a good comparison before adding the new options [13:26:51] akosiaris, _joe_ I am surprised you are happy, I am confused why? 
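Note on the memcached change above: the "chunk growth factor" is memcached's `-f` option (default 1.25), which controls how quickly slab chunk sizes grow from one slab class to the next. The sketch below is a simplified model, not memcached's actual allocator: the starting chunk size of 96 bytes, the 8-byte alignment, and the 1 MiB page are assumptions chosen for illustration, and the helper names are hypothetical. It only aims to show why lowering the factor from 1.25 to 1.15 yields more, finer-grained slab classes and therefore less padding per stored item.

```python
import bisect


def slab_chunk_sizes(growth_factor, base_size=96, page_size=1024 * 1024, align=8):
    """Chunk sizes for a simplified slab model (all defaults are assumptions)."""
    sizes = []
    size = base_size
    # Grow geometrically until a chunk would exceed half a page.
    while size <= page_size // 2:
        # Round up to the alignment boundary.
        if size % align:
            size += align - (size % align)
        sizes.append(size)
        size = int(size * growth_factor)
    # One final class that holds a whole page.
    sizes.append(page_size)
    return sizes


def chunk_for(item_size, sizes):
    """Smallest chunk size that can hold an item of item_size bytes."""
    index = bisect.bisect_left(sizes, item_size)
    if index == len(sizes):
        raise ValueError("item larger than the largest chunk")
    return sizes[index]


if __name__ == "__main__":
    for factor in (1.25, 1.15):
        sizes = slab_chunk_sizes(factor)
        chunk = chunk_for(100, sizes)
        print(f"factor={factor}: {len(sizes)} classes, "
              f"100-byte item uses a {chunk}-byte chunk")
```

A smaller factor trades memory efficiency per item against having more slab classes to balance, which is presumably why the change is being compared on a single host first.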
[13:27:06] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/2771/" [puppet] - 10https://gerrit.wikimedia.org/r/288383 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [13:28:18] it is literally 1 line change I am running [13:28:57] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:57] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:10] <_joe_> jynus: GTID is a good thing, isn't it? [13:29:18] yes of course [13:29:37] <_joe_> and I think it has been a long preparation for that change [13:29:47] exactly [13:29:49] <_joe_> it's like when I changed pybal's source of config [13:29:57] I am also happy we are moving with this, it is just that I would not even mention on a meeting [13:30:06] !log restarted memcached on mc1007 with chunk growth factor set to 1.15 - Part of a perf experiment (T129963) [13:30:07] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [13:30:10] <_joe_> it was a 3 letter change, but it was the coronation of a lot of work :) [13:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:33] it is the MariaDB 10 that took us out of the 90's technlogy [13:30:38] heh I remember doing circular mysql replication years ago without GTID [13:30:41] <_joe_> jynus: exactly [13:30:44] <_joe_> bblack: me too [13:30:55] and having to have my own special IDs assigned in my mysql configs [13:30:57] <_joe_> bblack: it was a thing when we were young :P [13:30:58] and at that point I was underwhelmed on the reaction [13:31:15] and this, which is a 1 line change, gets noticed :-) [13:31:23] that is why I was confused [13:31:54] <_joe_> jynus: it's not like we didn't notice the moving to mariadb 10, but am I wrong or we completed it during the switchover? 
[13:31:57] bblack, actually, I have to dissapoint you [13:32:02] yes [13:32:12] <_joe_> which was not exactly the most relaxed time of my work here [13:32:24] mariadb's GTID implementation uses server_id, not UUIDs like 5.6 [13:32:36] jynus: I just mean the old way was very ugly to maintain, in old mysql multi-master circular with server_id [13:32:39] yeah, _joe_ I assume so [13:32:49] you couldn't easily even expand your initial decision of a range of server_id and such [13:33:02] right now, I could do "CHANGE MASTER" on es3 [13:33:14] and it would auto-position the slaves [13:33:20] <_joe_> bblack: you were using master-master or a ring topology? [13:33:29] ring, across oceans... [13:33:33] we are like in 2000's technology already :-) [13:33:51] lol [13:33:52] still not autoprovisioning [13:34:01] east-coast-US + west-coast-US + Germany sites, in a 3-way multi-master ring with huge latencies [13:34:02] <_joe_> bblack: oh, my [13:34:17] and application trying to handle the rare transaction failures due to colliding updates [13:34:21] <_joe_> volans: do you remember our "somewhat consistent" mysql multimaster? [13:34:23] app is writing locally in all 3 places [13:34:23] !log upgrading misc to varnish 4.1.2-1wm4 and wiping caches (T134989) [13:34:24] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [13:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:53] _joe_, I just setup that for parsercaches [13:34:54] yeah _joe_ I remember well ;0 [13:35:02] so do not go too far for that [13:35:25] of course, we do not need consistency on a cache [13:35:38] <_joe_> jynus: trust me, that was _bad_ [13:35:47] <_joe_> but volans knows the gory details ofc [13:35:53] I've seen those, remember I used to be a consultant [13:36:34] <_joe_> right [13:36:35] aka "Mr. 
Client, do not worry, I've seen worse" [13:36:43] except that one time [13:36:44] <_joe_> you've seen it all, and even worse :P [13:37:09] which I literally hadn't seen worse [13:37:21] <_joe_> I don't dare to imagine [13:37:39] BTW, I was googling for common issues with GTID [13:37:59] and ran into AWS's official documentation on "how to setup a slave" [13:38:23] I got scared :-O [13:38:30] <_joe_> ahah [13:38:33] eheheh [13:39:06] nothing wrong with it, but "setting the master in read only mode while doing a dump" was like [13:39:33] HA level expectation is not very high there for clients [13:42:48] (03PS1) 10Muehlenhoff: WIP: Use firejail in image scaling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) [13:48:50] (03PS1) 10Dereckson: Allow import from outreach to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288392 (https://phabricator.wikimedia.org/T134788) [13:58:53] (03PS1) 10Filippo Giunchedi: graphite: exclude local machine from cluster_servers [puppet] - 10https://gerrit.wikimedia.org/r/288395 (https://phabricator.wikimedia.org/T134889) [14:01:37] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2289159 (10valhallasw) 05Open>03Resolved Presumably there was a high write load and db1026 couldn't keep up. Because this bug is about the production database, please tag it as #operations rather than #labs. A...
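Note on the GTID discussion above: the point that MariaDB's GTID implementation uses server_id rather than UUIDs shows up directly in the identifier format. A MariaDB GTID is written as `domain_id-server_id-sequence`, while a MySQL 5.6 GTID is written as `source_uuid:transaction_id`. Once replicas use GTID (for example via `CHANGE MASTER TO ... MASTER_USE_GTID = slave_pos`), the replica can find its own position on a new master instead of needing a binlog file and offset. The sketch below only parses the two textual formats; the function names and the sample values are made up for illustration and do not come from any host in this log.

```python
import uuid
from typing import NamedTuple


class MariaDBGtid(NamedTuple):
    domain_id: int
    server_id: int
    sequence: int


def parse_mariadb_gtid(text):
    """Parse one MariaDB GTID of the form domain_id-server_id-sequence."""
    domain, server, sequence = text.strip().split("-")
    return MariaDBGtid(int(domain), int(server), int(sequence))


def parse_gtid_pos(text):
    """Parse a comma-separated GTID position into a dict keyed by domain_id."""
    gtids = (parse_mariadb_gtid(part) for part in text.split(",") if part.strip())
    return {gtid.domain_id: gtid for gtid in gtids}


def is_mysql56_style(text):
    """True if text looks like a MySQL 5.6 GTID (source_uuid:transaction_id)."""
    source, separator, _ = text.partition(":")
    if not separator:
        return False
    try:
        uuid.UUID(source)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Sample values below are invented for the example.
    print(parse_mariadb_gtid("0-171970580-12345"))
    print(parse_gtid_pos("0-1-100,1-2-50"))
    print(is_mysql56_style("3E11FA47-71CA-11E1-9E33-C80AA9429562:23"))
```

Because the server_id is part of every MariaDB GTID, each server in a replication topology still needs a distinct server_id, which matches the remark above about the old circular setups.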
[14:01:57] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2289164 (10valhallasw) p:05Unbreak!>03Triage [14:06:37] (03PS2) 10Filippo Giunchedi: graphite: exclude local machine from cluster_servers [puppet] - 10https://gerrit.wikimedia.org/r/288395 (https://phabricator.wikimedia.org/T134889) [14:07:16] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:07:17] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:06] still me heh [14:08:40] godog: you killed the second icinga didn't you? :-) [14:08:50] volans: yeah but it came back :( [14:09:06] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.006 second response time [14:09:06] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.006 second response time [14:09:08] try with the other one :D [14:10:16] heheh I'm still fighting with graphite if you want to take a stab at ircecho [14:11:47] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:11:48] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:12:12] (03PS1) 10Muehlenhoff: Update to 4.4.10 [debs/linux44] - 10https://gerrit.wikimedia.org/r/288397 [14:14:02] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2288166 (10jcrespo) @doctaxon one of the largest dewiki servers went down a few days ago. New servers have already been requested and will be up soon. In particular, to mitigate the load issues, I decided to sacrifi... 
[14:15:35] (03PS3) 10Filippo Giunchedi: graphite: exclude local machine from cluster_servers [puppet] - 10https://gerrit.wikimedia.org/r/288395 (https://phabricator.wikimedia.org/T134889) [14:15:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: exclude local machine from cluster_servers [puppet] - 10https://gerrit.wikimedia.org/r/288395 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [14:18:27] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2289181 (10jcrespo) [14:18:29] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2230646 (10jcrespo) [14:19:16] 06Operations, 07Technical-Debt, 05codfw-rollout: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122#2289184 (10Aklapper) [14:19:59] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2230646 (10jcrespo) Prioritize db1058's replacement, which its lack is causing replication issues on T135100. This is still block by DC work. [14:20:26] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2289189 (10Cmjohnson) Yes, disk was replaced yesterday afternoon [14:23:46] PROBLEM - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:23:46] PROBLEM - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:24:03] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2289208 (10jcrespo) Although looking at the graphs, not only the current capacity is lower, there was also an unusualy high amount of INSERT and UPDATES. {F4001166} [14:24:43] <_joe_> uh? [14:25:14] <_joe_> chasemp, andrewbogott ^^ it's a labs recursor? 
[14:25:32] _joe_: it's in labtest sorry, not sure why enabled for alerting [14:25:38] RECOVERY - Recursive DNS on 208.80.153.51 is OK: DNS OK: 0.143 seconds response time. www.wikipedia.org returns 208.80.153.224 [14:25:38] RECOVERY - Recursive DNS on 208.80.153.51 is OK: DNS OK: 0.143 seconds response time. www.wikipedia.org returns 208.80.153.224 [14:25:38] _joe_: that should not be alerting at all, I don't know what that's about [14:26:09] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2289213 (10jcrespo) 05Open>03Resolved [14:31:09] _joe_, chasemp: I downtimed that alert until 2018 [14:33:34] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2289261 (10Luke081515) @jcrespo If it's not a secret, maybe you can list the machines here? The task from the description is not visible ;) [14:34:31] (03CR) 10Thcipriani: [C: 031] "Looks good. I made one inline suggestion, but this seems like a good base to iterate from. Thank you for doing this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [14:35:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.10 [debs/linux44] - 10https://gerrit.wikimedia.org/r/288397 (owner: 10Muehlenhoff) [14:36:01] 06Operations: stats.wikimedia.org down - https://phabricator.wikimedia.org/T135121#2289268 (10Aklapper) Cannot reproduce - it's up for me. [14:38:48] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2289270 (10jcrespo) @Luke081515 don't worry, you will be able to see the list when we start working on this ticket. That other ticket is hidden only because it is procurement (con... [14:39:47] (03Abandoned) 10Elukey: Add new memcached features/settings to mc1009 as part of perf experiment. 
[puppet] - 10https://gerrit.wikimedia.org/r/288230 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:42:33] (03PS1) 10Filippo Giunchedi: graphite: bump uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/288401 (https://phabricator.wikimedia.org/T134889) [14:43:32] (03PS2) 10Filippo Giunchedi: graphite: bump uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/288401 (https://phabricator.wikimedia.org/T134889) [14:45:24] !log enabling GTID on x1 shard (except db1029) [14:45:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: bump uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/288401 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [14:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:11] so now it starts the real test. I intend to move around and crash db2008 and db2009. [14:46:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 701 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6146026 keys - replication_delay is 701 [14:46:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 701 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6146026 keys - replication_delay is 701 [14:50:16] !log downtime flapping redis replication on rdb2006 alert for 10d [14:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:37] _joe_: ^ [14:51:35] (03CR) 10Thcipriani: [C: 04-1] "https://puppet-compiler.wmflabs.org/2774/" [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [14:54:24] (03PS6) 10Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [14:55:57] !log executing random CHANGE MASTERS and crashes on db200[89] trying to break them [14:56:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:16] if this doesn't work, we will have to migrate to postgres [14:56:25] :-) [14:56:41] rotfl [14:57:33] (03PS1) 10Gehel: Depooling wdqs1001 for reinstall and new disks [puppet] - 10https://gerrit.wikimedia.org/r/288402 (https://phabricator.wikimedia.org/T120714) [14:58:57] 06Operations: stats.wikimedia.org down - https://phabricator.wikimedia.org/T135121#2289304 (10Kghbln) Odd, I just get a blanc screen so most probably a 500 both with Firefox 46 and Chrome 49. However I just tried Opera and Epiphany and these are working. I am trying to access from a Linux Mint 17.3 machine. Tha... [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1500). [15:00:04] kart_ dcausse bawolff frimelle: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:12] Woo! [15:00:15] o/ [15:00:23] swat... [15:01:02] I can SWAT. Whew, lots of changes :) kart_ frimelle ping me if/when around. [15:01:25] i am around for the wikidata stuff, if frimelle is not [15:01:43] (03PS3) 10Thcipriani: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [15:01:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [15:02:07] aude: thanks. [15:02:13] thcipriani I am around too [15:02:21] hi frimelle :) [15:02:21] aude ^ [15:02:28] hey aude! [15:02:39] (03Merged) 10jenkins-bot: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [15:03:19] * kart_ around. 
[15:03:26] thcipriani: yes [15:03:36] hmm dcausse I'm a little worried about how to sync this one https://gerrit.wikimedia.org/r/#/c/287973/ can I just sync event-schemas first? [15:03:52] thcipriani: yes, event-schemas firt [15:04:14] this one caused some trouble last time but hopefully we fixed the issue :) [15:05:02] sync-dir the whole wmf-config dir :) [15:05:40] this process is a deferredUpdates so in the worst case we'll flood the logs but it should not be noticed by the users [15:06:53] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: Remove deprecated settings [[gerrit:281966]] (duration: 00m 34s) [15:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:11] (03PS2) 10Thcipriani: Bump CirrusSearchRequestSet avro schema to rev 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287973 (https://phabricator.wikimedia.org/T133726) (owner: 10DCausse) [15:07:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287973 (https://phabricator.wikimedia.org/T133726) (owner: 10DCausse) [15:08:01] (03Merged) 10jenkins-bot: Bump CirrusSearchRequestSet avro schema to rev 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287973 (https://phabricator.wikimedia.org/T133726) (owner: 10DCausse) [15:09:06] oh boy. OK, I'll just try a sync-dir on this one. [15:09:13] * thcipriani braces [15:10:06] thcipriani: you can sync-dir the event-schemas first should have no effect [15:10:09] (no prod effect) [15:10:16] then syncing the other will start using it [15:10:17] ah, ok. I'll do that then. [15:11:18] <_joe_> godog: thanks, the related ticket is (more or less) https://phabricator.wikimedia.org/T135113 [15:12:01] _joe_: ack, thanks! 
[15:12:36] !log thcipriani@tin Synchronized wmf-config/event-schemas: SWAT: Bump CirrusSearchRequestSet avro schema to rev 121456865906 PART I [[gerrit:287973]] (duration: 00m 26s) [15:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:53] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2289342 (10elukey) a:05Cmjohnson>03elukey [15:13:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Bump CirrusSearchRequestSet avro schema to rev 121456865906 PART II [[gerrit:287973]] (duration: 00m 25s) [15:13:28] ^ dcausse check please [15:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:46] (03PS1) 10Elukey: Add some suggestions to the aqs partman recipe. [puppet] - 10https://gerrit.wikimedia.org/r/288403 (https://phabricator.wikimedia.org/T133785) [15:14:26] (03CR) 10Elukey: [C: 032 V: 032] Add some suggestions to the aqs partman recipe. [puppet] - 10https://gerrit.wikimedia.org/r/288403 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [15:14:41] thcipriani: unfortunately I can only check in one hour when the logs will be available in hadoop [15:14:54] but if you don't see any new errors it's fine [15:15:20] looking at fatalmonitor it seems to be ok [15:15:21] dcausse: ok. Nothing spiking logs. Thanks for walking me through the sync, appreciated :) [15:15:29] thcipriani: thanks! :) [15:15:41] (03PS1) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 [15:16:04] dcausse: we can sorta check with kafkacat [15:16:18] ebernhardson: oh that's true [15:17:06] except my avrotest.scala script now errors out about not finding avro deps ... 
so i gotta figure that out :) [15:17:07] but I'm pretty sure that avro would have started to spam the logs if something was not set properly :) [15:17:21] yea it would have typically [15:19:50] (03PS1) 10BBlack: cache_maps: use more FE mem on prod hardware [puppet] - 10https://gerrit.wikimedia.org/r/288406 [15:20:03] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/extensions/ContentTranslation/modules/tools/ext.cx.tools.mt.js: SWAT: MT: Use custom labels instead of provider id [[gerrit:288171]] (duration: 00m 25s) [15:20:05] ^ kart_ check if possible please [15:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:13] (03PS1) 10Gehel: Increased wdqs partition now that we have new disks ready. [puppet] - 10https://gerrit.wikimedia.org/r/288407 (https://phabricator.wikimedia.org/T120714) [15:20:54] dcausse: yup looks to be working fine, seeing records with the new 121456865906 rev id [15:21:31] thcipriani: yep [15:21:41] 06Operations: stats.wikimedia.org down - https://phabricator.wikimedia.org/T135121#2289351 (10Kghbln) It's a **304 Not Modified** I am getting from those two. [15:23:05] thcipriani: oh ca/hewikis are not on 1.28.0-wmf.1 yet, right? [15:23:25] kart_: right, we held train yesterday [15:24:07] thcipriani: difficult to check, so go ahead with next patch wmf23. [15:24:17] kart_: ack. 
[15:25:12] 06Operations: stats.wikimedia.org down - https://phabricator.wikimedia.org/T135121#2289370 (10BBlack) [15:25:14] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2289372 (10BBlack) [15:25:53] !log thcipriani@tin Synchronized php-1.27.0-wmf.23/extensions/ContentTranslation/modules/tools/ext.cx.tools.mt.js: SWAT: MT: Use custom labels instead of provider id [[gerrit:288342]] (duration: 00m 26s) [15:25:55] ^ kart_ check please [15:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:22] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) In the merged ticket above, it's browser access to status.wm.o, and the browser's getting a 304 Not Modifie... [15:27:47] ok [15:28:39] thcipriani: looks good. Thanks! [15:28:51] kart_: awesome, thanks for checking! [15:30:11] bawolff: I am going to wait for zuul and just sync all changes to updateCollation at once, if that's ok with you. [15:30:24] thcipriani: sounds good [15:31:16] I've just done a hard reset on db2008, see how that goes considering the hardware cache [15:31:33] (03PS1) 10Andrew Bogott: Open up the firewall for mysql on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/288409 [15:31:36] (in write-back) [15:32:40] Oh, I just realized that 1.28wmf1 is not actually on more wikis, guess its too late to request for the other branch too [15:33:16] (03CR) 10Mobrovac: [C: 031] "Cherry-picked on beta, works." [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [15:33:21] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:33:21] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:33:21] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:33:50] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:33:50] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:00] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:01] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:01] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:10] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:34:10] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:11] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:21] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:27] (03PS1) 10Elukey: Ported all the suggestions to the AQS recipe. [puppet] - 10https://gerrit.wikimedia.org/r/288410 (https://phabricator.wikimedia.org/T133785) [15:34:41] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:44] what is this? [15:35:02] url_downloader, maybe [15:35:08] yeah I think so [15:35:08] (03CR) 10Elukey: [C: 032 V: 032] Ported all the suggestions to the AQS recipe. 
[puppet] - 10https://gerrit.wikimedia.org/r/288410 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [15:35:10] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [15:35:12] alsafi down [15:35:23] (03PS2) 10Andrew Bogott: Open up the firewall for mysql on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/288409 [15:35:32] RECOVERY - Disk space on alsafi is OK: DISK OK [15:35:32] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [15:35:34] jynus: can you take a look? 
I'm in a meeting [15:35:48] yes [15:35:51] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [15:35:51] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [15:35:51] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [15:35:53] I am doing it now [15:36:00] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [15:36:00] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [15:36:01] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:36:06] I think it is what we all think [15:36:11] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [15:36:31] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [15:36:32] bawolff: eh, your changes were all fairly small. We could probably squeeze them in.
[15:36:52] ok, should I cherry-pick to the other branch right now [15:37:01] RECOVERY - DPKG on alsafi is OK: All packages OK [15:37:01] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [15:37:01] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:01] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [15:37:19] (03PS3) 10Andrew Bogott: Open up the firewall for mysql on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/288409 [15:37:35] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) [15:38:14] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) {T135086} does not have much details beside the layout of 15.wikipedia.org being broken. [15:39:31] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2289435 (10jcrespo) It happened again a few minutes ago on alsafi, fixed on doing ssh again.
[15:39:45] (03CR) 10Andrew Bogott: [C: 032] Open up the firewall for mysql on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/288409 (owner: 10Andrew Bogott) [15:39:58] thcipriani: So that would be https://gerrit.wikimedia.org/r/#/c/288411/ https://gerrit.wikimedia.org/r/#/c/288412/ and https://gerrit.wikimedia.org/r/#/c/288413/ [15:41:09] bawolff: kk, thanks. [15:41:16] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/maintenance/updateCollation.php: SWAT: updateCollation.php sql updates [[gerrit:288386]] [[gerrit:288384]] [[gerrit:288385]] (duration: 00m 26s) [15:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:00] I cannot find the restbase-depends-on-url-downloader ticket [15:44:43] hello. today's swat looks pretty crowded… any chance i could add one more thing? [15:46:56] !log thcipriani@tin Synchronized php-1.27.0-wmf.23/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php: SWAT: Update ArticlePlaceholder - Fix notability checks and props params [[gerrit:288396]] (duration: 00m 25s) [15:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:06] ^ aude frimelle check please [15:47:39] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259139 (10jcrespo) A downtime happened again today on codfw (no users affected) due to this defect. While th... [15:47:44] ok [15:48:14] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2289479 (10Kghbln) >>! In T134989#2289372, @BBlack wrote: > due to missing character encoding supposedly, but it's entirely li... [15:48:22] thcipriani: do you know what's up with 1.28wmf1? when is it getting deployed where? 
the Deployments page lies [15:49:06] MatmaRex: it should roll forward to group1 today at least, the error that was exploding logs seems to have been corrected. [15:49:16] ostriches: ^ do you plan to roll to all today? [15:49:49] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Asking Wikidata PM Lydia if this is ok for her." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [15:51:01] thcipriani: hmm, also, you're done with swat already? :o can i add another patch? https://gerrit.wikimedia.org/r/#/c/288414/ to 1.28wmf1 [15:51:21] i just noticed a very stupid, very ugly bug in UploadWizard. we're lucky it wasn't deployed to Commons yet [15:51:57] thcipriani: i think it is ok [15:52:43] !log testing file storage on misc eqiad+esams T134989 [15:52:44] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [15:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:08] aude did you test it? [15:53:27] doesn't seem to work for me but not sure if that's the patch though :| [15:53:28] MatmaRex: I am actually not done with SWAT yet and we're going to end up running a bit long. Can I finish up, clear the way for puppet swat, then deploy yours before any train-rolling? [15:53:38] frimelle: i am trying to understand how it works [15:53:59] thcipriani: yeah, sure. i am late with it, after all :) [15:54:09] nothing is further broken though [15:54:31] (and it's not live on any important wikis, just group0) [15:54:45] aude: If you search for an item on the WP that has a label/alias in the WP's language it should have a new section for results in the ArticlePlaceholder [15:54:48] (03CR) 10CSteipp: [C: 031] "Those settings look ok to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [15:55:18] MatmaRex: thanks. We'll get it in after puppet swat finishes. 
[15:55:57] thcipriani, MatmaRex: Yeah, I plan to catch up today on the deploy. [15:56:01] We'll have wmf.1 everywhere [15:56:10] (03CR) 10Gehel: [C: 031] "LGTM and puppet compiler seems happy: https://puppet-compiler.wmflabs.org/2775/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [15:56:54] (03CR) 10BBlack: [C: 031] Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [15:57:10] frimelle: we can chat int he wikidata channel [15:57:18] * aude back in 2 minutes [15:57:39] gehel: hehe, i've already ran it on the pcc, see my first comment on the ps :) [15:57:41] (03PS1) 10Chad: group1 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288417 [15:58:09] (03CR) 10Chad: [C: 031] "Prepped for later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288417 (owner: 10Chad) [15:58:14] * gehel needs to learn to read [15:59:13] thcipriani: please proceed with wmf.1 for that patch. at least it is not broken more, would be worse to not roll it out. [15:59:36] (the wikidata article placeholder one) [16:00:03] mobrovac: I'm going to take care of that deployment. It seems this is going to require a manual restart of the service. Do you take care of that? Or should I? [16:00:04] godog moritzm _joe_ gehel: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1600). Please do the needful. [16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:11] jzerebecki: thank you. I'll leave it as is. [16:00:29] One last patch for SWAT then I'll be out of your way for puppetswat gehel [16:00:44] thcipriani: take your time, I only have one patch... [16:00:44] thcipriani: you mean you won't apply wmf.1? [16:01:47] jzerebecki: right. 
sorry, forgot that the same version of wikidata is on two wikimedia versions. [16:03:52] !log thcipriani@tin Synchronized php-1.27.0-wmf.23/maintenance/updateCollation.php: SWAT: updateCollation.php sql changes backport (duration: 00m 26s) [16:03:53] (03PS1) 10Volans: Remove temporary certificate with both CAs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288419 (https://phabricator.wikimedia.org/T111654) [16:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:01] ^ bawolff backport sync'd [16:04:07] Thanks :) [16:04:11] gehel: be here in 2 mins [16:04:33] (03PS2) 10Mattflaschen: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) [16:05:22] (03CR) 10Mattflaschen: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [16:05:36] (03PS1) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [16:05:50] gehel: kk, i'm here [16:05:50] jynus, could you review https://gerrit.wikimedia.org/r/#/c/288323/ ? It's setting up the Flow-specific External Store cluster on Beta. [16:06:05] sure [16:06:10] gehel: the service is stopped in prod and will stay that way until monday morning [16:06:24] (03CR) 10Volans: "WIP, needs this merged first and add the submodule change here:" [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:06:30] mobrovac: ok, that answers my question... 
[16:06:38] gehel: however, we need this limit config applied before we start it up again otherwise we might be looking at another outage [16:06:45] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php: SWAT: Update ArticlePlaceholder - Fix notability checks and props params [[gerrit:288396]] (duration: 00m 24s) [16:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:51] ^ jzerebecki sync'd [16:07:08] mobrovac: ok, I'll wait for thcipriani to finish and I'll send that one... [16:07:23] let me check that the tables are there [16:07:48] gehel: cool, thnx! [16:08:13] gehel: that's it for regular SWAT. Please ping me when you're done, though. I have a late one that I want to squeeze in before we roll out wmf.1 everywhere. [16:08:33] thcipriani: ok, this one should not take long... [16:08:41] thcipriani: thx [16:08:51] (03PS2) 10Gehel: Change-Prop: Limit the number of concurrent tasks [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [16:10:00] (03CR) 10Gehel: [C: 032] Change-Prop: Limit the number of concurrent tasks [puppet] - 10https://gerrit.wikimedia.org/r/288376 (https://phabricator.wikimedia.org/T134456) (owner: 10Mobrovac) [16:10:26] !log evacuate ganeti2006.codfw.wmnet from ganeti secondary instances [16:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:46] !log migrate all ganeti instances except alsafi off ganeti2006 [16:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:10] !log merged and applied https://gerrit.wikimedia.org/r/#/c/288376/ (T134456) [16:11:11] T134456: Implement concurrency limiting on change propagation & enforce conservative limits and delays - https://phabricator.wikimedia.org/T134456 [16:11:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:32] matt_flaschen, I am not sure about this, assuming you have mimiced the regular external storage [16:11:48] shouldn't it need an ip on externalLoads? [16:11:54] maybe I am not seeing it [16:12:01] mobrovac: should be all good. Can you check? [16:12:18] gehel: checking [16:12:46] jynus, my bad. You're right, will amend. [16:12:46] gehel: yup, all good [16:12:48] thcipriani: you're good to go! [16:12:50] gehel: thnx! [16:12:52] mobrovac: thanks! [16:13:04] gehel: thanks! [16:13:05] matt_flaschen, it is on the same ip [16:14:08] the tables on the actual servers look ok [16:14:31] (03PS3) 10Mattflaschen: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) [16:14:39] in your defense, it is not the most obvious configuration file :-) [16:14:48] MatmaRex: do you have a backport to wmf.1 for https://gerrit.wikimedia.org/r/#/c/288414/ ? [16:15:00] believe me, I edit it every day [16:15:21] thcipriani: I actually have to small puppet patches to push for myself. Let me know when you're done... [16:15:46] gehel: go for it. Got some patch wrangling to do first. [16:15:51] jynus, thanks, fixed. [16:15:59] thcipriani: thanks! Should not take long... [16:16:04] didn't expect puppetswat to be so quick :P [16:16:13] (03PS2) 10Gehel: Increased wdqs partition now that we have new disks ready. [puppet] - 10https://gerrit.wikimedia.org/r/288407 (https://phabricator.wikimedia.org/T120714) [16:16:19] (03CR) 10Jcrespo: [C: 031] "Everything looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [16:16:55] matt_flaschen, if you happen to deploy it early, I would also like to be present [16:17:42] (03CR) 10Gehel: [C: 032] Increased wdqs partition now that we have new disks ready. 
[puppet] - 10https://gerrit.wikimedia.org/r/288407 (https://phabricator.wikimedia.org/T120714) (owner: 10Gehel) [16:18:21] (03PS2) 10Gehel: Depooling wdqs1001 for reinstall and new disks [puppet] - 10https://gerrit.wikimedia.org/r/288402 (https://phabricator.wikimedia.org/T120714) [16:18:42] (03CR) 10Lydia Pintscher: [C: 031] "Ok from my side too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [16:19:57] (03CR) 10Gehel: [C: 032] Depooling wdqs1001 for reinstall and new disks [puppet] - 10https://gerrit.wikimedia.org/r/288402 (https://phabricator.wikimedia.org/T120714) (owner: 10Gehel) [16:20:01] jynus, I'll deploy it now. [16:20:11] nice [16:20:18] (03CR) 10Jcrespo: [C: 031] Remove temporary certificate with both CAs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288419 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:21:34] thcipriani: no, didn't cherrypick yet [16:21:41] (03CR) 10Mattflaschen: [C: 032] Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [16:21:51] (03PS4) 10Mattflaschen: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) [16:21:54] thcipriani: did now. https://gerrit.wikimedia.org/r/#/c/288423/ [16:22:26] could db1023 have disk issues, it has been a bit unstable lately? [16:23:08] MatmaRex: thanks. once we get the all-clear, I'll get it out the door. [16:23:16] (03CR) 10Alex Monk: "/usr/bin/firejail doesn't appear to exist on any apaches" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [16:23:38] 1 critical disk [16:23:49] (03PS1) 10Elukey: Switch the PXE installer to Trusty to check a boot bug after Jessie install. 
[puppet] - 10https://gerrit.wikimedia.org/r/288424 (https://phabricator.wikimedia.org/T133785) [16:23:58] !log removing wdqs1001 from rotation for disk upgrade (T120714) [16:23:59] T120714: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714 [16:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:27] Slot Number: 8 seems about to fail [16:26:07] (03CR) 10Mattflaschen: [C: 032] Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [16:26:08] jynus: yes a lot of failures there [16:26:16] I will depool it [16:26:21] (Predictive) [16:26:25] (03CR) 10Muehlenhoff: "@Alex Monk: To get it installed on the image/video scalers there's https://gerrit.wikimedia.org/r/#/c/288379/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [16:26:30] (03CR) 10Elukey: [C: 032] Switch the PXE installer to Trusty to check a boot bug after Jessie install. [puppet] - 10https://gerrit.wikimedia.org/r/288424 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [16:26:33] but I do not have enough servers yet! [16:26:42] (03Merged) 10jenkins-bot: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [16:26:50] maybe we have a spare disk? [16:27:03] s6 and s5 are starting to get impatient for those new servers [16:28:08] or we can "depool" the disk (it is out of warranty anyway) [16:28:49] (03CR) 10Muehlenhoff: "For testing, I'll add this to deployment-prep tomorrow and see which kind of tests I can run there. 
Once this is running in principle we c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [16:29:03] !log mattflaschen@tin Synchronized wmf-config/db-labs.php: Beta Cluster change (duration: 00m 27s) [16:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:05] !log running "megacli -PDOffline -PhysDrv '[32:8]' -aALL" on db1023 [16:30:08] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Beta Cluster change (duration: 00m 29s) [16:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:27] wait for the alert [16:30:52] (03PS1) 10Alexandros Kosiaris: deploy-service: add akosiaris [puppet] - 10https://gerrit.wikimedia.org/r/288425 [16:31:12] jynus: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1023&service=RAID [16:31:18] (03CR) 10Mobrovac: "The aqs* files lack the .yaml extension and AFAIK Hiera won't read them otherwise." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [16:31:26] (03CR) 10Mobrovac: [C: 04-1] Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [16:31:36] volans, can you check the disks on db1026, there was the lag you mentioned last nigh, and I want to discard disks as the cause [16:31:41] jynus, it's on Beta, testing now. [16:31:44] meanwhile I can file a ticket [16:31:49] about db1023 [16:32:09] sure [16:32:39] morebots: aaaargh that .yaml, I always forget :( [16:32:39] I am a logbot running on tools-exec-1207. [16:32:39] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:32:40] To log a message, type !log . 
[16:32:40] thanks [16:32:50] oh sorry, it was mobrovac [16:32:51] jynus: all predictive failures are zero [16:32:51] :P [16:33:03] haha [16:33:06] :D [16:33:15] volans, great [16:33:24] so my other hypothesis stands [16:33:27] morebots should already be used to it, elukey, people often do that [16:33:28] I am a logbot running on tools-exec-1207. [16:33:28] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:33:28] To log a message, type !log . [16:33:33] i guess it's not [16:33:34] jynus, I made another mistake. Fixing now. [16:33:34] haha [16:34:09] what is is it? [16:34:16] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#366296 (10faidon) Option (3) sounds like the easiest way forward to me and an acceptable option. My only concern would be whether it could handle a surge of traffic (the kind of traffic it'd... [16:34:21] (I will wait for the change) [16:34:53] (03PS1) 10Mattflaschen: Fix Flow external store: Protocol was missing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288427 [16:35:14] oh, sorry about that [16:35:21] I should have caught that [16:35:29] PROBLEM - RAID on db1023 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:35:45] (03PS2) 10Mattflaschen: Fix Flow external store: Protocol was missing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288427 [16:36:06] (03CR) 10Mattflaschen: [C: 032] Fix Flow external store: Protocol was missing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288427 (owner: 10Mattflaschen) [16:36:25] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 10cassandra, and 2 others: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#2289684 (10faidon) 05Open>03Resolved "Set up multi-DC replication for Cassandra" is a bit misleading nowadays — that part is done. The remaini...
[16:36:45] (03Merged) 10jenkins-bot: Fix Flow external store: Protocol was missing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288427 (owner: 10Mattflaschen) [16:38:01] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Beta Cluster change (duration: 00m 28s) [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:42] (03PS8) 10Dzahn: acme-setup: only accept '^[-a-zA-Z0-9_]+$' as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) [16:39:45] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259139 (10Yurik) I will work on it later today. Graphoid needs to be able to contact any wiki project via AP... [16:40:00] (03CR) 10Elukey: "You are completely right Marko, fixing the code review thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [16:40:52] matt_flaschen: gehel done on tin? [16:41:07] thcipriani: sorry, yes, completely done [16:41:35] thcipriani: and those were actually puppet patches, they did not event touch tin... [16:41:52] (03CR) 10Alex Monk: "Really shouldn't be necessary to do this given that you're in the ops group." [puppet] - 10https://gerrit.wikimedia.org/r/288425 (owner: 10Alexandros Kosiaris) [16:42:07] gehel: thanks [16:42:23] thcipriani, yeah. 
[16:42:32] matt_flaschen: kk, thanks [16:42:59] (03CR) 10Dzahn: [C: 032] acme-setup: only accept '^[-a-zA-Z0-9_]+$' as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [16:43:47] (03PS2) 10Dzahn: ircd: remove exec permission bit on unit file [puppet] - 10https://gerrit.wikimedia.org/r/288333 [16:43:51] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/extensions/UploadWizard/resources/controller/uw.controller.Details.js: SWAT: Fix Uncaught TypeError: this.copyMetadataWidget.remove is not a function [[gerrit:288423]] (duration: 00m 24s) [16:43:57] ^ MatmaRex [16:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:31] thcipriani: thank you [16:45:21] (03CR) 10Dzahn: [C: 032] ircd: remove exec permission bit on unit file [puppet] - 10https://gerrit.wikimedia.org/r/288333 (owner: 10Dzahn) [16:45:36] (03PS3) 10Elukey: Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) [16:45:43] jynus: are you taking care of the icinga alarm for RAID on db1023? [16:45:57] !log restart alsafi for 2.5+dfsg-4~bpo8+1 qemu upgrade [16:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:14] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2289736 (10akosiaris) I 've emptied ganeti2006 and drained it. It can not accept new VMs (either primary or secondary). I 've left alsafi on it and upgrade to qemu to `2.5+dfsg-4~bpo8+1`. Now the waiting part begins... 
[16:46:21] volans, I intend to [16:46:33] !log investigating carbon-c-relay crash with buffer overflow on graphite1001 [16:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:45] matt tried to hold me a bit :-) [16:46:52] if you can't just tell me I'll do it ;) [16:47:15] Sorry [16:47:17] Works now, though. :) [16:47:23] I won't say no to help, volans [16:47:36] I wanted to see if it fixed the replag first [16:47:44] on https://tendril.wikimedia.org/host/view/db1023.eqiad.wmnet/3306 [16:49:28] I've put a watch on the host, I'll keep an eye on the lag [16:49:29] matt_flaschen, very nice contents, too: "1 | �K-W(�/�L " [16:50:18] my idea is to create a new blobs table once this has been deployed to production [16:50:35] so that the new table only has non-flow content [16:51:13] and also tables do not grow indefinitely big [16:51:46] (03PS3) 10Alex Monk: Try to separate trebuchet stuff from role::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/284851 [16:51:56] that is why I wanted to conserve the script for the table creation [16:52:13] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447#2289769 (10Dzahn) [16:52:19] even if it didn't work [16:54:19] jynus, yeah. :) I just figured you weren't using that script (since it didn't work) and were doing it some other way. Glad we got it sorted out. [16:54:51] well, technically we were not using it [16:55:23] but only because it has been 3 years (maybe more) since it last was run? :-) [16:55:32] Right [16:55:33] but there was no replacement [16:55:48] jynus, so the next step is https://phabricator.wikimedia.org/T119567 , dry-run mode on Beta. 
[16:56:04] yes [16:56:50] (03CR) 10Alex Monk: [C: 04-1] "Error: Duplicate declaration: Class[Role::Deployment::Apache] is already declared in file /mnt/jenkins-workspace/puppet-compiler/2776/chan" [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:56:51] I said I would get permission from RelEng, so I'll double check with greg, though I think he will be fine with it. [16:57:29] technically if we break something on beta db, I will fix it, so it will probably be ok [16:57:47] why d oyou need permission from releng? aren't you a deployment-prep projectadmin? [16:58:00] I can take a backup just in case so beta can recover quickly [16:58:11] (03CR) 10jenkins-bot: [V: 04-1] Try to separate trebuchet stuff from role::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:58:21] I would, however, add more records to flow [16:58:48] well, now it is late [16:58:59] but to es == more pages [16:59:16] so that the script doesn't run on 0.1 seconds [17:00:03] I would give the script a second look [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1700). [17:00:16] jynus, okay, can you post on the task when you've taken the backup and are comfortable with me running it? [17:00:19] (03CR) 10Yuvipanda: [C: 04-1] prometheus: add node_exporter support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [17:00:19] now that we have a proper testing ground [17:00:23] I have to go to a meeting now. [17:00:27] sure, will do [17:00:31] jynus, do you think I need to check with RelEng? [17:00:52] do, out of courtesy, I do not want to override them [17:01:22] Okay, will do. 
[17:03:33] !log upgrade carbon-c-relay from jessie-backports on graphite1003 / graphite2002 [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:00] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (and SSL) - https://phabricator.wikimedia.org/T134811#2289856 (10Krinkle) [17:06:29] 06Operations, 10ops-ulsfo: ulsfo planned maintenance on 2016-05-11 - https://phabricator.wikimedia.org/T134831#2289860 (10RobH) 05Open>03Resolved This work was completed yesterday, and I returned ULSFO to service with https://gerrit.wikimedia.org/r/#/c/288325/ Had to learn conftool and how to repool varn... [17:07:08] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.001 second response time [17:07:29] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.001 second response time [17:07:31] (03PS9) 10Yuvipanda: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [17:08:04] gehel: ^ is that you? [17:08:11] ebernhardson: SMalyshev ^ [17:08:18] YuviPanda: yes that's me [17:08:20] ok!
[17:08:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6130758 keys - replication_delay is 0 [17:08:59] (03PS10) 10Yuvipanda: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [17:09:08] I need to start a tip jar for every time I forget to disable alerts... [17:09:55] (03PS11) 10Yuvipanda: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [17:10:20] who gets the money? you cannot get tipped for your own alerts! [17:10:37] robh: I'll make a donation to WMF... [17:10:46] * robh forgot the ipsec alerts yesterday and it spammed the channel with hundreds of them =P [17:13:19] !log enabling GTID on selected db production hosts to change topology of dbstore1001 [17:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:41] (I probably will have to wait 1 day to change its master) [17:20:16] gehel: oh you're installing the disks there? cool :) [17:21:43] gehel: that always works well. take wmf money, pay you, remove taxes, then send back to wmf ;) [17:25:28] ebernhardson: I'm a European communist, I want to pay more taxes... [17:26:30] SMalyshev: yes, chris should add those disks any minute. And then we just have to reinstall / reimport...
[17:27:49] (03PS2) 10BBlack: cache_maps: use more FE mem on prod hardware [puppet] - 10https://gerrit.wikimedia.org/r/288406 [17:28:04] (03CR) 10BBlack: [C: 032] cache_maps: use more FE mem on prod hardware [puppet] - 10https://gerrit.wikimedia.org/r/288406 (owner: 10BBlack) [17:28:12] (03CR) 10BBlack: [V: 032] cache_maps: use more FE mem on prod hardware [puppet] - 10https://gerrit.wikimedia.org/r/288406 (owner: 10BBlack) [17:28:15] greg-g, are you there? [17:28:15] matt_flaschen: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [17:28:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 0-171966669-2545795168, which is not in the masters binlog. Since the masters binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executin [17:28:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 0-171966669-2545795168, which is not in the masters binlog. 
Since the masters binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executi [17:28:30] wow [17:28:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 0-171974683-3913731164, which is not in the masters binlog [17:28:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 0-171974683-3913731164, which is not in the masters binlog [17:29:02] cool [17:29:11] maybe not-cool? :) [17:29:18] it seems things werent as easy as we thought [17:29:20] :-) [17:29:26] greg-g, okay. :) Can we discuss External Store dry run script (https://phabricator.wikimedia.org/T119567) when you are around? [17:29:38] was a sarcastic cool [17:29:40] it is a delayed slave, but we call him "special" slave [17:29:53] so I guess he's delayed longer than binlogs go? :) [17:29:58] well, longer than GTID binlogs go [17:30:06] because it is starting and stopping all the time [17:30:25] for backups? 
[17:30:38] no, to keep it delayed [17:30:47] oh ok [17:30:47] it is for backups, but it is delayed for disaster recovery [17:31:12] I do not understand, it worked for a while [17:31:42] ^ I think I could say that at least 10 times every day [17:31:44] I will revert it, investigate tomorrow [17:32:07] good news is that going from it to non-gtid is very transparent [17:33:18] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:33:32] db1023 looks better now, I'll open a replace disk task and ack icinga with that [17:34:05] !log reverting dbstore1001 to regular replication coords [17:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:12] (03PS1) 10Legoktm: Disable $wgCentralAuthCheckSULMigration functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) [17:34:27] jynus, if you want to add more records to Flow, we have to change $wgFlowExternalStore back to the old one. Right now, nothing has been migrated or dry-run-migrated, but new Flow records are going to flow_cluster1. [17:34:39] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:51] bblack, it is so much fun when things fail 24 hours after you create an issue! [17:35:16] jynus, oh, I see, that's what you meant by "now it is late". I thought you meant the actual time in Europe. [17:35:59] but aside from jokes (which are sadly real), the delayed slaves have saved us several times from data loss [17:36:23] it is one of those things that you never pay attention to until you need them [17:37:23] matt_flaschen, no worries, I will check it soon and give it a thought for proper testing [17:37:47] jynus, thanks.
I am also adding notes to the task from our discussion above. [17:38:21] bblack, also, by experience, I enabled it on very selected hosts for precisely the above reasons [17:38:37] jynus: do we want to wait tomorrow to ask for replacement of db1023 disk? [17:38:51] volans, no, file the task if you want now [17:38:59] ok [17:39:04] but do not give it too much priority [17:39:41] hopefully we will decom db1023 soon [17:40:49] how can that not be on the master's log? [17:41:22] I am missing something obvious, but I am now too tired to discover it [17:41:31] see you [17:41:43] 06Operations, 10ops-eqiad: db1023 Degraded RAID - https://phabricator.wikimedia.org/T135157#2290041 (10Volans) [17:42:43] ACKNOWLEDGEMENT - RAID on db1023 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Volans https://phabricator.wikimedia.org/T135157 [17:42:43] ACKNOWLEDGEMENT - RAID on db1023 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Volans https://phabricator.wikimedia.org/T135157 [17:43:02] (03CR) 10DCausse: "small typo in class name" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [17:45:35] can someone with op grants kick icinga-wm_, it is getting annoying? [17:45:43] (03PS6) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [17:53:41] (03CR) 10Keegan: [C: 031] "Time to sunset." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) (owner: 10Legoktm) [17:58:39] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:39] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:09] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:59:11] (03PS12) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [17:59:40] I'm trying to find the error rate dashboard, but https://gdash.wikimedia.org/dashboards/reqerror/ no longer works and https://graphite.wikimedia.org/dashboard/ just says "Drop to Merge" [18:00:11] Where did that move to? The documentation on wikitech doesn't say [18:00:22] grafana.wikimedia.org perhaps [18:00:26] I know exactly why dbstore1001 failed and I will fix it tomorrow (failing wasn't an error but our fault) [18:00:49] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:01:11] hmmm https://grafana.wikimedia.org/dashboard/db/varnish-http-errors doesn't seem to work eigther [18:01:15] either [18:01:28] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:02:08] there is high 500 right now [18:02:31] 600-700/min [18:03:58] OK, finally found something usable: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json [18:06:02] it is /api/rest_v1/metrics/pageviews/ [18:08:46] kaldari: I'm not sure why, but for me loading a grafana dashboard often requires hitting the "refresh" arrow button in the top right corner. https://grafana.wikimedia.org/dashboard/db/varnish-http-errors showed errors until I did that [18:08:51] kaldari: I think it's probably working ok [18:08:55] but try this: [18:08:56] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [18:09:16] (and select for status category '5' and cache_type 'text') [18:09:46] so what's the big spike in 500s?
[18:09:52] if someone else is here, pageviews is "down" [18:10:00] not sure who owns that [18:10:08] looks like it took off around :53 [18:10:10] analytics? [18:10:41] ping someone there, allow me to excuse myself as it is not a core service [18:10:45] was there a recent deploy? [18:10:53] this isn't an analytics problem :P [18:10:56] bd808: Yeah, it seems to work if you use the refresh button. Weird. Any idea what the difference is between https://grafana.wikimedia.org/dashboard/db/varnish-http-errors and https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json? They seem identical. [18:11:04] bblack: I'll ping analytics [18:11:14] 18:10 < bblack> this isn't an analytics problem :P [18:11:15] I just poked them [18:11:24] we have a real 5xx elevation on mediawiki/services [18:11:38] 500 server error, specifically, not 503 [18:11:55] it stopped [18:12:01] what was it? [18:12:08] no idea [18:12:21] just stating facts while tailing the errors :-) [18:12:23] zooming out to the full week on https://grafana.wikimedia.org/dashboard/db/varnish-http-errors shows similar peaks [18:12:46] no idea either [18:12:47] oh, no [18:13:01] it is not fixed, just someone stopped hitting it quickly [18:13:43] the bulk of the 5xx are for: [18:13:44] /api/rest_v1/metrics [18:14:29] other recent peaks are usually 503 for other known reasons [18:15:21] well, I take that back, there's a similarly-shaped 500 chunk about 20 hours ago [18:15:29] jynus> it is /api/rest_v1/metrics/pageviews/ [18:15:34] :-) [18:15:56] bblack: it looks fairly regular actually [18:16:12] /api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/XXXXXXXXXX/2015070100/2016051200 [18:16:15] in fact every day for the past week [18:16:20] https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2015101300/2015102700 seems to result in stuff [18:16:24] where XXXXXXX seems to be every user, one request per user [18:16:34]
https://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&title=MediaWiki+errors&vl=errors+%252F+sec&x=0.5&n=&hreg%255B%255D=vanadium.eqiad.wmnet&mreg%255B%255D=fatal%257Cexception&gtype=stack&glegend=show&aggregate=1&embed=1 and [18:16:34] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%252F+sec&x=0.5&n=&hreg%5b%5d=vanadium.eqiad.wmnet&mreg%5b%5d=fatal%7cexception&gtype=stack&glegend=show&aggregate=1&embed=1 are broken too. [18:16:35] bblack: oh [18:16:47] that doesnt even seem like a valid request [18:16:54] and they all return 500 Server Error [18:17:11] * YuviPanda is around if this ends up being someone doing things from labs [18:17:34] bblack: xxxx should be page title - may be that's what they are passing. though [18:17:42] jouncebot: next [18:17:42] In 0 hour(s) and 42 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1900) [18:17:44] and they're all coming from an IP address in London belonging to "Duedil Limited" [18:17:47] (not us) [18:17:51] hmmm [18:18:09] ooooh Duedil, I have one of their stickers on my laptop... [18:18:17] https://www.duedil.com/ [18:18:20] allow me now to leave (for realz), bye! [18:18:21] * bd808 blames addshore [18:18:25] ;0 [18:18:35] sounds like they're using us to harvest BI [18:18:43] its okay bd808, that laptop has now been replaced ! ;) [18:18:49] and failing :P [18:18:54] "DueDil is an easy-to-use online data tool for sales, marketing, research and risk teams." [18:19:02] BIS? BEES! [18:19:04] big data for privacy invasion! [18:19:33] bd808: you mean an easy-to-use online data tool for sales, marketing, research and risk teams? [18:19:37] anyways, it's still a bug on our end if it's a 500 [18:19:44] it should be a 404, or a 400, or something [18:19:48] indeed bblack, too many requests == many 500's [18:19:49] 500 means our code fell over and died [18:19:50] bblack: yeah [18:20:06] proof i am to blame!
https://usercontent.irccloud-cdn.com/file/HVub9w3g/irccloudcapture-1609682942.jpg [18:20:13] So what do people use to monitor for fatals and exceptions now? All the tools I used to use are broken. Do I have to just tail the logs? [18:20:21] bblack: We have received new hardware and are now investigating better settings in order to solve the problem [18:20:45] kaldari: https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [18:20:46] joal: for what? [18:20:54] all the documentation also refers to the old broken tools :( [18:21:11] bblack: for aqs - The plan is to be able to serve more data without failing [18:21:18] kaldari: specifically https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [18:21:23] ok [18:21:23] bblack: another task we have is to rate limit more strongly [18:21:35] bd808: thanks [18:21:41] so restbase /api/rest_v1/metrics/pageviews/... is backending to AQS? [18:21:52] is that the missing link in my brain in all the above? [18:21:52] bblack: yeah [18:21:57] correct bblack [18:22:08] ok [18:22:26] so I guess AQS gives 500, RB forwards it [18:22:29] updating documentation, like a boss :) [18:23:34] joal: boy what we need is throttling [18:23:40] (03PS1) 10BBlack: cache_misc: puppetize switch to file storage [puppet] - 10https://gerrit.wikimedia.org/r/288440 [18:23:58] joal: nut i really think throttling is osmething service platform should provide cc gwicke [18:24:05] nuria_: yes, but not only, throttling would be howfally low currently [18:24:17] joal: sorry for typos [18:24:21] nuria_: joal if i'm not wrong - bblack> /api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/XXXXXXXXXX/2015070100/2016051200 are actually malformed [18:24:26] nuria_: I aggre we need throttling, but I really hope we can have better performace [18:24:34] np nuria_, I do typos as well [18:25:38] madhuvishy: hmmm, I don't think so, why ?
[18:25:58] The example url you gave has just a very long timespan, so it's very heavy to compute for cassandra [18:26:02] joal: /user/pagetitle/granularity/date/date no [18:26:02] madhuvishy: --^ [18:27:07] Hooooo ! Correct madhuvishy !!!! [18:27:09] my bad [18:27:32] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290254 (10BBlack) So we're currently have several experiments in play trying to figure this out: 1. We've got 2x upstream bu... [18:27:51] joal: if it is indeed these urls that 500 - i'm not sure i completely understand why [18:28:27] it is, I checked oxygen 5xx logs [18:29:43] by the way, if that URL does what it claims to do, wtf? [18:30:00] we expose a public API to dig through pageviews per-article + per-username + timestamp range? [18:30:14] or do I misread that [18:30:18] bblack: no username [18:30:24] ok [18:30:29] user is just the agent type - can be user or spider [18:30:37] ok [18:30:43] the XXXX thing is Page title [18:30:53] ok so this outside party is just scanning all articles I guess [18:30:53] so per article for agent type user [18:31:10] yeah [18:31:15] bblack: but [18:31:22] phew! Nobody can find out I've been learning to build cryptography from reading wikipedia now! 
[18:31:39] bblack: there is supposed to be a parameter in between pagetitle and the timestamp range [18:31:50] hourly/daily granularity specifier [18:32:02] something like https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/2015101300/2015102700 which is similar to the url you pasted [18:32:10] actually returns Not Found [18:32:20] it's there, I just left it out when doing XXXX [18:32:24] oh [18:32:26] okay then [18:32:28] it makes sense [18:32:37] actual example: [18:32:37] /api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Ironbridge/daily/2015070100/2016051200 [18:33:05] too big a date range - hits cassandra - it can't handle it and gives up i suppose [18:40:28] madhuvishy: the thing is that it actually doesn't give up [18:40:56] cassandra will try to load the thing (therefore spends a lot of resource), but restbase throws a 500 because of 2s timeout [18:41:06] !log ran mwscript maintenance/updateCollation.php --wiki=ruwiktionary --force [18:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:37] joal: aah [18:45:00] (03CR) 10Ottomata: [C: 031] "One q, but its only about a comment, so LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [18:48:07] 06Operations, 10Mail, 10MediaWiki-Email: Wiki-Mail sent but never delivered - https://phabricator.wikimedia.org/T134674#2272891 (10Catrope) I looked up @MarcoAurelio's email and the bounce logs associated with it, and we did in fact record bounces for that address on 2016-05-06 at 07:29:15 UTC and 07:48:13 U... 
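Note on the path layout worked out above: the segments after per-article are project, access, agent type ("user" or "spider"), page title, granularity, start and end. A minimal sketch of that layout, assuming plain Python and nothing from RESTBase or AQS themselves; `per_article_url` and `span_days` are hypothetical helper names introduced only for illustration, and the timestamps are assumed to be in the YYYYMMDDHH form seen in the pasted URLs.

```python
from datetime import datetime
from urllib.parse import quote

# Base path as it appears in the URLs pasted in the channel.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, access, agent, title, granularity, start, end):
    """Join the path segments in the order discussed in the channel.

    The page title is percent-encoded so that a '/' inside a title
    cannot shift the remaining segments (an assumption of this sketch).
    """
    segments = [project, access, agent, quote(title, safe=""),
                granularity, start, end]
    return BASE + "/" + "/".join(segments)


def span_days(start, end):
    """Return the number of whole days between two YYYYMMDDHH stamps."""
    fmt = "%Y%m%d%H"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


if __name__ == "__main__":
    print(per_article_url("en.wikipedia", "all-access", "user",
                          "Ironbridge", "daily", "2015070100", "2016051200"))
    print(span_days("2015070100", "2016051200"))
```

A client could use something like `span_days` to split a long range into smaller requests before sending them, which is one way to avoid the very wide date ranges described above; whether that would have stayed under the 2s timeout is not something the log establishes.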
[18:48:16] 06Operations, 10Mail, 10MediaWiki-Email: Wiki-Mail sent but never delivered - https://phabricator.wikimedia.org/T134674#2290309 (10Catrope) [18:51:08] (03PS1) 10Jforrester: Follow-up 6dbf876: Move VisualEditor to secondary status on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) [18:53:10] (03PS1) 10Ottomata: Add temporary hiera config to contitionally set broker protocol version [puppet] - 10https://gerrit.wikimedia.org/r/288451 (https://phabricator.wikimedia.org/T121562) [18:53:21] gwicke: do you guys have any plans to implement throttling on rest base? [18:57:33] (03PS2) 10Ottomata: Add temporary hiera config to contitionally set broker protocol version [puppet] - 10https://gerrit.wikimedia.org/r/288451 (https://phabricator.wikimedia.org/T121562) [18:58:32] (03PS3) 10Ottomata: Add temporary hiera config to contitionally set broker protocol version [puppet] - 10https://gerrit.wikimedia.org/r/288451 (https://phabricator.wikimedia.org/T121562) [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1900). 
[19:01:09] (03CR) 10Ottomata: [C: 032] Add temporary hiera config to contitionally set broker protocol version [puppet] - 10https://gerrit.wikimedia.org/r/288451 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [19:03:04] !log restarting kafka broker on kafka1012 to pick up inter.broker.protocol.version change [19:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:02] nuria_: it is already implemented, see https://github.com/wikimedia/service-runner/blob/master/lib/ratelimiter.js [19:07:14] docs at https://github.com/wikimedia/service-runner#rate-limiting [19:08:19] we also have a filter to configure this per entry point [19:09:09] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 processor/client-side-03 processor/client-side-02 processor/client-side-01 processor/client-side-00 forwarder/legacy-zmq [19:09:09] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 processor/client-side-03 processor/client-side-02 processor/client-side-01 processor/client-side-00 forwarder/legacy-zm [19:09:41] nuria_: what is the ETA on those SSDs? 
[19:09:43] gwicke: great, but from docs i cannot tell whether this is per IP or custom criteria [19:10:13] the limiter is generic, which means that you can limit whatever key you want [19:10:23] gwicke: SSDs are coming but given that to take advantage of them we need to change the compaction strategy changing them is not so easy but that is another conversation [19:10:40] the per-route limiter is per-ip and per-ip/route: https://github.com/wikimedia/hyperswitch/blob/master/lib/filters/ratelimit_route.js [19:10:56] gwicke: ok, sounds good, IP+route [19:11:11] cc joal [19:11:13] also, the limits can be configured depending on the source IP range [19:11:39] so you can set higher limits for internal clients [19:11:59] on that EL ping [19:13:03] nuria_: good to hear that SSDs are coming; at the current request rates, that will probably be enough to fix the issue [19:13:19] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [19:13:19] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. 
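Note on the per-IP/route limiting described above: a limiter that is "generic" over its key can be sketched as a token bucket per key. This is only an illustration under that assumption; it is not the algorithm service-runner or hyperswitch actually implements, and `KeyedRateLimiter`, its parameters and the example addresses are made up for this sketch.

```python
import time


class KeyedRateLimiter:
    """Token bucket per arbitrary key, e.g. (client_ip, route)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a bucket can hold
        self.clock = clock    # injectable so the limiter can be tested
        self.buckets = {}     # key -> (tokens_left, last_update_time)

    def allow(self, key):
        """Return True and consume a token if the key is under its limit."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = KeyedRateLimiter(rate=1, burst=2)
    key = ("192.0.2.1", "/metrics/pageviews/per-article")
    print([limiter.allow(key) for _ in range(3)])
```

Higher limits for internal clients, as mentioned above, could be modelled by keeping a second limiter instance with a larger `rate` and `burst` and choosing between the two based on the source IP range.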
[19:15:50] 06Operations, 10Ops-Access-Requests: Frack (boron and bismuth) access for Darian Patrick - https://phabricator.wikimedia.org/T135165#2290352 (10csteipp) [19:17:12] (03PS3) 10Ottomata: Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [19:17:49] 06Operations, 10Ops-Access-Requests, 10fundraising-tech-ops: Frack (boron and bismuth) access for Darian Patrick - https://phabricator.wikimedia.org/T135165#2290365 (10Jgreen) p:05Triage>03High a:03Jgreen [19:22:20] (03PS4) 10Ottomata: Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [19:23:55] !log adding static route to labtest instance vlan on cr(1|2)-codfw through labtestnet [19:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:05] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259139 (10GWicke) @yurik, given the relatively low volume from graphoid I think just using the public domain... [19:32:02] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2290405 (10Yurik) @GWicke I do not know why its using a proxy - I don't think I ever set it up that way. Grap... [19:37:29] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2290444 (10GWicke) The only reason I could imagine for using the proxy would be restricting access to interna... 
[19:38:50] puppet-lint ---help [19:38:50] puppet-lint: invalid option: ---help [19:38:50] puppet-lint: try 'puppet-lint --help' for more information [19:38:53] lol, what [19:38:58] oh, --- :) [19:48:05] ha wtf [19:53:25] (03PS2) 10Madhuvishy: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 [19:54:56] (03CR) 10jenkins-bot: [V: 04-1] [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 (owner: 10Madhuvishy) [19:55:33] (03PS2) 10Chad: group1 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288417 [19:55:42] (03CR) 10Chad: [C: 032] group1 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288417 (owner: 10Chad) [19:57:02] (03Merged) 10jenkins-bot: group1 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288417 (owner: 10Chad) [19:57:35] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to 1.28.0-wmf.1 [19:57:41] (03PS1) 10Yuvipanda: Add toollabs base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/288464 (https://phabricator.wikimedia.org/T134748) [19:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:59] (03PS4) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:01:40] (03CR) 10Ottomata: Initial debian packaging (036 comments) [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [20:03:16] (03PS2) 10Yuvipanda: Add toollabs base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/288464 (https://phabricator.wikimedia.org/T134748) [20:11:55] (03PS1) 10Andrew Bogott: Customize ldap password for labtest instances [puppet] - 10https://gerrit.wikimedia.org/r/288469 [20:13:30] (03CR) 10Andrew Bogott: [C: 032] Customize ldap password 
for labtest instances [puppet] - 10https://gerrit.wikimedia.org/r/288469 (owner: 10Andrew Bogott) [20:13:33] (03PS1) 10Elukey: Revert "Switch the PXE installer to Trusty to check a boot bug after Jessie install." [puppet] - 10https://gerrit.wikimedia.org/r/288480 [20:13:51] (03PS2) 10Elukey: Revert "Switch the PXE installer to Trusty to check a boot bug after Jessie install." [puppet] - 10https://gerrit.wikimedia.org/r/288480 [20:14:51] (03CR) 10Elukey: [C: 032 V: 032] Revert "Switch the PXE installer to Trusty to check a boot bug after Jessie install." [puppet] - 10https://gerrit.wikimedia.org/r/288480 (owner: 10Elukey) [20:15:12] Does anyone know if this patchset is still relevant? https://gerrit.wikimedia.org/r/#/c/72737/ [20:15:19] trying to do some cleanup on Gerrit [20:15:48] jdlrobson: ottomata might know [20:15:57] (he does stuff with jmxtrans I think) [20:16:05] YuviPanda: likewise https://gerrit.wikimedia.org/r/#/c/72736/ [20:16:17] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2290595 (10RobH) Issue: Post jessie install, system states booting off C, and then fails to boot anything. Troubleshooting done so far: * compared all bios settings... [20:16:21] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2290597 (10GWicke) See also T135171 for a related investigation into VE requests without `if-match` headers. [20:16:31] hah uh [20:16:49] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [20:16:49] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[20:16:52] jdlrobson: same ( ottomata :D) [20:16:54] jdloft: no idea [20:17:31] yeah jdlrobson have no context there [20:17:51] it is probably good, but i don't remember much about the jmxtrans packaging [20:18:24] hey AzaToth are you here? Just dug out 2 really old patches from you and curious if you are still looking for reviewers ^ [20:19:14] (03PS1) 10Andrew Bogott: It should go without saying that the labtest ldap password is only useful if we're on labtest. [puppet] - 10https://gerrit.wikimedia.org/r/288507 [20:22:38] (03CR) 10Andrew Bogott: [C: 032] It should go without saying that the labtest ldap password is only useful if we're on labtest. [puppet] - 10https://gerrit.wikimedia.org/r/288507 (owner: 10Andrew Bogott) [20:29:25] (03CR) 10Jdlrobson: "Is this patchset still relevant? Do we need someone from ops to shepherd this through review?" [debs/jmxtrans] - 10https://gerrit.wikimedia.org/r/72737 (owner: 10AzaToth) [20:30:10] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:30:10] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:30:19] (03PS5) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:30:48] (03CR) 10Ottomata: Initial debian packaging (031 comment) [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [20:31:39] (03CR) 10Ottomata: "OOok!" 
[debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [20:35:11] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:35:11] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:38:23] (03PS1) 10Aude: Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 [20:38:51] aude: Good catch [20:38:56] was about to debug just that [20:38:59] (03PS2) 10Aude: Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) [20:39:10] hoo: :) [20:39:27] (03PS3) 10Hoo man: Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:40:04] if we can deploy before swat, that would be best for me [20:40:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:40:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:40:20] 06Operations, 10ops-eqiad: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2290660 (10Cmjohnson) [20:40:25] aude: Same here [20:40:31] don't want to stay up that long [20:40:43] * aude needs to go out in a bit [20:40:49] 06Operations, 10ops-eqiad: Rack and Set up new application servers 
mw1284-1306 - https://phabricator.wikimedia.org/T134309#2290661 (10Cmjohnson) [20:40:55] ostriches: around? [20:40:56] I'll be here until 12 maybe [20:41:06] * aude looks at the deployment calendar [20:41:52] guess we didn't do the train [20:42:05] or we did? [20:42:12] We were behind. [20:42:14] I did group1 [20:42:22] (03CR) 10Hoo man: [C: 04-1] Set Wikibase repoUrl to use https (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:42:32] I'll catch up group2 later. [20:42:38] We had a redisbagostuff bug [20:42:46] So we held deployments yesterday [20:42:53] (03CR) 10Aude: Set Wikibase repoUrl to use https (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:43:35] (03PS4) 10Aude: Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) [20:44:01] ostriches: would it be ok if we deploy https://gerrit.wikimedia.org/r/#/c/288510/ nowish? [20:44:10] * aude might not be around for swat [20:44:15] https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap reflects reality now [20:44:19] ok [20:45:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:45:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:45:12] (03CR) 10Chad: [C: 04-1] "One minor inline, otherwise lgtm for immediate deploy." 
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: Aude) [20:45:21] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2290664 (RobH) full log for the aqs1006 install: P3061 full log for the aqs1005 install: P3062 [20:46:00] (CR) Aude: Set Wikibase repoUrl to use https (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: Aude) [20:46:46] hoo: do you know if we have a task about the concept uri? [20:46:48] aude: ostriches: No, the concept URI should stay http [20:47:06] It's not meant to be used as a hyperlink [20:47:15] but as an identifier [20:47:30] i don't know if we can change it [20:47:31] and these use http by convention in the semantic web (don't ask me) [20:47:43] Will break some things, like ttl dumps [20:47:43] but it's ok to stay http [20:47:49] what hoo says :) [20:49:15] (CR) Jforrester: [C: -1] "For next week, after wmf.2 goes live on enwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/288450 (https://phabricator.wikimedia.org/T132806) (owner: Jforrester) [20:50:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:50:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:50:44] ostriches: ? [20:52:19] aude, hoo: That's fine.
[20:52:21] Did not know :) [20:52:25] :) [20:52:31] Anyway, fire at will [20:52:31] (03CR) 10DCausse: [C: 031] A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [20:52:41] I'm stepping away for a bit, I'll do group2 later after swat prolly [20:53:12] hoo: ostriches want to +1? [20:53:21] Sure [20:53:56] (03CR) 10Hoo man: [C: 031] Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:54:00] thanks [20:54:40] (03CR) 10Aude: [C: 032] "per hoo and discussed / agreed with Chad in irc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:54:50] i don't know if a +2 works with a -1 [20:54:56] It does [20:55:03] ok [20:55:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:55:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [20:56:07] (03Merged) 10jenkins-bot: Set Wikibase repoUrl to use https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288510 (https://phabricator.wikimedia.org/T135173) (owner: 10Aude) [20:58:23] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Update repoUrl setting for Wikibase (duration: 00m 27s) [20:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:49] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[20:58:49] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:59:04] what? [20:59:57] aude: Works: https://eo.wikipedia.org/w/index.php?title=Speciala%C4%B5o:Ser%C4%89i&profile=default&fulltext=Search&search=Magnus+Manske&searchToken=7cpncz5933lnlzso4112rwbic [21:00:02] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Update repoUrl setting for Wikibase (duration: 00m 32s) [21:00:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [21:00:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [21:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:19] \o/ [21:00:40] !log aude@tin Synchronized wmf-config/Wikibase.php: Update repoUrl setting for Wikibase (duration: 00m 28s) [21:00:48] https://eo.wikipedia.org/w/index.php?title=Speciala?o:Ser?i&profile=default&fulltext=Search&search=katido [21:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:50] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10ssastry) [21:00:51] :) [21:01:06] {{done}} [21:01:09] heh :) [21:01:23] bot isn't here :/ [21:02:26] * aude eats and back later [21:02:50] cu :) [21:05:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [21:05:08] PROBLEM - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, 
sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 [21:05:34] ^^ i see it... [21:07:53] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2290776 (10Jgreen) [21:08:13] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2290791 (10Jgreen) 05Open>03Resolved p:05Triage>03Unbreak! [21:08:40] ACKNOWLEDGEMENT - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 Jeff_Green ticketed as T135178 [21:08:40] ACKNOWLEDGEMENT - check_raid on beryllium is CRITICAL: CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0 Jeff_Green ticketed as T135178 [21:08:48] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [21:08:48] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [21:16:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2290811 (10Jgreen) 05Resolved>03Open closed by accident [21:18:15] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2290820 (10Jgreen) looks like /dev/sda failed: [1509411.577517] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [1509411.577519] sd 0:0:0:0: [sda] CDB:... 
[21:18:39] Operations, ops-eqiad: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2290836 (Jgreen) a:Jgreen>None [21:24:26] (PS3) Awight: Remove deprecated Fundraising thermometer config [mediawiki-config] - https://gerrit.wikimedia.org/r/233900 [21:24:57] (CR) Awight: "@AndyRussG: This needs a CR+1 if you approve... Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/233900 (owner: Awight) [21:27:20] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: puppet fail [21:27:20] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: puppet fail [21:29:30] (PS1) Dzahn: grafana: move files from ./files/ to module [puppet] - https://gerrit.wikimedia.org/r/288516 [21:29:48] (PS2) Dzahn: grafana: move files from ./files/ to module [puppet] - https://gerrit.wikimedia.org/r/288516 [21:31:49] (PS6) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 [21:33:47] (CR) Yuvipanda: [C: 2 V: 2] "Done :)" [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 (owner: Yuvipanda) [21:35:39] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:35:39] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:36:07] (PS1) Yuvipanda: tools: Bump k8s version [puppet] - https://gerrit.wikimedia.org/r/288517 [21:37:07] (PS2) Yuvipanda: tools: Bump k8s version [puppet] - https://gerrit.wikimedia.org/r/288517 [21:37:20] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2290875 (RobH) @papaul states he installed aqs1004 without that error, but it has the cannot boot issue: Attempting Boot From Hard Drive (C:) after post and then...
[21:38:05] (PS3) Yuvipanda: tools: Bump k8s version [puppet] - https://gerrit.wikimedia.org/r/288517 [21:38:24] (CR) Yuvipanda: [C: 2 V: 2] tools: Bump k8s version [puppet] - https://gerrit.wikimedia.org/r/288517 (owner: Yuvipanda) [21:41:08] (CR) AndyRussG: [C: 1] Remove deprecated Fundraising thermometer config [mediawiki-config] - https://gerrit.wikimedia.org/r/233900 (owner: Awight) [21:45:21] (PS1) Yuvipanda: tools: Halt k8s build on errors [puppet] - https://gerrit.wikimedia.org/r/288519 [21:45:38] (CR) Yuvipanda: [C: 2 V: 2] tools: Halt k8s build on errors [puppet] - https://gerrit.wikimedia.org/r/288519 (owner: Yuvipanda) [21:50:09] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15301 bytes in 0.002 second response time [21:50:09] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15301 bytes in 0.002 second response time [21:50:15] legoktm: I can't reproduce the problem you described at [[gerrit:286688]] [21:50:39] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15301 bytes in 0.010 second response time [21:50:39] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15301 bytes in 0.010 second response time [21:50:41] I assigned the new right to all users, but they are still not able to protect a page [21:51:05] Luke081515: I said effectively, not literally. The concept is flawed to begin with, I'll post a longer comment on the bug in a few minutes. [21:51:15] ok [21:54:08] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:54:08] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:55:30] why do we have two icingas?
[21:56:50] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290928 (BBlack) Has anyone been able to reproduce any of the problems in the tickets merged into here, since roughly the ti... [21:56:59] (PS1) Andrew Bogott: Labs puppetmaster: Get labs instance IP range from hiera [puppet] - https://gerrit.wikimedia.org/r/288521 [21:58:06] legoktm: Concerning the second patch: A reversible deletion of a tag would be the same as currently deactivating it [21:59:55] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (Smalyshev) I don't see any weirdness anymore on query.wikidata.org/ @Jonas ? [22:01:56] Operations, Discovery, Traffic, Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290947 (Jonas) Problem seems to be fixed. Thank you!
[22:02:57] (03PS2) 10Andrew Bogott: Labs puppetmaster: Get labs instance IP range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/288521 [22:04:18] (03CR) 10jenkins-bot: [V: 04-1] Labs puppetmaster: Get labs instance IP range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/288521 (owner: 10Andrew Bogott) [22:05:06] (03PS3) 10Andrew Bogott: Labs puppetmaster: Get labs instance IP range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/288521 [22:08:02] (03CR) 10Andrew Bogott: [C: 032] Labs puppetmaster: Get labs instance IP range from hiera [puppet] - 10https://gerrit.wikimedia.org/r/288521 (owner: 10Andrew Bogott) [22:14:49] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [22:14:49] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [22:15:12] (03PS1) 10Yuvipanda: tools: Fetch tags too before building k8s [puppet] - 10https://gerrit.wikimedia.org/r/288525 [22:19:39] YuviPanda: just because you're active: we seem to have two icinga bots [22:20:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 604 [22:20:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 604 [22:20:33] greg-g: so obviously you want a 3rd started up right? [22:21:04] (03PS2) 10Yuvipanda: tools: Fetch tags too before building k8s [puppet] - 10https://gerrit.wikimedia.org/r/288525 [22:21:18] puppet o'clock will be awesome with triple notices! 
[22:21:20] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fetch tags too before building k8s [puppet] - 10https://gerrit.wikimedia.org/r/288525 (owner: 10Yuvipanda) [22:24:16] I shouldn't be watching this channel much anyways right now, so, sure :) [22:30:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 177798 Threads: 2 Questions: 3603541 Slow queries: 4859 Opens: 35248 Flush tables: 2 Open tables: 64 Queries per second avg: 20.267 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [22:30:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 177798 Threads: 2 Questions: 3603541 Slow queries: 4859 Opens: 35248 Flush tables: 2 Open tables: 64 Queries per second avg: 20.267 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [22:32:08] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291063 (10BBlack) Yeah I tend to agree too. I think if we're concerned at all about status.wm.o perf during outages, we could probably also tack on a secondary task to extend the apache co... [22:32:58] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291073 (10BBlack) Also note: while in there, should convert wikitech-static to cron'd letsencrypt (using our prod script!), and then use that for the status.wm.o cert as well. [22:37:08] (03PS1) 10Andrew Bogott: Add private bits of labsldapconfig [labs/private] - 10https://gerrit.wikimedia.org/r/288529 [22:37:37] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#366296 (10Krenair) Yes, we should. Unfortunately wikitech-static might be a pain since it does not use puppet (and for obvious reasons cannot reach the production puppetmaster). 
:/ [22:38:43] (PS1) Andrew Bogott: Create the labsldapconfig hiera struct [puppet] - https://gerrit.wikimedia.org/r/288530 [22:38:46] Operations, Traffic, HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291120 (BBlack) it's ok, we can just copy down the acme-setup script as it exists today (well, and acme-tiny). for a 1-2 cert setup like this, it's not hard to use it puppet-free from cr... [22:39:16] (CR) Andrew Bogott: [C: 2 V: 2] Add private bits of labsldapconfig [labs/private] - https://gerrit.wikimedia.org/r/288529 (owner: Andrew Bogott) [22:47:17] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2291168 (RobH) So @dzahn was able to work around the dependency issue, I've asked him to put an update, but I'll attempt to paraphrase from irc: > when the insta... [22:47:36] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2291170 (RobH) Additionally, they still have the error of: Attempting Boot From Hard Drive (C:) When they should boot up the OS.
[22:48:16] (03PS2) 10BBlack: cache_misc: puppetize switch to file storage [puppet] - 10https://gerrit.wikimedia.org/r/288440 [22:48:24] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: puppetize switch to file storage [puppet] - 10https://gerrit.wikimedia.org/r/288440 (owner: 10BBlack) [22:50:19] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:19] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:46] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291173 (10ssastry) I installed [[https://www.joedog.org/siege-manual/|Siege]] and ran a 10 minute test on my laptop with node 0.10 and node 4.3 (siege -v -i -t 10m -f urls.txt is the... [22:50:49] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:49] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:49] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:49] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:50:55] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291174 (10ssastry) p:05Triage>03Normal [22:51:19] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:51:19] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:51:20] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:51:20] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago [22:51:29] that's me, turning on puppet for cache_misc for the first time in a while [22:51:47] also I see wdqs1001 was disabled while puppet was off: that just now went into real effect :) [22:52:01] but healthchecks would've kept it 
out of rotation anyways [22:52:20] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:20] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:48] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:48] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:48] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:52:48] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:19] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:19] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:19] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:19] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:00:04] RoanKattouw ostriches Krenair awight Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T2300). Please do the needful. [23:00:04] RoanKattouw Dereckson ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:06] (03PS3) 10Andrew Bogott: Create the labsldapconfig hiera struct [puppet] - 10https://gerrit.wikimedia.org/r/288530 [23:00:07] Hello. [23:00:08] (03PS1) 10Andrew Bogott: Remove a couple of unused settings from ldap config. 
[puppet] - 10https://gerrit.wikimedia.org/r/288536 [23:00:39] ebernhardson: any order for your patches (esp. the config one)? [23:00:50] because \o [23:00:54] Dereckson: the order listed [23:01:02] Dereckson: config needs to go before wikimedia events [23:01:27] So we can merge it in config before the CirrusSearch change? [23:01:31] * RoanKattouw waves [23:01:38] Dereckson: yes that will work too [23:01:48] Dereckson: cirrus and config prep things for the wikimedia events patch [23:01:54] those two can go same time/either order [23:02:06] Okay. I so offer we do the config series first. [23:02:15] sure [23:02:21] (03PS2) 10Dereckson: Enable cross-wiki notifications by default in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287035 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [23:03:01] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287035 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [23:03:47] (03Merged) 10jenkins-bot: Enable cross-wiki notifications by default in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287035 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [23:03:48] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2291269 (10Yurik) Mobile wiki would like to use maps in their "nearby" feature, just like it is already done in the android app. I propose that we allow the maps tile service access from all WMF se... [23:03:54] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2291273 (10ssastry) Also, the load on my laptop was also lower with node v4.3 (load average: 3.10, 2.45, 1.90) vs node v0.10.26 (load average: 5.11, 4.33, 2.98). So, that is a win as... 
[23:04:05] cc: bblack https://phabricator.wikimedia.org/T133744 [23:06:03] (03PS1) 10Andrew Bogott: Sync up passwords with hiera and the password module [labs/private] - 10https://gerrit.wikimedia.org/r/288539 [23:06:08] (03PS2) 10Andrew Bogott: Remove a couple of unused settings from ldap config [puppet] - 10https://gerrit.wikimedia.org/r/288536 [23:06:10] (03PS4) 10Andrew Bogott: Create the labsldapconfig hiera struct [puppet] - 10https://gerrit.wikimedia.org/r/288530 [23:06:21] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable cross-wiki notifications by default in production ([[Gerrit:287035]], 1/2) (duration: 00m 42s) [23:06:26] (03CR) 10Andrew Bogott: [C: 032] Sync up passwords with hiera and the password module [labs/private] - 10https://gerrit.wikimedia.org/r/288539 (owner: 10Andrew Bogott) [23:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:33] (03CR) 10Andrew Bogott: [V: 032] Sync up passwords with hiera and the password module [labs/private] - 10https://gerrit.wikimedia.org/r/288539 (owner: 10Andrew Bogott) [23:06:54] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable cross-wiki notifications by default in production ([[Gerrit:287035]], 2/2) (duration: 00m 29s) [23:06:55] RoanKattouw: here you are ^ [23:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:04] Yay, thanks [23:07:06] Testing now [23:07:54] (03PS2) 10Dereckson: UK EU edit-a-thon throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288377 (https://phabricator.wikimedia.org/T134902) [23:08:32] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10GWicke) After moving to 4.x, you might also be able to get back more performance & memory by removing shims: T129598 [23:08:40] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/288377 (https://phabricator.wikimedia.org/T134902) (owner: 10Dereckson) [23:09:18] (03Merged) 10jenkins-bot: UK EU edit-a-thon throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288377 (https://phabricator.wikimedia.org/T134902) (owner: 10Dereckson) [23:09:51] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2291318 (10BBlack) We'd like to use this basically everywhere that maps make sense, obviously. But it's been locked down in a limited-beta-like status because it's not deployed in a production fas... [23:10:46] !log dereckson@tin Synchronized wmf-config/throttle.php: UK EU edit-a-thon throttle rule ([[Gerrit:288377]], T134902) (duration: 00m 28s) [23:10:47] T134902: Requesting temporary lift of IP cap on en.wp for 2016-05-14 - https://phabricator.wikimedia.org/T134902 [23:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:29] Dereckson: Confirmed working, thanks! [23:11:32] ebernhardson: what about https://gerrit.wikimedia.org/r/#/c/268048/ ? [23:11:49] RoanKattouw: cool, long live to cross wiki notifications. [23:12:34] (03PS2) 10Dereckson: Allow import from outreach to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288392 (https://phabricator.wikimedia.org/T134788) [23:12:41] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288392 (https://phabricator.wikimedia.org/T134788) (owner: 10Dereckson) [23:12:50] Dereckson: thats the config patch, it sets up a query string trigger to adjust some cirrus config [23:12:55] aka ship it :) [23:12:56] bblack, i was hoping to enable it so that devs can already play with it (mobile has this limited beta mode thing). 
Without it, they can only start developing after it is fully production, which means it won't really go into production until at least quarter to a half a year later [23:12:59] (03CR) 10Andrew Bogott: [C: 032 V: 032] Create the labsldapconfig hiera struct [puppet] - 10https://gerrit.wikimedia.org/r/288530 (owner: 10Andrew Bogott) [23:13:01] (03PS7) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [23:13:09] (rebased so it stops saying can't merge) [23:13:13] k [23:13:34] (03Merged) 10jenkins-bot: Allow import from outreach to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288392 (https://phabricator.wikimedia.org/T134788) (owner: 10Dereckson) [23:16:08] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2291333 (10Yurik) @bblack, I was hoping we can enable it so that devs can already play with it. Mobile has this limited beta mode thing specifically for this. But with the current referer check, th... [23:16:37] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow import from outreach to meta (T134788) (duration: 00m 27s) [23:16:37] T134788: Add Outreach Wiki to Special:Import on Meta Wiki - https://phabricator.wikimedia.org/T134788 [23:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:58] Testing. [23:18:59] Well, I've noone with meta sysop access available, I'll ask testing on the Phabricator task. [23:19:12] Dereckson: sure [23:19:38] So hoo tested it. Works. Okay go for Cirrus. 
[23:19:57] esweet [23:20:20] (03PS8) 10Dereckson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [23:20:31] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [23:21:06] ebernhardson: mw1017 first? [23:21:10] (03Merged) 10jenkins-bot: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [23:22:30] !log dereckson@tin Synchronized tests/cirrusTest.php: A/B/C test of control vs textcat vs accept-lang + textcat (no-op) (duration: 00m 25s) [23:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:00] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.048 second response time [23:23:00] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.048 second response time [23:23:12] yurik: technically, don't they not have a referrer-block problem anyways? 
[23:23:30] bblack, android app doesn't, but m.wikipedia.org does [23:23:46] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: A/B/C test of control vs textcat vs accept-lang + textcat ([[Gerrit:268048]]) (duration: 00m 25s) [23:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:45] bblack, that's why i propose we remove the referer check completely, but don't allow or on any wikipedias yet [23:24:57] i mean remove for wmf sites [23:25:36] do you really anticipate much work to re-deploy your maps-test200x stuff to the real maps200x boxes? they're already being installed... [23:26:47] bblack, hard to say - the tile generation has significantly changed by mapbox, so might need some work there. We could try to copy the existing db (cassandra) over, but even with the absolute best case, it takes about a week to download and install the postgres alone [23:27:10] i just don't want to block devs :) [23:27:16] if they want to hack on it now [23:27:18] https://phabricator.wikimedia.org/T134406 [23:27:37] "the tile generation has significantly changed by mapbox" ? [23:27:47] what does that have to do with moving from one hardware box to another? [23:28:23] we don't want to simply copy the postgres db over - we are upgrading the osm2pgsql to the latest in the process, so the postgres will have to reinstall the whole db anew [23:28:36] the tile regeneration is a bit more finicky [23:28:37] anyways, akosiaris is probably way better informed on this topic than I am. [23:28:42] Dereckson: shouldn't need to go to mw1017 first [23:28:54] Dereckson: i'm not seeing the trigger it setup working on enwiki yet though :S [23:28:57] IMHO, you should decouple other upgrades and changes from getting on proper hardware. there's no need for that inter-dependency if you're in a rush [23:29:16] there's already puppetization, you can copy over the live data for continuing replication, etc?
[23:29:48] Dereckson: oh, train didn't roll forward today? enwiki is still on 1.2 [23:29:51] 1.27 [23:30:29] bblack, tiles we can copy, but the rule of thumb with OSM data is that you always want to re-import it once in a while, so might as well do it in this case. Referer is currently only blocking various devs from experimenting [23:30:35] I think the train did go forward but only to group1 [23:30:39] i.e. the Wednesday change a day late [23:30:39] ebernhardson: config change live on mw1017 [23:31:17] bblack, so we will reimport postgres, and probably copy tiles, and slowly will start regenerating to update them to the latest version [23:31:17] no, it's trying (probably without being fully successful) to keep production dependencies off of it [23:31:20] test.wikipedia.org is still up so the syntax of CirrusSearch-common.php looks good [23:31:35] Dereckson: it won't do anything, the necessary code is in 1.28.0-wmf.1, and the config change only applies to enwiki [23:31:37] yurik: part of going into production is having a plan to be able to do that without disrupting service, surely? [23:31:43] ebernhardson: k [23:31:54] bblack, sure [23:32:02] Dereckson: i suppose we have to skip deploying the WikimediaEvents stuff, since the train didn't roll forward and dependent code is there [23:32:09] i'll have to start the test monday then [23:32:14] okay [23:32:19] the rest is fine to leave out there though, can't hurt anything [23:32:19] so, can't you just copy over what you have today, and do this other update online without disrupting the new production service? [23:32:54] bblack, we could, but it is much harder to do an "in-place" upgrade than start from a clean slate and switch over once it's fully ready [23:33:05] let's sync IS so [23:33:09] so how do you plan to do the next in-place upgrade?
[23:33:39] bblack, it is still doable, just takes longer [23:33:42] these are questions that probably need answering, or we're stuck with a live production system that can't be upgraded without taking it offline for significant windows, right? [23:33:42] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: A/B/C test of control vs textcat vs accept-lang + textcat ([[Gerrit:268048]]) (duration: 00m 25s) [23:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:54] ebernhardson: here you are for config ^ [23:34:07] Dereckson: thanks [23:34:25] You want https://gerrit.wikimedia.org/r/#/q/288465,n,z on wmf23 too ? [23:34:32] anyways [23:34:46] Dereckson: nah there are even more patches we would have to pull forward for it to work [23:34:48] bblack, it is totally upgradable, but basically we will take a few servers down, upgrade the DB, regenerate tiles, and switch masters. For smaller upgrades we will simply create a copy of a db [23:34:54] Dereckson: i'll just wait for monday when (hopefully) group2 is on wmf.1 [23:35:18] but when we start with clean machines, it is simply easier to do it without worrying about killing the server :) [23:35:31] ebernhardson: and you want to deploy it Monday or now? If now, testable on a wmf.1 wiki? [23:35:57] Dereckson: the test only runs on enwiki (it's language specific, more languages coming in the future), so let's just hold back [23:36:03] Okay [23:36:26] So we're done? [23:36:35] Dereckson: actually, i'm watching fatalmonitor and we need to revert the config change [23:36:41] k [23:36:51] it uses a new config option that would be fine if everything was on wmf.1, but it's not and causes division by zero warnings [23:37:11] * ebernhardson forgot about that part ... 
[23:37:22] no user impact, just log spam [23:38:50] (03PS1) 10Dereckson: Revert "A/B/C test of control vs textcat vs accept-lang + textcat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288545 [23:39:21] lemme update the commit message real quick [23:39:25] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288545 (owner: 10Dereckson) [23:39:48] or not :P it's fine the commit message describes enough i was just going to be more verbose [23:40:15] (03Merged) 10jenkins-bot: Revert "A/B/C test of control vs textcat vs accept-lang + textcat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288545 (owner: 10Dereckson) [23:40:27] sorry for the bad timing [23:40:49] You can still add a comment on Gerrit [23:40:57] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2291414 (10Dzahn) Yep, so the "install software" / tasksel step of the installer failed and there were the "packages have unmet dependencies:" errors Rob pasted above... 
[23:41:14] (03PS1) 10EBernhardson: Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 [23:41:32] Dereckson: no worries, i just put the more verbose version in the revert of the revert which will ship monday (hopefully) [23:41:50] (03PS2) 10EBernhardson: Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 [23:42:19] woo, might hit 200k warnings before the undeploy :) [23:42:36] !log dereckson@tin Synchronized tests/cirrusTest.php: Revert "A/B/C test of control vs textcat vs accept-lang + textcat" ([[Gerrit:288545]], 1/3) (duration: 00m 24s) [23:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:07] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Revert "A/B/C test of control vs textcat vs accept-lang + textcat" ([[Gerrit:288545]], 2/3) (duration: 00m 25s) [23:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:33] looks to have stopped increasing for the most part. A few straggling warnings will probably still come in that are cached in syslog around the cluster [23:43:35] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: Revert "A/B/C test of control vs textcat vs accept-lang + textcat" ([[Gerrit:288545]], 3/3) (duration: 00m 25s) [23:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:48] Revert done and sync'ed. [23:44:07] Now let's revert /wmf.1 [23:44:26] the cirrus patch deployed to wmf.1 is fine to leave out there [23:45:04] We don't want to let undeployed code in branches: the wmf branches must reflect as much as possible the current code really deployed on servers. 
[23:45:15] Dereckson: oh, just go ahead and deploy it to the cluster then [23:45:31] * ebernhardson thought it was already synced out [23:45:51] oh okay, no I did the config first [23:45:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 0.026 second response time [23:46:08] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 0.043 second response time [23:47:04] So, you can test 288465 right now, without waiting en.wikipedia to be php-1.28.0-wmf.1? [23:47:19] Dereckson: yes, from mwrepl [23:47:45] without the config change that code path wont ever actually be hit [23:47:59] duplicate icinga-wm detected. self destruction sequence initiated [23:48:11] So what's the point to deploy it now? [23:48:28] Dereckson: it's already merged, and doesn't hurt anything. Makes monday simpler [23:49:21] You're confident with your tests on the console? No surprise for the train when they'll upgrade to php-1.28.0-wmf.1? [23:49:30] Dereckson: yes, i can run api requests from mwrepl [23:49:39] Okay, let's deploy it. [23:51:14] 06Operations, 10netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2291418 (10faidon) 05Open>03stalled The issue recovered before they had a chance to investigate. I have been asked whether we are interested in an formal RFO (reason for outage), which is... [23:53:01] Okay, ready on Tin. 
[23:54:13] !log dereckson@tin Synchronized php-1.28.0-wmf.1/extensions/CirrusSearch/includes/CirrusSearch.php: Adjust textcat data collection for AB test (T121542) (duration: 00m 26s) [23:54:14] T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification - https://phabricator.wikimedia.org/T121542 [23:54:16] ebernhardson: ^ [23:54:21] Dereckson: pulling up a test now [23:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:02] Dereckson: looks good, no errors [23:58:06] !log restart slapd instances. we have some seemingly ldap issues in labs but it may not be the servers, could be bad merged changes. [23:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master