[00:40:20] (03PS1) 10Alex Monk: phabricator: Add NE flag to old task creation URL redirect [puppet] - 10https://gerrit.wikimedia.org/r/272426 (https://phabricator.wikimedia.org/T127286) [01:05:47] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2050230 (10Danny_B) p:5High>3Unbreak! P... [01:46:53] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [02:06:50] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2050285 (10Danny_B) Not only headline conte... [02:07:55] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2050286 (10Danny_B) [02:13:29] !log Logstash process on logstash1002 died from jvm OOM [02:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:14:17] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2050287 (10Danny_B) [02:15:13] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [02:23:55] bd808: does the service not restart automatically? 
[02:24:13] Apparently not [02:24:43] (03PS1) 10Ori.livneh: gdash: decom [puppet] - 10https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) [02:25:10] (03PS1) 10Ori.livneh: xhgui: sanitize keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272428 [02:25:12] (03PS1) 10Ori.livneh: xhgui: sanitize query string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272429 [02:25:17] service logstash status said "Active: active (exited) since Thu 2016-01-21 01:13:53 UTC; 1 months 1 days ago" [02:29:35] interesting: it's a jessie system, so it's using systemd as its init. But there is no unit file for logstash; systemd can turn files in /etc/init.d into virtual services, apparently [02:29:56] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 13m 28s) [02:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:04] the question is then: how does one configure the restart behavior? [02:31:04] The latest upstream version is still shipping a SysV init file for their debian package [02:31:22] you can see the autogenerated unit file with 'systemctl cat logstash.service'; it has the comment: "# Automatically generated by systemd-sysv-generator" [02:31:46] 'Restart=no' [02:32:59] Perhaps we should make our own unit file for it [02:34:31] or contribute one upstream [02:35:03] ori: filed as https://phabricator.wikimedia.org/T127677 [02:39:11] eww. 
They build the deb via FPM and rake [02:54:31] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 12m 05s) [02:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:03:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 22 03:03:21 UTC 2016 (duration 8m 50s) [03:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:31:04] 6Operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2050381 (10Danny_B) [03:31:33] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [03:49:03] (03PS2) 10Ori.livneh: gdash: decom [puppet] - 10https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) [03:55:43] (03CR) 10Ori.livneh: [C: 032] xhgui: sanitize keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272428 (owner: 10Ori.livneh) [03:56:11] (03Merged) 10jenkins-bot: xhgui: sanitize keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272428 (owner: 10Ori.livneh) [03:56:18] (03CR) 10Ori.livneh: [C: 032] xhgui: sanitize query string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272429 (owner: 10Ori.livneh) [03:56:23] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [03:56:44] (03Merged) 10jenkins-bot: xhgui: sanitize query string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272429 (owner: 10Ori.livneh) [04:00:07] !log ori@tin Synchronized wmf-config/StartProfiler.php: Ie4c87619: xhgui: sanitize query string & I219c0901: xhgui: sanitize keys (duration: 01m 45s) [04:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:14:41] (03PS1) 10Ori.livneh: xhgui: require auth only for POST; allow anon GETs [puppet] - 10https://gerrit.wikimedia.org/r/272431 [04:14:53] 'git-review' is super-slow; I wonder if something 
is up with Gerrit. [04:15:18] (03PS2) 10Ori.livneh: xhgui: require auth only for POST; allow anon GETs [puppet] - 10https://gerrit.wikimedia.org/r/272431 [04:17:00] ytterbium seems fine, so maybe it was just a fluke. [04:17:14] (03PS3) 10Ori.livneh: xhgui: require auth only for POST; allow anon GETs [puppet] - 10https://gerrit.wikimedia.org/r/272431 [04:18:10] (03CR) 10Ori.livneh: [C: 032 V: 032] xhgui: require auth only for POST; allow anon GETs [puppet] - 10https://gerrit.wikimedia.org/r/272431 (owner: 10Ori.livneh) [04:25:33] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [04:28:47] (03CR) 10BryanDavis: "Posted for 2016-02-23T00:00 SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: 10BryanDavis) [04:37:40] ori, do you ever sleep? :) [04:38:00] (03CR) 10Chad: [C: 04-2] "While it looks like the overwhelming consensus in the RFC is for disabling (and that's probably not going to change), I haven't seen a clo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: 10MarcoAurelio) [04:39:00] subbu: when you have a four-year old you get very very good at being productive in 5-minute sessions. 
So it looks like a block of work, but it's really a whole bunch of occasional skirmishes :) [04:39:57] :-) [04:51:20] (03PS1) 10BryanDavis: Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) [04:57:34] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:13:44] 7Puppet: Service_unit[uwsgi-startup] causes log churn - https://phabricator.wikimedia.org/T127684#2050476 (10ori) [05:24:20] (03PS1) 10Tim Landscheidt: Tools: Fix undefined variable in toollabs::kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/272438 [05:26:03] (03CR) 10Ori.livneh: [C: 032] Tools: Fix undefined variable in toollabs::kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/272438 (owner: 10Tim Landscheidt) [05:52:53] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: Puppet has 1 failures [06:09:37] (03PS5) 10Tim Landscheidt: puppetmaster: Fix git-sync-upstream for unclean rebases [puppet] - 10https://gerrit.wikimedia.org/r/264692 [06:17:53] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:31:02] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:04] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:16] (03PS1) 10Tim Landscheidt: Tools: Fix puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/272440 [06:55:53] RECOVERY - puppet last run on sca1001 
is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:56:24] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:58:13] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:58:22] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:23] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:03:58] (03PS1) 10Tim Landscheidt: Tools: Remove obsolete classes [puppet] - 10https://gerrit.wikimedia.org/r/272441 [07:04:56] (03CR) 10Tim Landscheidt: "Confirmed by http://tools.wmflabs.org/watroles/role/role::labs::tools::submit (and http://tools.wmflabs.org/watroles/role/toollabs::submit" [puppet] - 10https://gerrit.wikimedia.org/r/272441 (owner: 10Tim Landscheidt) [08:39:49] (03PS7) 10Phedenskog: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [08:41:03] phedenskog: yep; it crashed. Different reason, this time. Embarrassing! I'll fix it. 
[08:41:32] ori: thanks :) [08:43:29] (03PS1) 10Ori.livneh: statsv: set Restart=always in unit file [puppet] - 10https://gerrit.wikimedia.org/r/272448 [08:43:46] (03PS2) 10Ori.livneh: statsv: set Restart=always in unit file [puppet] - 10https://gerrit.wikimedia.org/r/272448 [08:43:53] (03CR) 10Ori.livneh: [C: 032 V: 032] statsv: set Restart=always in unit file [puppet] - 10https://gerrit.wikimedia.org/r/272448 (owner: 10Ori.livneh) [08:50:34] 6Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2050620 (10scfc) [08:52:10] 6Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#1973024 (10scfc) I'm never quite sure if and how this applies to templates as well (not the lint warning, but the underlying issue), so if someone has investigated... [09:24:20] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "This should list stuff that "is", not stuff that "should be". Thanks for re-adding P513." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271971 (owner: 10Matěj Suchánek) [09:25:54] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050647 (10elukey) @Ori and all: mc1001->mc1003 and mc1014->mc1018 (memcached hosts) would still need to be migrated to Debian for https://phabricat... [09:29:16] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050648 (10ori) >>! In T126700#2050647, @elukey wrote: > @Ori and all: > > mc1001->mc1003 and mc1014->mc1018 (memcached hosts) would still need to b... 
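A note on the Restart= thread in this log: the autogenerated logstash unit carried 'Restart=no', and the statsv unit was changed to 'Restart=always'. The following is only a minimal sketch of how one might check that setting in unit file text. The function name `restart_policy` and both sample unit bodies are hypothetical; the only lines taken from the log are the generator comment and the two Restart= values. It assumes that the last assignment in [Service] wins and that an unset or empty Restart= means "no".

```python
def restart_policy(unit_text: str) -> str:
    """Return the Restart= value found in the [Service] section of unit text.

    Falls back to "no" when the setting is absent or assigned an empty value
    (assumption: this mirrors systemd's default for Restart=).
    """
    section = None
    policy = "no"
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which section we are in, e.g. [Unit] or [Service].
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "Restart":
                # A later assignment overrides an earlier one.
                policy = value.strip() or "no"
    return policy


# Hypothetical sample resembling the generated unit quoted in the log.
GENERATED_UNIT = """\
# Automatically generated by systemd-sysv-generator
[Service]
Restart=no
"""

# Hypothetical hand-written unit using the value from the statsv change.
PATCHED_UNIT = """\
[Service]
Restart=always
"""

if __name__ == "__main__":
    print(restart_policy(GENERATED_UNIT))
    print(restart_policy(PATCHED_UNIT))
```

Feeding it the output of 'systemctl cat logstash.service' would be one way to use it, though that pairing is an illustration rather than anything done in the channel.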
[09:34:42] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050650 (10elukey) > It would be good if you could wait a day Even a week, no real hurry, it was only to organize the work and set some ETA in my ph... [09:36:54] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2050651 (10elukey) https://phabricator.wikimedia.org/T126700 is still a blocker for this task, the rest of the hosts will be re-imaged probably in the second part of the wee... [09:37:19] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050653 (10elukey) [09:37:21] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2050652 (10elukey) [09:38:08] --^ this guy is a real spammer :) [09:38:27] (03CR) 10Filippo Giunchedi: [C: 031] gdash: decom [puppet] - 10https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh) [09:43:16] elukey: I'm sure there is a wikilove banner for spamming [09:45:32] :) [09:46:43] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: No response from NTP server [10:12:36] (03PS9) 10Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [10:26:22] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050711 (10ori) >>! In T126700#2036552, @Anomie wrote: > Further, I took a look at the hits to centralauth-user and global:user during the time that... 
[10:38:13] (03PS1) 10Filippo Giunchedi: swiftrepl: fix destination container listing limit [software] - 10https://gerrit.wikimedia.org/r/272455 [10:38:42] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [10:39:43] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050715 (10ori) @Anomie, when I snoop Redis GETs I see something even more bizarre: each request results in three to four `enwiki:MWSession:XXX` GETs... [10:40:32] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [10:43:20] elukey: is mc1007 not pooled or something? [10:43:28] memcached is active, but redis is not? [10:43:34] see memcached stats and redis stats [10:45:03] no, redis is active [10:45:10] just not updating stats, perhaps [10:46:18] <_joe_> sudo tcpdump -nv dst port 6379 tells me there is quite some activity on redis [10:46:55] <_joe_> on mc1007 that is [10:47:18] even 1009 shows low activity in Ganglia [10:47:56] yeah, the debian hosts aren't sending redis metrics to ganglia, it seems [10:48:07] weird gmond behavior? [10:48:18] <_joe_> ok this is an important bug of our ganglia redis collector [10:48:19] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050716 (10Tgr) Not disagreeing about the importance to fix that (Brad already has a patch for the CA lookup in T127236), but there is no reason the... [10:48:40] <_joe_> it might not be in sync with what a newer redis spits out [10:49:10] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050717 (10ori) >>! In T126700#2050716, @Tgr wrote: > Not disagreeing about the importance to fix that (Brad already has a patch for the CA lookup in... 
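A note on the collector speculation above (that it "might not be in sync with what a newer redis spits out"): this is a rough sketch, not the actual ganglia plugin, of reading one metric from Redis INFO-style text. It assumes the usual INFO layout of "# Section" header lines and "key:value" pairs. The helper names and the sample text are hypothetical; only the metric name instantaneous_ops_per_sec comes from the log. Returning None for a missing field is a design choice that keeps "metric absent" distinct from "zero activity".

```python
from typing import Dict, Optional


def parse_redis_info(text: str) -> Dict[str, str]:
    """Parse INFO-style text into a flat dict of string values."""
    stats: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    return stats


def ops_per_sec(text: str) -> Optional[int]:
    """Return instantaneous_ops_per_sec as an int, or None if it is absent."""
    value = parse_redis_info(text).get("instantaneous_ops_per_sec")
    return int(value) if value is not None else None


# Hypothetical sample; the numbers are made up for illustration.
SAMPLE_INFO = (
    "# Stats\r\n"
    "total_connections_received:42\r\n"
    "instantaneous_ops_per_sec:250\r\n"
)

if __name__ == "__main__":
    print(ops_per_sec(SAMPLE_INFO))
```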
[10:50:35] that ended up masking the increase in ops/sec, visible when you look at a machine that has not yet been reimaged: http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Memcached+eqiad&h=mc1017.eqiad.wmnet&jr=&js=&v=1464&m=instantaneous_ops_per_sec&vl=ops%2Fs&ti=instantaneous_ops_per_sec [10:51:02] RECOVERY - salt-minion processes on scandium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:53:30] !log started salt-minion on scandium (process had died) [10:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:54:25] ori, _joe_ I am going to brb in ~1 hour, I'll try to follow Joe's suggestion about checking the redis collector. [10:54:53] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.0001752376556 secs [10:55:03] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050735 (10ori) >>! In T126700#2050715, @ori wrote: > @Anomie, when I snoop Redis GETs I see something even more bizarre: each request results in thr... 
[10:55:03] (probably Joe already gathered the answer in this timeframe but I'll try in case he didn't :) [10:55:16] <_joe_> elukey: I didn't [10:55:28] <_joe_> I'm not really looking at this right now [10:55:53] all right, I'll take a look at it and get back to the phab task in case I find anything [10:58:08] ok, i'm off for real [10:58:09] bye [11:09:14] !log restarting apache on graphite1001 for glibc update [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:39:31] (03PS2) 10Filippo Giunchedi: swiftrepl: fix destination container listing limit [software] - 10https://gerrit.wikimedia.org/r/272455 (https://phabricator.wikimedia.org/T125791) [12:03:23] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail [12:03:33] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: puppet fail [12:03:33] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: puppet fail [12:04:03] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: puppet fail [12:08:14] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures [12:23:19] !log restarting apache on neon for glibc update [12:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:23] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:31:43] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:22] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:32:23] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:44] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:05:11] (03PS2) 10Gehel: Adding gehel to some shinken notifications for labs 
[puppet] - 10https://gerrit.wikimedia.org/r/270729 [13:06:46] (03CR) 10Hashar: [C: 031] "Looks all legit to me. On labs we use Shinken which got setup by Yuvi Panda. I have no idea on which labs projects / instances it is run" [puppet] - 10https://gerrit.wikimedia.org/r/270729 (owner: 10Gehel) [13:11:54] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2050885 (10Anomie) >>! In T126700#2050715, @ori wrote: > @Anomie, when I snoop Redis GETs I see something even more bizarre: each request results in... [13:16:54] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [13:17:22] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: (urllib3.connectionpool.HTTPConnectionPool object at 0x7fbc601e8b10, Connection to localhost timed out. (connect timeout=5)) [13:17:34] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [13:18:11] _joe_ there is no redis config for gmond on Jessie, found the issue [13:18:44] <_joe_> shit, ocg is down [13:20:44] <_joe_> moritzm: are you rebooting rdb1002? 
[13:21:32] no, only made the redis instances, can also connect to the system [13:24:21] (redis restarts I meant to say) [13:31:24] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [13:31:39] <_joe_> !log restarting ocg on ocg1001, got stuck after redis restart on rdb1002 [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:52] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 1235 msg: ocg_render_job_queue 0 msg [13:32:12] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [13:32:27] --^ \o/ [13:35:13] rdb10XX are the hosts that hold the redis queue, used by ocg and MW jobrunners right? [13:46:37] (03PS1) 10Hoo man: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272463 [13:47:29] (03CR) 10Hoo man: [C: 032] Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272463 (owner: 10Hoo man) [13:47:56] (03Merged) 10jenkins-bot: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272463 (owner: 10Hoo man) [13:50:10] !log hoo@tin Synchronized wmf-config/Wikibase.php: Bump $wgCacheEpoch on Wikidata after Property conversions (duration: 01m 39s) [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:52] PROBLEM - NTP on logstash1006 is CRITICAL: NTP CRITICAL: No response from NTP server [13:54:12] PROBLEM - NTP on logstash1005 is CRITICAL: NTP CRITICAL: No response from NTP server [13:58:53] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [14:02:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 62.96% of data above the critical threshold [5000000.0] [14:04:33] RECOVERY - NTP on logstash1006 is OK: NTP OK: Offset 
-0.00110912323 secs [14:05:02] RECOVERY - NTP on logstash1005 is OK: NTP OK: Offset 0.0002355575562 secs [14:06:12] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [14:16:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:21:11] _joe_, ori: https://github.com/wikimedia/operations-puppet/commit/6c23f3b2848be376275ac4c7604a78334f3da296#diff-1619526e7c0925eaa47ff9e26b5bbe4d [14:21:23] --^ Remove redis::ganglia; incompatible with multi-instance [14:22:50] <_joe_> lol [14:22:52] <_joe_> ok [14:27:28] (03PS1) 10Ladsgroup: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 [14:31:32] (03PS2) 10Ladsgroup: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) [14:32:27] 6Operations, 10Traffic, 13Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2051007 (10MoritzMuehlenhoff) This was fixed upstream at http://hg.nginx.org/nginx/rev/062c189fee20 and I built a 1.9.4+wmf2 package on copper with that patch (not copied to... [14:32:48] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2051009 (10elukey) > One reason I failed to notice this earlier is that the redis gmond plugin appears to be broken on Jessie @Ori: it seems that Ga... [14:46:22] (03CR) 10Hashar: Deploy ORES extension to Wikipedia project in beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [14:48:47] so i get paged, but it doesnt even show here [14:49:03] icinga config is broken [14:49:43] <_joe_> jynus: oh? 
[14:49:48] <_joe_> I'll take a look [14:50:39] especially because I specifically asked not to page from inactive datacenter [14:51:00] someone changed the contact address on the general config [14:51:41] to be dba even if it is not critical [14:54:40] <_joe_> jynus: oh so it's broken as "wrongly chosen", not "syntactically broken" [14:54:50] yes [14:55:05] semantically broken [15:00:54] 6Operations, 10Analytics, 6Services, 10scap, 3Scap3: Deploy AQS with scap3 - https://phabricator.wikimedia.org/T114999#2051034 (10Ottomata) [15:01:55] (03PS9) 10Ottomata: AQS: Separate AQS off of RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/271687 (https://phabricator.wikimedia.org/T126294) (owner: 10Mobrovac) [15:03:23] (03CR) 10Ladsgroup: Deploy ORES extension to Wikipedia project in beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [15:11:07] so the problem is it says "contact_groups dba" for non critical alerts [15:11:37] (03CR) 10Jo-Jo Eumerus: "https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28proposals%29&diff=prev&oldid=706278912 indicates the RfC is now close" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: 10MarcoAurelio) [15:11:46] and it should be admins? [15:11:53] with no sms? [15:12:16] no, that is ok [15:12:29] dba should not page by default [15:16:12] (03PS1) 10Ema: New WMF version: 4.1.1-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/272468 (https://phabricator.wikimedia.org/T124279) [15:16:33] so not critical => admins, critical => admins,sms,admins ? [15:20:59] (03CR) 10MarcoAurelio: "I've asked TheDJ on the bug, although most voters say they want the extension removed/disabled." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: 10MarcoAurelio) [15:21:06] (03PS2) 10MarcoAurelio: Removing Gather from enwiki and miscellaneous cosmetic changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) [15:24:13] (03PS2) 10MarcoAurelio: Throttle exception for or.wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271974 (https://phabricator.wikimedia.org/T127599) [15:24:28] (03PS2) 10MarcoAurelio: Enabling translation notifications at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271496 (https://phabricator.wikimedia.org/T126901) [15:24:36] (03PS2) 10MarcoAurelio: Enable DynamicPageList for Wikimedia Norge chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271252 (https://phabricator.wikimedia.org/T127161) [15:26:29] !log stopping puppet on aqs* for scap deployment [15:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:52] (03CR) 10Ottomata: [C: 032] AQS: Separate AQS off of RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/271687 (https://phabricator.wikimedia.org/T126294) (owner: 10Mobrovac) [15:32:37] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2051184 (10BBlack) 24 hour log run, with pre-filtering for internal monitoring requests and definite random crawler/junk/noise traffic: * 206x (avg 8.... [15:37:34] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2051211 (10BBlack) Does anyone have a handle on what the random low-traffic labs usages are at the bottom of the list above? As for `parsoid-prod.wmf... 
[15:38:58] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2051212 (10BBlack) I should note, my inclination is to just shut this down today so that we can move on with other related/blocked work. We're past o... [15:39:08] (03PS1) 10Volans: Repool of es2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 [15:39:53] !log stopping restbase on aqs1001 [15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:58] (03CR) 10Jcrespo: "Are you going to do testing on it? -1 if yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 (owner: 10Volans) [15:43:24] (03PS10) 10Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [15:43:55] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2051238 (10BBlack) Another way to think of the stats: ignoring `parsoid-prod.wmflabs.org` crazy proxy thing, and ignoring this one oddball Russian IP,... [15:43:56] (03PS1) 10Jcrespo: Non critical DBA pages should not send an sms to the DBA group [puppet] - 10https://gerrit.wikimedia.org/r/272478 [15:44:14] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2051239 (10Jdforrester-WMF) Do it. [15:44:25] (03PS1) 10MarcoAurelio: Enable signature button at NS:102 for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) [15:44:43] (03CR) 10Hashar: "Changes looks fine to me. The key question being what ORES service is beta cluster going to hit? 
Would it be production (eek) or should w" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [15:46:16] (03CR) 10jenkins-bot: [V: 04-1] Non critical DBA pages should not send an sms to the DBA group [puppet] - 10https://gerrit.wikimedia.org/r/272478 (owner: 10Jcrespo) [15:50:54] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [15:51:21] 6Operations, 10RESTBase, 6Services, 10Traffic, 3Mobile-Content-Service: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2051279 (10GWicke) > it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/f... [15:53:02] PROBLEM - RAID on db2012 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:55:35] (03CR) 10Volans: "Given that es2001 has a different hardware from the new es201*, I thought that if possible would be better to test machines with same hard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 (owner: 10Volans) [15:55:46] (03PS2) 10Jcrespo: Non critical DBA pages should not send an sms to the DBA group [puppet] - 10https://gerrit.wikimedia.org/r/272478 [15:56:22] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. 
[15:56:44] (03PS2) 10Ema: Reduce vcl_error redundancy [puppet] - 10https://gerrit.wikimedia.org/r/271961 [15:56:53] (03CR) 10Jcrespo: [C: 031] Repool of es2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 (owner: 10Volans) [15:57:16] 6Operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2051297 (10Papaul) 5Open>3Resolved a:5Papaul>3jcrespo Disks replacement complete [15:57:40] 6Operations, 10RESTBase, 6Services, 10Traffic, 3Mobile-Content-Service: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2051304 (10BBlack) >>! In T127387#2051279, @GWicke wrote: >> it's clear that [?&#] aren't in the set because MW still... [15:58:31] (03CR) 10Ema: [C: 032 V: 032] Reduce vcl_error redundancy [puppet] - 10https://gerrit.wikimedia.org/r/271961 (owner: 10Ema) [15:58:34] (03CR) 10Volans: [C: 032] Repool of es2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 (owner: 10Volans) [15:58:59] (03Merged) 10jenkins-bot: Repool of es2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272474 (owner: 10Volans) [16:01:35] Ah welcome jouncebot, you be late to party [16:01:38] I gots the swat. [16:01:42] !log repooled es2001 [ T127330 ] [16:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:47] no SWAT today? [16:01:54] Jouncebot was late :p [16:01:59] !log volans@tin Synchronized wmf-config/db-codfw.php: Repool of es2001 (duration: 01m 39s) [16:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:10] bblack, Dereckson: ping for swat [16:02:14] okay, I have some patches logged, so I'll be there [16:02:18] mafk is already here :) [16:02:18] *here [16:02:38] No yurik [16:02:44] Ok, we'll start with yours mafk. 
[16:02:51] fine [16:03:20] if possible, nowikimedia the last one [16:03:28] requires writing stuff on the wiki, etc [16:03:56] okie dokie [16:04:13] (03CR) 10Chad: [C: 032] Throttle exception for or.wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271974 (https://phabricator.wikimedia.org/T127599) (owner: 10MarcoAurelio) [16:04:28] ostriches: hi [16:04:29] (03CR) 10Chad: [C: 032] Enabling translation notifications at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271496 (https://phabricator.wikimedia.org/T126901) (owner: 10MarcoAurelio) [16:04:37] bblack: You'll be next :) [16:04:39] (03Merged) 10jenkins-bot: Throttle exception for or.wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271974 (https://phabricator.wikimedia.org/T127599) (owner: 10MarcoAurelio) [16:04:40] Hi. [16:04:45] And then Dereckson :) [16:05:17] (03Merged) 10jenkins-bot: Enabling translation notifications at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271496 (https://phabricator.wikimedia.org/T126901) (owner: 10MarcoAurelio) [16:07:01] !log demon@tin Synchronized wmf-config/throttle.php: throttle exemption for or.wikipedia workshop (duration: 01m 32s) [16:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:17] (03PS1) 10EBernhardson: Cache more like queries for 24 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272483 (https://phabricator.wikimedia.org/T124216) [16:07:37] (03PS2) 10Jforrester: VisualEditor: Switch to Single Edit Tab mode on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) [16:07:47] (03CR) 10Jforrester: "Planned for 24 hours' time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) (owner: 10Jforrester) [16:07:54] (03PS11) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [16:08:36] (03PS1) 10BBlack: cache_parsoid: remove public DNS [dns] - 10https://gerrit.wikimedia.org/r/272484 (https://phabricator.wikimedia.org/T110474) [16:09:12] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Translation Notifications on commons (duration: 01m 31s) [16:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:53] mafk: Ok, your first 2 are done. [16:10:00] I'll test commons [16:10:05] since the first one is un-testable [16:10:15] Okie dokie [16:10:27] (03CR) 10Chad: [C: 032] Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [16:11:04] (03Merged) 10jenkins-bot: Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [16:12:09] mafk: you can test no.wikimedia DynamicPageList too [16:12:12] https://www.mediawiki.org/wiki/Extension:DynamicPageList_(Wikimedia)#Use [16:12:36] 6Operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2051369 (10Papaul) a:3jcrespo Disk replacement complete. [16:12:47] ostriches: stuff at commons seems to work [16:12:52] If you add a template on a sandbox page, and you preview it, you'll see if it's parsed. 
[16:13:00] Dereckson: it's not merged yet [16:13:00] DPL for nowikimedia I'll do in a second [16:13:12] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: default referrer policy (duration: 01m 31s) [16:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:19] bblack: And you're live. ^^^ [16:13:39] (03CR) 10Chad: [C: 032] Enable DynamicPageList for Wikimedia Norge chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271252 (https://phabricator.wikimedia.org/T127161) (owner: 10MarcoAurelio) [16:13:48] ostriches: confirmed on a forced-cache-miss on enwiki article: [16:13:57] k [16:13:58] whoop whoop \o/ [16:14:10] oh? what's that? [16:14:24] (03Merged) 10jenkins-bot: Enable DynamicPageList for Wikimedia Norge chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271252 (https://phabricator.wikimedia.org/T127161) (owner: 10MarcoAurelio) [16:14:36] paravoid: https://phabricator.wikimedia.org/T87276 [16:15:03] basically due to https, external sites we link to like DOI, etc weren't getting referer [16:15:14] now they get an origin-only referer (just the site name, not the article URL) [16:15:34] yeah I remember the discussion [16:15:53] I thought it was a header though, not meta [16:16:14] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: dpl on nowikimedia (duration: 01m 29s) [16:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:25] mafk: And there's your last one ^^ [16:16:31] also, did they fix the spelling or is it a typo on our side? 
:) [16:16:32] testng [16:16:35] *testing [16:16:42] (03PS3) 10Ladsgroup: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 [16:16:43] I guess I'll read the W3C spec [16:16:49] works [16:16:50] 6Operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2051391 (10jcrespo) 5Resolved>3Open Are you sure? Icinga says disk degraded. [16:17:00] (03CR) 10Chad: [C: 032] Enable Wikilove on az.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) (owner: 10Dereckson) [16:17:18] heh [16:17:32] there is a header indeed [16:18:08] the meta name is spelled right though [16:18:12] mafk: And I see where the RFC re: your last change is now closed. I'll follow up on the task for it, I don't see why we can't move forward with that tomorrow or the day after now. [16:18:16] yeah, looks like it [16:18:17] (03Merged) 10jenkins-bot: Enable Wikilove on az.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) (owner: 10Dereckson) [16:18:22] "Note: The header name does not share the HTTP Referer header’s misspelling." [16:18:38] https://www.w3.org/TR/referrer-policy/#referrer-policy-delivery-meta [16:18:56] ostriches: yup, thanks. Appreciated. [16:19:31] If you'd like to take it I'm fine. 
I won't be avalaible tomorrow for any SWAT window I think [16:19:31] (03PS4) 10Hashar: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (owner: 10Ladsgroup) [16:19:42] (03CR) 10Hashar: [C: 032] Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (owner: 10Ladsgroup) [16:19:55] (03PS5) 10Hashar: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [16:20:13] (03CR) 10Hashar: "Forgot about linking to T127661" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [16:20:27] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: azwiki wants some wikilove too (duration: 01m 29s) [16:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:54] (03CR) 10Hashar: [C: 032] Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [16:21:01] ostriches: Good reason for deploying :D [16:21:02] Dereckson: And you're live now too with azwiki ^^^ [16:21:22] ah men sorry [16:21:28] I have sneak in a beta cluster change :(((( [16:21:33] 6Operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2051407 (10jcrespo) Ah, it is complaining about the rebuild, let's wait until it is completed to mark it as resolved. [16:21:42] It's all -labs files, I'm not syncing those :p [16:21:47] https://gerrit.wikimedia.org/r/272466 f1e7399d2 [16:21:51] Seems there is a JS issue [16:22:05] 6Operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2051408 (10jcrespo) However, the host ssh key failed, why could that be? 
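Annotation on the 16:15:14 message about the origin-only referer: the sketch below illustrates what "just the site name, not the article URL" amounts to for an outgoing link. It is a minimal, standard-library-only illustration; the function name `origin_only_referrer` is hypothetical and is not part of MediaWiki or of the deployed change, which only sets a policy that browsers then apply themselves.

```python
from urllib.parse import urlsplit


def origin_only_referrer(page_url: str) -> str:
    """Reduce a full page URL to its origin (scheme, host, optional port).

    Hypothetical helper for illustration: path, query string, fragment and
    any credentials are dropped, so only the site name remains.
    """
    parts = urlsplit(page_url)
    host = parts.hostname or ""
    if parts.port is not None:
        # Keep an explicit port, since it is part of the origin.
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}/"


if __name__ == "__main__":
    print(origin_only_referrer("https://en.wikipedia.org/wiki/Example?action=history"))
```

Under this sketch an external site would learn that the visitor came from the wiki, but not which article they were reading.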
[16:22:15] (03Merged) 10jenkins-bot: Deploy ORES extension to Wikipedia project in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272466 (https://phabricator.wikimedia.org/T127661) (owner: 10Ladsgroup) [16:22:26] ostriches: yeah no need to sync them on prod. Don't be surprised about it landing on the next pull though [16:22:29] sorry bout that [16:22:57] Works with debug=true [16:22:58] I'll sync them for completion so co-masters are up2date [16:23:40] Dereckson: Hmm.... [16:23:53] (03PS2) 10BBlack: codfw: move cache_parsoid nodes to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/272318 (https://phabricator.wikimedia.org/T110472) [16:24:53] So, the extension is deployed, works if we append ?debug=true to a user page, but not otherwise. No relevant error on the JS console. [16:25:05] Sounds like a caching issue. [16:25:16] static.php maybe [16:25:29] !log demon@tin Synchronized README: no-op for co-master sync (duration: 01m 29s) [16:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:41] Ah, now works. [16:26:05] ostriches: tested, works fine. [16:28:30] 6Operations, 10RESTBase, 6Services, 10Traffic, 3Mobile-Content-Service: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2051489 (10GWicke) > Well for that matter, since parsing stops on the first unescape ? for the query part, and the fra... [16:28:39] Dereckson: Ok yay :) [16:28:45] Thanks for the deploy. [16:28:49] All I had to do was wait for the magic to happen :p [16:28:50] yw [16:29:23] PROBLEM - Varnish HTTP parsoid-backend - port 3128 on cp2026 is CRITICAL: Connection refused [16:29:35] Still no yurik, so bumping his from today's swat [16:29:42] PROBLEM - Varnish HTTP parsoid-frontend - port 80 on cp2026 is CRITICAL: Connection refused [16:29:46] lots of codfw errors, is that planned? 
[16:29:52] PROBLEM - configured eth on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:52] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:02] PROBLEM - traffic-pool service on cp2026 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [16:30:12] PROBLEM - dhclient process on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:12] PROBLEM - Varnish HTTP parsoid-backend - port 3128 on cp2022 is CRITICAL: Connection refused [16:30:18] ignore those for cp202[26], sorry! [16:30:22] PROBLEM - traffic-pool service on cp2022 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [16:30:26] ok, I was getting scared [16:30:31] PROBLEM - Disk space on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:31] PROBLEM - puppet last run on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:33] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:42] PROBLEM - DPKG on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:52] PROBLEM - salt-minion processes on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:53] PROBLEM - Check size of conntrack table on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:01] PROBLEM - Varnish HTTP parsoid-frontend - port 80 on cp2022 is CRITICAL: Connection refused [16:31:02] PROBLEM - HTTPS on cp2022 is CRITICAL: Return code of 255 is out of bounds [16:31:11] PROBLEM - HTTPS on cp2026 is CRITICAL: Return code of 255 is out of bounds [16:31:12] PROBLEM - RAID on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:22] (03PS5) 10Volans: mariadb: Add parallel gzip package [puppet] - 10https://gerrit.wikimedia.org/r/271691 (https://phabricator.wikimedia.org/T127385) [16:31:26] mhh krypton seems just hung for me? 
"Entering interactive session." [16:32:21] (03CR) 10BBlack: [C: 032] codfw: move cache_parsoid nodes to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/272318 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [16:34:38] !log reboot krypton.eqiad.wmnet, no answer to gnt-instance console / no ssh [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:55] 6Operations, 6Services, 10scap, 3Scap3: Deploy AQS with scap3 - https://phabricator.wikimedia.org/T114999#2051520 (10greg) [16:35:05] 6Operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2051523 (10greg) [16:35:31] RECOVERY - configured eth on krypton is OK: OK - interfaces up [16:35:43] RECOVERY - dhclient process on krypton is OK: PROCS OK: 0 processes with command name dhclient [16:36:02] RECOVERY - Disk space on krypton is OK: DISK OK [16:36:02] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures [16:36:03] RECOVERY - grafana-admin.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 534 bytes in 0.006 second response time [16:36:21] RECOVERY - DPKG on krypton is OK: All packages OK [16:36:23] RECOVERY - salt-minion processes on krypton is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:36:23] RECOVERY - Check size of conntrack table on krypton is OK: OK: nf_conntrack is 0 % full [16:36:41] RECOVERY - Varnish HTTP parsoid-frontend - port 80 on cp2022 is OK: HTTP OK: HTTP/1.1 200 OK - 488 bytes in 0.073 second response time [16:36:42] RECOVERY - RAID on krypton is OK: OK: no RAID installed [16:36:51] RECOVERY - HTTPS on cp2022 is OK: SSLXNN OK - 36 OK [16:37:12] RECOVERY - Varnish HTTP parsoid-frontend - port 80 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 488 bytes in 0.072 second response time [16:37:13] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 522 bytes in 0.005 second response time [16:37:32] RECOVERY - traffic-pool service on cp2026 is OK: OK - traffic-pool is active [16:37:42] RECOVERY - Varnish HTTP parsoid-backend - port 3128 on cp2022 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.073 second response time [16:37:52] RECOVERY - traffic-pool service on cp2022 is OK: OK - traffic-pool is active [16:38:41] RECOVERY - HTTPS on cp2026 is OK: SSLXNN OK - 36 OK [16:38:43] RECOVERY - Varnish HTTP parsoid-backend - port 3128 on cp2026 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.073 second response time [16:39:55] 6Operations, 10Ops-Access-Requests, 10Analytics: Add Analytics engineers to `deploy-service` group - https://phabricator.wikimedia.org/T127720#2051551 (10Ottomata) [16:40:34] 6Operations, 10Ops-Access-Requests, 10Analytics: Add Analytics engineers to `deploy-service` group - https://phabricator.wikimedia.org/T127720#2051582 (10Ottomata) [16:40:58] 6Operations, 10Ops-Access-Requests, 10Analytics: Add Analytics engineers to deploy-service group - https://phabricator.wikimedia.org/T127720#2051551 (10Ottomata) [16:43:13] ostriches: done w/ swat? [16:45:56] (03PS6) 10Volans: mariadb: Add parallel gzip package [puppet] - 10https://gerrit.wikimedia.org/r/271691 (https://phabricator.wikimedia.org/T127385) [16:46:58] ebernhardson: I am [16:48:12] ok, i'm going to ship one more patch i added (10min) late ... it's a single config variable [16:48:29] 6Operations, 6Services, 10scap, 3Scap3: Deploy AQS with scap3 - https://phabricator.wikimedia.org/T114999#2051619 (10greg) I reopened this as it is a tracking task for, uh, deploying aqs with scap3 :) The one that it was merged with was not about that. 
[16:48:30] (03CR) 10Volans: [C: 032] mariadb: Add parallel gzip package [puppet] - 10https://gerrit.wikimedia.org/r/271691 (https://phabricator.wikimedia.org/T127385) (owner: 10Volans) [16:48:52] (03CR) 10EBernhardson: [C: 032] Cache more like queries for 24 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272483 (https://phabricator.wikimedia.org/T124216) (owner: 10EBernhardson) [16:49:43] (03Merged) 10jenkins-bot: Cache more like queries for 24 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272483 (https://phabricator.wikimedia.org/T124216) (owner: 10EBernhardson) [16:50:13] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2051627 (10ssastry) @Danny_B How many pages... [16:52:00] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: Cache morelike search queries for 24h (duration: 01m 34s) [16:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:27] cool, 1.2s -> .3s. works :) [16:52:36] <_joe_> ebernhardson: wow really? [16:52:38] <_joe_> :) [16:52:43] <_joe_> that's great! [16:52:47] _joe_: well it's caching, thats what caching is supposed to do :) [16:53:03] <_joe_> ebernhardson: indeed [16:53:10] will have to analyze the stats that are being put into graphite to see if it hits the 80% hit rate we are guessing at from analyzing past logs [16:53:24] <_joe_> it will be interesting to see the hit rate, yes [16:56:22] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:02:16] bummer, is ^d gone? 
[17:03:07] (03CR) 10Krinkle: [C: 031] gdash: decom [puppet] - 10https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh) [17:04:45] 6Operations, 10Ops-Access-Requests, 10Analytics: Add Analytics engineers to deploy-service group - https://phabricator.wikimedia.org/T127720#2051551 (10Krenair) Which users count as 'analytics engineers'? everyone in analytics-admins? [17:05:13] ostriches, around? [17:06:21] bblack, reg https://phabricator.wikimedia.org/T125841#2047847 .. so, you are saying you can consistently reproduce this on https://sk.wiktionary.org/wiki/duplikova%C5%A5 ? [17:07:57] yurik: no, meeting. [17:11:14] (03PS1) 10Muehlenhoff: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/272495 [17:11:17] subbu: yes, consistently, when querying an mw1xxx server directly [17:12:21] bblack, i am lazy right now .. but is there a curl command that you can give me to hit this from an external ip? or would i need to be on the cluster to do this? [17:16:24] (03PS1) 10Eevans: alert only on external (parity w/ dashboards) [puppet] - 10https://gerrit.wikimedia.org/r/272498 [17:16:45] subbu: you'd have to be inside WMF networks somewhere, yeah [17:16:56] although you could do a cache-busting query arg too [17:17:13] e.g. https://sk.wiktionary.org/wiki/duplikova%C5%A5?asdf=laksjdglakjsdglakj [17:17:26] the point of bypassing varnish is just to confirm MW is emitting that, not just stale varnish cache [17:17:52] ah right. that does it. [17:18:02] thanks. [17:18:32] 6Operations, 10Ops-Access-Requests, 10Analytics: Add Analytics engineers to deploy-service group - https://phabricator.wikimedia.org/T127720#2051753 (10Ottomata) Hm, on second look, we do already have an aqs-admins group. Perhaps we can reuse that group for deployment. Looking into how this would work. [17:20:16] (03CR) 10BryanDavis: [C: 031] "I'm not sure how this is functionally different from the $? 
test but the change at least makes the code read a bit nicer." [puppet] - 10https://gerrit.wikimedia.org/r/264692 (owner: 10Tim Landscheidt) [17:20:33] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: puppet fail [17:21:32] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2051770 (10ssastry) After looking at T125841... [17:22:27] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2051785 (10PeterBowman) I'm seeing this now... [17:22:53] greg-g: Can I quickly push a backport? [17:23:41] Fix for T127462 [17:26:00] on mediawiki.org I've this behavior: "Could not retrieve notifications. Please try again. (Error [62e0d64d] Exception Caught: No agent associated with notification with id '52564' of type 'edit-thank')" [17:26:15] and https://www.mediawiki.org/wiki/Special:Notifications : Exception encountered, of type "DomainException" [17:37:03] 6Operations, 6Collaboration-Team-Backlog, 10Notifications: DomainException on Special:Notifications on mediawiki.org - https://phabricator.wikimedia.org/T127728#2051854 (10Dereckson) [17:37:57] A stacktrace would be useful to fill the issue correctly ^ [17:38:08] one sec [17:39:17] 6Operations, 6Collaboration-Team-Backlog, 10Notifications: DomainException on Special:Notifications on mediawiki.org - https://phabricator.wikimedia.org/T127728#2051878 (10hoo) [17:39:35] Thanks. [17:41:03] hoo: btw, sure, that's fine (was in a meeting) [17:41:15] Great :) [17:48:19] elukey: hahaha, that's great. 
So I scream bloody murder about the Ganglia plugin not working on Jessie and it turns out it's because I quietly removed it three months ago. [17:48:33] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:48:50] the lesson is to never trust your former self! [17:52:43] ori: morning! [17:52:52] (03CR) 10GWicke: [C: 031] alert only on external (parity w/ dashboards) [puppet] - 10https://gerrit.wikimedia.org/r/272498 (owner: 10Eevans) [17:53:28] ori: why was the ganglia support removed?? I didn't fully get the commit message :( [17:55:54] (03PS1) 10BBlack: restore debdeploy hieradata for codfw parsoid [puppet] - 10https://gerrit.wikimedia.org/r/272505 [17:55:55] elukey: because the ganglia plugin is predicated on the assumption of there being a single, canonical instance on port 6379, and the move to redis::instance violated that assumption: there may now be arbitrarily many redis instances on a host. [17:57:30] <_joe_> ori: we should make the plugin aware, and move to diamond... [17:57:59] i did updated the diamond collector [17:58:02] *update [17:58:04] ori: ahhhh right now it makes sense! [17:58:07] it is multi-instance compatible [17:58:53] (03CR) 10BBlack: [C: 032] restore debdeploy hieradata for codfw parsoid [puppet] - 10https://gerrit.wikimedia.org/r/272505 (owner: 10BBlack) [17:59:02] <_joe_> ori: so let's remove thos graphs from ganglia? [17:59:15] ori: can we use the actual plugin until we have one single instance running on 6379? And possibly work on diamond/gmond before adding more processes? [17:59:23] RECOVERY - DPKG on restbase-test2002 is OK: All packages OK [17:59:34] it is not ideal but we'd have metrics in the meantime [18:00:58] elukey: the diamond collector is multi-instance compatible; see manifests/role/jobqueue_redis.pp [18:01:31] ori: yep sorry I lost your comment between the lines :) [18:02:00] bbiab, heading to the office. 
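Annotation on the 17:16:56 suggestion of a "cache-busting query arg": the idea is to append a throwaway query parameter so the request is unlikely to match an already cached object and is answered by the backend instead. A minimal sketch, assuming only the Python standard library; the function name `add_cache_buster` and the parameter name `cachebust` are hypothetical (the chat used an arbitrary `asdf=...` argument), and whether a given cache really treats such URLs as distinct depends on its configuration.

```python
import secrets
from urllib.parse import urlsplit, urlunsplit


def add_cache_buster(url: str, param: str = "cachebust") -> str:
    """Append a random query parameter so the URL differs on every call.

    Hypothetical helper: the existing path and query are left untouched
    (no re-encoding), and a random hex token is added as one more argument.
    """
    parts = urlsplit(url)
    token = secrets.token_hex(8)  # 16 random hex characters
    separator = "&" if parts.query else ""
    new_query = f"{parts.query}{separator}{param}={token}"
    return urlunsplit(parts._replace(query=new_query))


if __name__ == "__main__":
    print(add_cache_buster("https://sk.wiktionary.org/wiki/duplikova%C5%A5"))
```

As noted in the chat, this only helps confirm whether the backend itself emits the bad output; it is not a substitute for querying an application server directly from inside the network.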
[18:03:32] !log hoo@tin Synchronized php-1.27.0-wmf.13/extensions/Wikidata: Reset entity access counts between parser runs (duration: 03m 01s) [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:04] !log corrected dpkg installation status for cassandra on restbase-test200[12] [18:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:11] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2052063 (10greg) >>! In T126700#2047271, @Krinkle wrote: >>>! In T126700#2047117, @greg wrote: >> Update: we are tentatively planning to deploy wmf.1... [18:05:09] (03PS2) 10Krinkle: Set $wgResourceBasePath to "/w" for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271709 (https://phabricator.wikimedia.org/T99096) [18:05:15] (03CR) 10Krinkle: [C: 032] Set $wgResourceBasePath to "/w" for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271709 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:05:22] RECOVERY - DPKG on restbase-test2001 is OK: All packages OK [18:05:40] (03CR) 10Krinkle: [C: 04-1] "Discussing strategy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271709 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:06:35] 6Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2052074 (10Dzahn) @scfc regarding the remaining issues, i think we should just put "lint-ignore" lines around the tricky ones, since it's such a small percentage r... 
[18:13:12] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail [18:14:44] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:17:46] (03PS2) 10Dzahn: toollabs/mailrelay: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/271740 [18:18:27] ok milimetric, continuing...i'm going to try to make the aqs-admins group able to deploy... [18:18:37] (03CR) 10Dzahn: [C: 032] "thanks for testing, Tim" [puppet] - 10https://gerrit.wikimedia.org/r/271740 (owner: 10Dzahn) [18:19:25] ottomata: ok, but Petr might have found the problem in the meantime [18:19:31] oh? [18:19:34] wanna hang out? [18:19:36] the restbase problem? [18:19:37] sure [18:20:59] (03PS7) 10Dzahn: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) [18:21:23] (03CR) 10Dzahn: [C: 032] "approved in meeting https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2016-02-22#Access_Requests" [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [18:25:10] (03PS1) 10Milimetric: Fix bad module reference in AQS [puppet] - 10https://gerrit.wikimedia.org/r/272511 [18:26:15] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2052198 (10Krinkle) Just looking into load.php request and noticed a regression there as well. 1/4th of the backend time (or 170ms) is being spent i... 
[18:26:25] (03CR) 10Ottomata: [C: 032] Fix bad module reference in AQS [puppet] - 10https://gerrit.wikimedia.org/r/272511 (owner: 10Milimetric) [18:28:08] (03PS1) 10Dzahn: admin: add bast-only group for ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/272513 (https://phabricator.wikimedia.org/T126283) [18:29:07] (03CR) 10Krinkle: [C: 032] Set $wgResourceBasePath to "/w" for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271709 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:29:57] (03Merged) 10jenkins-bot: Set $wgResourceBasePath to "/w" for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271709 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:32:16] !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: T99096: Enable wmfstatic for small wikis (duration: 01m 43s) [18:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:24] (03PS2) 10Dzahn: admin: add bast-only group for ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/272513 (https://phabricator.wikimedia.org/T126283) [18:33:05] _joe_: am trying to find docs, but, does hiera work with defines? [18:33:21] can I set a hiera variable ina role scope, that will fill in a define's parameters? 
[18:33:34] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:33:49] <_joe_> ottomata: nope [18:33:57] hm [18:34:01] ok [18:34:16] <_joe_> ottomata: you cannot have autolookups, I mean [18:34:21] <_joe_> but you can still do [18:34:24] (03PS3) 10Dzahn: admin: add bast-only group for ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/272513 (https://phabricator.wikimedia.org/T126283) [18:34:39] <_joe_> class foo { $bar: file { $bar:...}{ [18:34:50] (03CR) 10Dzahn: [C: 032] admin: add bast-only group for ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/272513 (https://phabricator.wikimedia.org/T126283) (owner: 10Dzahn) [18:34:51] <_joe_> class foo($bar) { file { $bar:...}{ [18:34:53] <_joe_> sorry [18:35:12] <_joe_> and then autolookup $::foo::bar with hiera [18:36:49] right, parameters for the lass [18:36:50] class [18:36:57] i'll do that then, the role class i'm looking at already has params [18:37:02] thanks [18:38:19] 6Operations, 10Ops-Access-Requests, 6Services, 13Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2052267 (10Dzahn) approved in meeting (https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2016-02-2... [18:39:13] hmm, ok _joe_ uhhh, the params i'm overriding are in role::deployment::services [18:39:18] which is included in role::deployment::server [18:39:23] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:39:33] should I then put the hiera in role/common/deployment/server.yaml [18:39:38] but set the variable as [18:39:46] role::deployment::services::keyholder_group [18:39:47] ? 
[18:40:08] 6Operations, 10Ops-Access-Requests, 6Services, 13Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2052274 (10Dzahn) 5Open>3Resolved [18:40:13] or, should I change the include in role::deployment::server to use the role keyword inside of that class [18:40:29] and put the hiera in role/common/deployment/services as just keyholder_group: ... [18:40:30] ? [18:45:36] # == Class role::wikimetrics # This is the production wikimetrics role <-- does anyone know more about this? this is in labs even though it's called prod, right? [18:45:46] and # This is the staging specific wikimetrics role is also labs? [18:46:47] (03PS1) 10Ottomata: Use aqs-admins group for AQS deployment via deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/272516 [18:47:01] "# Production wikimetrics instance (in labs) needs a mysql client" hrmmmm [18:47:06] mutante: maybe, i wrote it years ago, we were originally going to productionize it, but then it was decided that it could and should stay in labs [18:47:48] ottomata: ah, ok! let me try to find if any instances use it, with that "watroles" tool [18:48:04] as usual i just want to move stuff around and not break it [18:48:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, but please expand dashboard links, easier to understand what's going on (e.g. the shortened version points to grafana-admin which is" [puppet] - 10https://gerrit.wikimedia.org/r/272498 (owner: 10Eevans) [18:48:27] moving stuff out of /manifests/role/ that is [18:48:30] _joe_: , like this: https://gerrit.wikimedia.org/r/#/c/272516/1/hieradata/role/common/deployment/server.yaml ? [18:48:36] aye [18:48:49] mutante: did you notice that there is no more manifests/role/analytics* ?!?! are you proud of me?? :D [18:50:34] ottomata: that's awesoem :) [18:50:38] yes [18:51:09] (03CR) 10Ottomata: "Will hiera role work this way?! Let's find out..." 
[puppet] - 10https://gerrit.wikimedia.org/r/272516 (owner: 10Ottomata) [18:51:10] re: wikimetrics.. it says an instance called "wikimetrics-staging1" has the production role, an instance called ""wikimetrics-staging" (without 1) has the staging role [18:51:16] (03CR) 10Ottomata: [C: 032] "Will hiera role work this way?! Let's find out..." [puppet] - 10https://gerrit.wikimedia.org/r/272516 (owner: 10Ottomata) [18:51:54] ottomata: I like to think you merged to the tune of 'will it blend' commercials [18:52:01] (03PS1) 10BBlack: Bugfix: wrong value format for wgReferrerPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) [18:52:02] heheh [18:54:59] (03CR) 10Krinkle: "It seems Chrome requires the value without a dash in crossorigin? origin-when-crossorigin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [18:55:35] (03PS1) 10Ottomata: Add otto to aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/272518 [18:56:35] (03CR) 10Dzahn: "turns out the "production" role is on _staging1_, while the "staging" role is on _staging_. :p" [puppet] - 10https://gerrit.wikimedia.org/r/271737 (owner: 10Dzahn) [18:57:56] (03CR) 10Ottomata: [C: 032] Add otto to aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/272518 (owner: 10Ottomata) [19:01:50] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2052375 (10Legoktm) a:3Legoktm Grumble gru... [19:22:48] (03PS2) 10Dzahn: deactivate voyagewiki.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/254058 [19:23:11] bblack: Did you want to go ahead and land that fix for origins? 
[19:24:21] (03CR) 10Dzahn: [C: 032] "traffic absolutely minimal, not announced anywhere, the only mention is it was once a proposed name that lost against wikivoyage (https://" [dns] - 10https://gerrit.wikimedia.org/r/254058 (owner: 10Dzahn) [19:24:28] ostriches: not sure yet, some discussions ongoing. apparently on top of my original "totally wrong value format" bug, there's also some standards-questions about the format of the corrected value, too :/ [19:25:39] 6Operations, 10domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#2052479 (10Dzahn) voyagewiki.com/org also removing. traffic absolutely minimal, the only mention it got was during the naming process where it lost against "wikivoyage" https://meta.wikimedia.org/wiki/Wikivoyag... [19:28:04] bblack: Okie dokie [19:28:37] (03CR) 10Jforrester: [C: 031] Bugfix: wrong value format for wgReferrerPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [19:29:18] (03PS2) 10Eevans: alert only on external (parity w/ dashboards) [puppet] - 10https://gerrit.wikimedia.org/r/272498 [19:36:52] (03CR) 10Krinkle: [C: 04-1] "Per previous comment." [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [19:43:27] let's break this: "Opera Mini on Android using Special:Search" [19:44:16] rephrases that. "let's remove some cruft from DNS that is only used by 0.000272%" [19:44:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 57.69% of data above the critical threshold [5000000.0] [19:46:12] (03PS1) 10Ladsgroup: Move ORES settings to beta features part [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272526 [19:49:57] mutante: Hmm context? [19:50:27] (03CR) 10Ladsgroup: "I'm not sure if it can fix the bug." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/272526 (owner: 10Ladsgroup) [19:51:06] ostriches: context is https://phabricator.wikimedia.org/T120143 [19:51:27] ostriches: basically if i can merge https://gerrit.wikimedia.org/r/#/c/256597/ [19:51:38] or if i need more +1 from mobile [19:51:41] than i already have [19:54:04] (03CR) 10BBlack: [C: 031] delete www.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/256597 (https://phabricator.wikimedia.org/T120143) (owner: 10Dzahn) [19:54:16] FWIW! [19:57:53] mutante: kill with extreme prejudice! :p [19:58:22] :) thanks both of you, i will [19:58:25] (03PS1) 10Ottomata: Pass $service_name to service::deploy::scap in service::node [puppet] - 10https://gerrit.wikimedia.org/r/272527 [19:59:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:00:50] (03CR) 10Thcipriani: [C: 031] Pass $service_name to service::deploy::scap in service::node [puppet] - 10https://gerrit.wikimedia.org/r/272527 (owner: 10Ottomata) [20:04:02] (03CR) 10Ottomata: [C: 032] Pass $service_name to service::deploy::scap in service::node [puppet] - 10https://gerrit.wikimedia.org/r/272527 (owner: 10Ottomata) [20:12:30] (03CR) 10DarTar: "Krinkle reported that Chrome and Firefox support both spellings, but given that the one with the hyphen is the one complying with the reco" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [20:19:06] (03CR) 10BBlack: "Yeah we talked this over a bit, and it seems like the dashed one is the better option overall." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [20:28:34] (03CR) 10Tim Landscheidt: "@bd808: "$?" will never be anything but 0 at that point. 
The script has "set -e" which means that if the "git rebase" call fails, the scr" [puppet] - 10https://gerrit.wikimedia.org/r/264692 (owner: 10Tim Landscheidt) [20:29:53] (03PS1) 10Eevans: make statsd metrics prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/272536 (https://phabricator.wikimedia.org/T127747) [20:31:43] (03CR) 10BryanDavis: "> The script has "set -e"" [puppet] - 10https://gerrit.wikimedia.org/r/264692 (owner: 10Tim Landscheidt) [20:35:34] (03PS5) 10Dzahn: delete www.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/256597 (https://phabricator.wikimedia.org/T120143) [20:36:47] greg-g, would you mind if i start services a bit early today? There is a number of moving parts that I would like to be make sure work ok [20:37:06] (03CR) 10Dzahn: [C: 032] "also see comments on T120143#1846328 ff" [dns] - 10https://gerrit.wikimedia.org/r/256597 (https://phabricator.wikimedia.org/T120143) (owner: 10Dzahn) [20:38:07] yurik: sounds good [20:38:07] 6Operations, 7Mobile, 13Patch-For-Review: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#2052756 (10Dzahn) investigation conclued, no it does not need to stay around [20:38:35] 6Operations, 7Mobile, 13Patch-For-Review: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#2052758 (10Dzahn) 5Open>3Resolved [20:38:38] greg-g, ok, starting [20:39:02] 6Operations, 7Mobile: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1846295 (10Dzahn) [20:40:38] 6Operations, 7Mobile: Investigate if login.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2052761 (10Dzahn) apparently already deleted Host login.m.wikipedia.org not found: 3(NXDOMAIN) [20:41:18] 6Operations, 10Traffic, 7Mobile: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#2052763 (10Dzahn) [20:41:32] 6Operations, 
10Traffic, 7Mobile: Investigate if login.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2052764 (10Dzahn) [20:42:20] bblack: do you remember deleting that ^ ? [20:42:49] looks like it was there a couple weeks ago but not now [20:42:52] mutante: no, there's some confusion or something [20:42:56] bblack-mba:~ bblack$ host login.m.wikimedia.org [20:42:57] login.m.wikimedia.org has address 208.80.153.236 [20:43:11] oh P [20:43:19] yea, wikipedia.org or confusion [20:43:29] I don't think login.m.wikipedia.org ever existed [20:43:39] login.m.wikimedia.org is the one I've been wondering about for various reasons [20:43:52] (for https://phabricator.wikimedia.org/T111967 ) [20:44:03] and that also makes more sense since the referenced ticket is all about wikimedia.org [20:44:14] just the www.m. was actually wikiPedia [20:44:28] ok [20:45:18] 6Operations, 10Traffic, 7Mobile: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2052772 (10Dzahn) [20:45:53] 6Operations, 10Traffic, 7Mobile: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#1929602 (10Dzahn) renaming ticket, seems this should have been about wikimedia.org all this time and never wikipedia.org, unlike the linked "www.m" ticket which actually wiki... 
[20:46:36] bblack: but either way i had this: [20:46:38] @oxygen:/srv/log/webrequest# jq .uri_host /srv/log/webrequest/sampled-1000.json | grep "login.m" | wc -l [20:46:41] 0 with "login.m" [20:46:58] maybe it also needs data from hive [20:47:15] 6Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review: Allow aqs-admins to deploy via scap using deploy-service ssh eky - https://phabricator.wikimedia.org/T127720#2052793 (10Ottomata) [20:47:20] 6Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review: Allow aqs-admins to deploy via scap using deploy-service ssh key - https://phabricator.wikimedia.org/T127720#2051551 (10Ottomata) [20:47:30] 6Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review: Allow aqs-admins to deploy via scap using deploy-service ssh key - https://phabricator.wikimedia.org/T127720#2051551 (10Ottomata) 5Open>3Resolved [20:48:18] mutante: I've grepped around for it and checked other logs in the past too. I'm fairly-well convinced nobody's actually using or has used it in the past. [20:48:49] not 100% sure, but fairly sure, that it was created by someone because it seemed to make sense, but then all mobile-related uses of central login ended up using login.wikimedia.org anyways [20:49:40] bblack: i think so too, we just need the last 5% of certainty, maybe hive query is the most accurate we can do? [20:50:23] sure [20:50:38] ok [20:50:50] !log deployed and restarted kartotherian - https://gerrit.wikimedia.org/r/#/c/272425/ [20:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:09] will update that ticket later [20:51:15] thanks! 
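(Editor's sketch, not part of the log.) The `jq .uri_host ... | grep "login.m" | wc -l` check above can be approximated in Python. This is a minimal sketch that assumes the sampled log is newline-delimited JSON with a `uri_host` field, as the jq invocation suggests; the function name `count_hosts` and the sample records are hypothetical. Unlike `grep "login.m"`, where `.` matches any character, this does a plain substring match, so it is slightly stricter.

```python
import json


def count_hosts(lines, needle="login.m"):
    """Count JSON records whose uri_host contains `needle` (plain substring match)."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # only JSON objects can carry a uri_host field
        host = record.get("uri_host")
        if isinstance(host, str) and needle in host:
            count += 1
    return count


# Hypothetical sample records, for illustration only.
sample = [
    '{"uri_host": "login.m.wikimedia.org"}',
    '{"uri_host": "login.wikimedia.org"}',
    'not json',
    '{"uri_host": "en.m.wikipedia.org"}',
]
print(count_hosts(sample))
```

Against a real file, one would pass the open file object, e.g. `count_hosts(open(path))`. A count of 0 on a 1:1000 sample only shows the host is rare, which is why the discussion above turns to a Hive query for more certainty.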
[20:54:23] (03CR) 10GWicke: [C: 031] cache_parsoid: remove public DNS [dns] - 10https://gerrit.wikimedia.org/r/272484 (https://phabricator.wikimedia.org/T110474) (owner: 10BBlack) [20:55:24] (03CR) 10GWicke: [C: 031] parsoidcache: remove from LVS [puppet] - 10https://gerrit.wikimedia.org/r/272322 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [20:56:58] (03CR) 10Subramanya Sastry: [C: 031] cache_parsoid: remove public DNS [dns] - 10https://gerrit.wikimedia.org/r/272484 (https://phabricator.wikimedia.org/T110474) (owner: 10BBlack) [20:58:52] (03CR) 10Subramanya Sastry: [C: 031] parsoidcache: remove from LVS [puppet] - 10https://gerrit.wikimedia.org/r/272322 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [20:59:03] !log yurik@tin Synchronized php-1.27.0-wmf.14/extensions/Graph/lib/graph2.compiled.js: https://gerrit.wikimedia.org/r/#/c/272472/ (duration: 01m 40s) [20:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160222T2100). [21:00:18] no parsoid deploy today. 
[21:00:43] no mobileapps deploy today [21:03:45] (03PS6) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) [21:04:32] I'm deploying [21:05:45] !log yurik@tin Synchronized php-1.27.0-wmf.13/extensions/Graph/: Graph ext https://gerrit.wikimedia.org/r/#/c/272473/ (duration: 01m 43s) [21:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:11] jouncebot: next [21:10:11] In 2 hour(s) and 49 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T0000) [21:10:12] (03PS1) 10Ori.livneh: xhgui: increase memory limit to 512M [puppet] - 10https://gerrit.wikimedia.org/r/272601 [21:10:25] (03PS2) 10Ori.livneh: xhgui: increase memory limit to 512M [puppet] - 10https://gerrit.wikimedia.org/r/272601 [21:10:31] (03CR) 10Ori.livneh: [C: 032 V: 032] xhgui: increase memory limit to 512M [puppet] - 10https://gerrit.wikimedia.org/r/272601 (owner: 10Ori.livneh) [21:11:51] (03CR) 10Chad: [C: 032] Remove live-1.5 symlink to w/ directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271072 (owner: 10Chad) [21:11:59] MaxSem: Imma see what happens ^ :P [21:12:23] wee donneed no water... [21:12:31] (03Merged) 10jenkins-bot: Remove live-1.5 symlink to w/ directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271072 (owner: 10Chad) [21:13:35] Hehe, how does one sync a symlink removal from /srv/mediawiki-staging without a full scap? You `sync-dir .` :P :P [21:15:19] !log demon@tin Started scap: removing live-1.5 symlink [21:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:45] MaxSem: Actually that's slower cuz it's gonna find all the PHP files in all those directories first :p [21:18:01] ostriches, when I broke it I actually used dsh to avoid scap [21:18:14] it kinda allowed it to happen quickly [21:18:17] To just sync-common everywhere? 
[21:18:23] ...AND BE READY TO BE REVERTED [21:18:43] nope, just rm :P [21:18:48] hehehe [21:18:55] I'll undo it with dsh if I break it [21:19:06] always leave a back door [21:19:21] (03PS1) 10Ori.livneh: xhgui: enable PHP opcache [puppet] - 10https://gerrit.wikimedia.org/r/272603 [21:19:49] (03PS2) 10Ori.livneh: xhgui: enable PHP opcache [puppet] - 10https://gerrit.wikimedia.org/r/272603 [21:20:32] Shit. [21:20:33] Fuck [21:20:40] that didn't go well [21:20:47] https://www.mediawiki.org/wiki/Manual_talk:Custom_edit_buttons 404 Not Found [21:20:56] Yeah reverting [21:20:57] Right now [21:21:06] Fixed. [21:21:12] :) [21:21:13] Well fuckity fuck fuck [21:21:16] thanks ostriches [21:21:18] What the HELL is still using that shit? [21:21:43] did someone break https://meta.wikimedia.org/wiki/Special:RecentChanges ? [21:21:44] (03CR) 10Krinkle: "Many docroots refer to this. This has been removed twice and broke things twice. This makes three." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271072 (owner: 10Chad) [21:21:46] (03PS1) 10Chad: Revert "Remove live-1.5 symlink to w/ directory" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272604 [21:21:46] or is it just me? [21:21:51] aude: reload [21:22:01] ok [21:22:04] (03CR) 10Chad: [C: 032 V: 032] "Already reverted on cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272604 (owner: 10Chad) [21:22:09] says 404 file not found [21:22:15] It should be fixed already. [21:22:16] hard-reload [21:22:17] https://meta.wikimedia.org/wiki/Special:RecentChanges?jkl works [21:22:19] It was only broken for seconds [21:22:23] even wikitech is gone https://wikitech.wikimedia.org/wiki/ [21:22:26] ok, now it's good :) [21:22:35] Right :) [21:22:36] ok, yeah [21:22:38] we need to purge caches [21:22:41] bblack: hey [21:22:50] fuck [21:22:52] right [21:22:54] stop messing with the symlinks !!!!!!!!!! [21:23:01] why wikitech [21:23:02] how [21:23:05] cached? 
[21:23:07] yep [21:23:19] 5minutes [21:23:28] text && "/wiki" => 5min cache [21:23:45] it's not just cache [21:23:46] wikitech-static .. [21:23:59] wikitech is still broken [21:24:01] && 404 [21:24:01] it's not just caching [21:24:02] hopefully the varnishes dont cache the 403 [21:24:38] https://wikitech-static.wikimedia.org/wiki/Varnish#One-off_purges_.28bans.29 [21:24:40] in wikitech's case, it's not caching [21:24:41] ^ docs on purging [21:25:09] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [21:25:13] I'm getting You don't have permission to access /wiki/ on this server. trying to access wikitech [21:25:25] good to know icinga-wm has a check for wikitech [21:25:28] known problem SMalyshev [21:25:33] ostriches, lrwxrwxrwx 1 mwdeploy wikidev 23 Feb 2 14:25 w -> /srv/mediawiki/live-1.5 [21:25:34] ah, I see you know it already [21:25:47] a ban on `obj.status == 403` is probably warranted [21:25:49] could wikitech content still have been under a /live-1.5 directory which would have been pruned? [21:25:54] but first we have to fix the underlying issue [21:25:55] good to know that we have -static [21:26:10] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [21:26:16] for wikitech it'll need a 403 ban [21:26:18] I'm on silver looking at it [21:26:20] for prod we need a 404 ban I suppose [21:26:36] It didn't get caught in my dsh fix. 
[21:26:38] Fixing manually [21:26:49] (03PS1) 10Ottomata: Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) [21:26:53] fixed now I think [21:26:56] !log ran sync-common on silver [21:26:59] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (72740 200000s) [21:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:08] Wikitech fixed. [21:27:13] yep [21:27:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:27:21] Yeah my dsh fix skipped it, weird. [21:27:45] (03CR) 10Ori.livneh: [C: 032] xhgui: enable PHP opcache [puppet] - 10https://gerrit.wikimedia.org/r/272603 (owner: 10Ori.livneh) [21:27:46] right.. /me waves the smoke away from the servers ... [21:27:54] (03CR) 10jenkins-bot: [V: 04-1] Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [21:28:23] !log demon@tin Finished scap: removing live-1.5 symlink (duration: 13m 03s) [21:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:44] ^ That was the original scap [21:29:06] !log undid previous scap with dsh, live-1.5 is still used. [21:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[21:29:16] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=9&fullscreen [21:30:20] AH00037: Symbolic link not allowed or link target not accessible: /srv/mediawiki/docroot/wikisource.org/w is what we got [21:31:09] 404s died down, so no cache bans necessary https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=9&fullscreen [21:33:00] https://phabricator.wikimedia.org/P2651 [21:33:07] That's our symlink hell in mw-config [21:33:25] ./noc/conf should die [21:34:13] (03PS2) 10Ottomata: Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) [21:34:21] That's the least of my worries here. [21:34:26] It's all the indirect symlinking [21:34:38] !log deployed graphoid https://gerrit.wikimedia.org/r/#/c/272602/ [21:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:46] ori: https://phabricator.wikimedia.org/P2651#10956 specifically [21:35:30] ostriches: yeah I know, I am intimately familiar with it. 
I am pretty sure most of the outages I caused, big and small, were around the whole /a/common /apache clusterfuck [21:35:47] (03PS1) 10Ottomata: Make MySQL instance on analytics1015 the master [puppet] - 10https://gerrit.wikimedia.org/r/272606 (https://phabricator.wikimedia.org/T110090) [21:35:54] (03CR) 10jenkins-bot: [V: 04-1] Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [21:35:57] So, those symlinks (to symlinks) are what broke me [21:36:06] maybe we can get all the "noc" stuff out of there [21:36:11] to simplify [21:36:15] +1 mutante [21:36:27] (03PS3) 10Ottomata: Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) [21:37:10] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2053064 (10Legoktm) The links changed becaus... [21:37:40] (03CR) 10jenkins-bot: [V: 04-1] Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [21:38:23] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [21:38:38] Oh shush you [21:38:56] jynus: you there? [21:40:14] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[21:40:22] !log demon@tin Synchronized README: no-op for co-master sync (duration: 01m 31s) [21:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:36] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2053074 (10RobH) [21:43:16] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2045136 (10RobH) Chris: Please note that I changed the process; depending if the test succeeds or fails. Successful Test: * Leave system online so other opsen can run additional tests.... [21:43:50] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2053091 (10RobH) [21:44:25] ori: #til those symlinks aren't even relative. So if we /tried/ to move MW out of /srv/mediawiki, we'd break shit. [21:44:39] eg: ./docroot/wiktionary.org/w -> /srv/mediawiki/live-1.5 [21:44:57] (which should probably read -> ../w) [21:45:03] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2053096 (10Legoktm) Oh, and that change made... [21:47:12] (03PS1) 10MaxSem: Third time's the charm: kill live-1.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272607 [21:47:25] ostriches, ^^^ :D [21:47:37] ostriches: I think the most sensible way to try out such changes is to make sure nobody is deploying; cherry-pick the change on the deployment host; run sync-common on mw1017; then browse around with X-Wikimedia-Header:1 header. [21:47:46] Yeah [21:47:49] I'm so not touching that again today tho [21:48:06] Carthago delenda est [21:48:13] haha [21:48:40] rofl [21:48:51] also, wtf is docroot/skel-1.5? 
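(Editor's sketch, not part of the log.) The point about non-relative symlinks (`./docroot/wiktionary.org/w -> /srv/mediawiki/live-1.5`) suggests auditing a tree for links with absolute targets before moving or removing anything. The following is a rough sketch under stated assumptions: the helper names `absolute_symlinks`, `relative_equivalent`, and `demo` are hypothetical, and the demo builds a throwaway directory rather than touching any real docroot.

```python
import os
import tempfile


def absolute_symlinks(root):
    """Yield (link_path, target) for symlinks under `root` with an absolute target."""
    # With followlinks=False, symlinks to directories are listed in dirnames
    # (but not descended into); other symlinks are listed in filenames.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.readlink(path)
                if os.path.isabs(target):
                    yield path, target


def relative_equivalent(link_path, target):
    """Return `target` rewritten relative to the directory holding the link."""
    return os.path.relpath(target, start=os.path.dirname(link_path))


def demo():
    """Build a small temporary tree with one absolute and one relative link."""
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        os.mkdir(os.path.join(root, "w"))
        os.mkdir(os.path.join(root, "docroot"))
        os.symlink(os.path.join(root, "w"), os.path.join(root, "docroot", "abs_w"))
        os.symlink("../w", os.path.join(root, "docroot", "rel_w"))
        return [
            (os.path.relpath(path, root), relative_equivalent(path, target))
            for path, target in absolute_symlinks(root)
        ]


print(demo())
```

Only the link with the absolute target is reported, along with a relative path that would point at the same place. This does not follow chains of symlinks to symlinks; checking those would need an extra `os.path.realpath` comparison per link.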
[21:50:56] seriously [21:52:06] uh [21:52:11] gah I used to know [21:52:18] sure don't any more though [21:52:31] I'm up for helping MaxSem with that change if you are not opposed, let's fucking kill that thing [21:52:38] ostriches: ^ [21:52:44] Oh if we wanna do it lez do it. [21:53:37] (03CR) 10Ori.livneh: [C: 032] Third time's the charm: kill live-1.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272607 (owner: 10MaxSem) [21:53:56] nobody sync please [21:54:03] (03Merged) 10jenkins-bot: Third time's the charm: kill live-1.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272607 (owner: 10MaxSem) [21:55:25] MaxSem: staged on mw1017 [21:55:45] Hmm, why is X-Wikimedia-Header not sending me to mw1017 [21:56:23] can't you turn off syncs with uh that flag? [21:56:52] ostriches: X-Wikimedia-Debug [21:57:00] ostriches: it works for all wikis that are handled by the apache pool, so it excludes wikitech [21:57:02] https://phabricator.wikimedia.org/D36 [21:57:03] Oh I copy+pasted ori :) [21:57:55] * ori runs sync-common on silver [21:58:00] (03PS3) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) [21:58:58] MaxSem: https://github.com/search?q=%40wikimedia+live-1.5&type=Code&utf8=%E2%9C%93 [21:59:02] MaxSem, i'm done with tons of depls, taking a break [21:59:16] all reports now go to max [22:00:15] (03CR) 10Tim Landscheidt: "@bd808: They work very well :-)." [puppet] - 10https://gerrit.wikimedia.org/r/264692 (owner: 10Tim Landscheidt) [22:00:23] ori: operations-mediawiki-multiversion is deprecated. 
[22:00:31] (it was the repo before we merged into wmf-config) [22:00:31] nod [22:02:57] !log ori@tin Synchronized w: I1d4f90533: Third time's the charm: kill live-1.5 (duration: 01m 43s) [22:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:52] !log ori@tin Started scap: I1d4f90533: Third time's the charm: kill live-1.5 [22:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:09] (03PS2) 10Dzahn: wikimetrics: rename prod role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/271737 [22:06:12] (03CR) 10Dzahn: [C: 032] wikimetrics: rename prod role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/271737 (owner: 10Dzahn) [22:06:15] RECOVERY - RAID on db2012 is OK: OK: optimal, 1 logical, 2 physical [22:06:50] (03PS1) 10Aaron Schulz: Enable async secondary swift writes for non-"big" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272611 [22:07:58] who broke wikipedia ? [22:08:13] Request from 10.20.0.105 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 4290192285 [22:08:13] Forwarded for: 81.24.121.242, 10.20.0.107, 10.20.0.107, 10.20.0.105 [22:08:14] Error: 503, Service Unavailable at Mon, 22 Feb 2016 22:07:37 GMT [22:08:16] bake now [22:08:20] back now [22:10:10] (03PS1) 10Eevans: disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) [22:12:00] yurik: 5xxs levels are normal; you just got unlucky. [22:12:07] or you're hitting a buggy code-path. 
[22:12:23] ori, i got it twice in a row i think, but sure, might have been a hickup of sorts [22:12:27] not to worry :) [22:12:33] i'm lucky that way :) [22:13:41] (03PS1) 10BryanDavis: Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 [22:14:57] !log ori@tin Finished scap: I1d4f90533: Third time's the charm: kill live-1.5 (duration: 11m 05s) [22:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:02] (03CR) 10Dzahn: "when looking at "watroles, these instances use this role:" [puppet] - 10https://gerrit.wikimedia.org/r/271737 (owner: 10Dzahn) [22:15:13] (03PS4) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) [22:15:15] (03PS1) 10Andrew Bogott: designate: Open firewall to axfr traffic from pdns hosts. [puppet] - 10https://gerrit.wikimedia.org/r/272615 (https://phabricator.wikimedia.org/T124680) [22:16:33] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2053241 (10Cmjohnson) Okay, I will update ticket once completed [22:17:15] (03CR) 10jenkins-bot: [V: 04-1] designate: Open firewall to axfr traffic from pdns hosts. [puppet] - 10https://gerrit.wikimedia.org/r/272615 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [22:17:31] (03PS1) 10Ori.livneh: xhgui: profile 1:10,000 requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272616 [22:17:51] (03CR) 10Ottomata: "This fixed the problem in beta labs, but I'm not sure it should be applied to production. Thoughts?" [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [22:20:43] PROBLEM - Host es2010 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:51] MaxSem, ori: This time with no breakage :) [22:22:56] Yay for killing 10yo tech debt! [22:23:06] 10+, even [22:23:26] gone?? [22:23:31] congrats! 
[22:24:40] (03CR) 10BryanDavis: "If the problem has never been seen in production then we can try making this change as a local commit on deployment-puppetmaster. I think " [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [22:25:51] 7Blocked-on-Operations, 6Operations, 6Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2053318 (10MBinder_WMF) @chasemp just tagging you because your comment was last. Know of any progress towards the plan to unblock this? [22:28:39] (03CR) 10Dzahn: "andrew fixed the role config in the LDAP backend directly, so the instances now use the new role name" [puppet] - 10https://gerrit.wikimedia.org/r/271737 (owner: 10Dzahn) [22:29:01] \o/ [22:29:27] (03PS5) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) [22:29:29] (03PS2) 10Andrew Bogott: designate: Open firewall to axfr traffic from pdns hosts. [puppet] - 10https://gerrit.wikimedia.org/r/272615 (https://phabricator.wikimedia.org/T124680) [22:30:46] (03PS3) 10Dzahn: Tools: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [22:31:47] greg-g: can I reserve the next hour and a half for perf-related syncs, with tgr and Krinkle? [22:32:41] (03CR) 10Dzahn: [C: 032] "has been tested on tools-beta" [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [22:33:58] ostriches: could you approve on behalf of greg-g? [22:34:04] re: > greg-g: can I reserve the next hour and a half for perf-related syncs, with tgr and Krinkle? 
[22:34:34] ori: yeah, that's fine
[22:34:42] thanks
[22:35:05] (PS1) Ottomata: [WIP] Add $cluster param to varnish::instance and set explicitly in role [puppet] - https://gerrit.wikimedia.org/r/272619
[22:35:08] * ostriches +1s greg-g's approval
[22:35:12] Looks guys, I have a working keyboard!
[22:35:58] (CR) Ottomata: [C: -1] "Still WIP, needs Brandon's eyes." [puppet] - https://gerrit.wikimedia.org/r/272619 (owner: Ottomata)
[22:37:28] (PS1) Hashar: Glance policy: grant manage_image_cache permission [puppet] - https://gerrit.wikimedia.org/r/272621 (https://phabricator.wikimedia.org/T127755)
[22:39:24] (CR) Andrew Bogott: [C: 2] "this is clearly harmless." [puppet] - https://gerrit.wikimedia.org/r/272621 (https://phabricator.wikimedia.org/T127755) (owner: Hashar)
[22:42:04] !log powercycling es2010
[22:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:45:13] (PS3) Andrew Bogott: designate: Open firewall to axfr traffic from pdns hosts. [puppet] - https://gerrit.wikimedia.org/r/272615 (https://phabricator.wikimedia.org/T124680)
[22:46:34] Operations, ops-codfw, DBA: es2010 controller issue - https://phabricator.wikimedia.org/T127769#2053409 (jcrespo)
[22:46:53] (CR) Andrew Bogott: [C: 2] designate: Open firewall to axfr traffic from pdns hosts. [puppet] - https://gerrit.wikimedia.org/r/272615 (https://phabricator.wikimedia.org/T124680) (owner: Andrew Bogott)
[22:47:58] ACKNOWLEDGEMENT - Host es2010 is DOWN: PING CRITICAL - Packet loss = 100% Jcrespo https://phabricator.wikimedia.org/T127769
[22:48:59] (CR) Dzahn: "confirmed noop on: tools-bastion-05, tools-docker-registry-01, tools-flannel-etcd-03, tools-exec-1201" [puppet] - https://gerrit.wikimedia.org/r/269902 (owner: Tim Landscheidt)
[22:49:54] (PS6) Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680)
[22:52:05] RECOVERY - RAID on db2019 is OK: OK: optimal, 1 logical, 6 physical
[22:52:47] (PS1) Volans: Depooled es2010, controller issue [mediawiki-config] - https://gerrit.wikimedia.org/r/272628 (https://phabricator.wikimedia.org/T127769)
[22:56:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 61.54% of data above the critical threshold [5000000.0]
[22:58:49] (PS1) Krinkle: wmfstatic: Set MW_NO_SESSION to 'warn' [mediawiki-config] - https://gerrit.wikimedia.org/r/272630
[22:59:03] (Abandoned) Ottomata: [WIP] Add $cluster param to varnish::instance and set explicitly in role [puppet] - https://gerrit.wikimedia.org/r/272619 (owner: Ottomata)
[22:59:26] (CR) Gergő Tisza: [C: 1] wmfstatic: Set MW_NO_SESSION to 'warn' [mediawiki-config] - https://gerrit.wikimedia.org/r/272630 (owner: Krinkle)
[22:59:45] (CR) Krinkle: [C: 2] wmfstatic: Set MW_NO_SESSION to 'warn' [mediawiki-config] - https://gerrit.wikimedia.org/r/272630 (owner: Krinkle)
[23:00:39] (PS2) Tim Landscheidt: Tools: Remove obsolete classes [puppet] - https://gerrit.wikimedia.org/r/272441
[23:02:00] (Merged) jenkins-bot: wmfstatic: Set MW_NO_SESSION to 'warn' [mediawiki-config] - https://gerrit.wikimedia.org/r/272630 (owner: Krinkle)
[23:04:57] !log krinkle@tin Synchronized w/static.php: Set MW_NO_SESSION to warn (duration: 01m 34s)
[23:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:08:59] (PS2) Dzahn: quarry: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/270097 (owner: Tim Landscheidt)
[23:09:43] (CR) Dzahn: [C: 2] quarry: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/270097 (owner: Tim Landscheidt)
[23:10:50] Operations, Editing-Department, Performance-Team, Performance, user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2021309 (ori) >>! In T126700#2050885, @Anomie wrote: >>>! In T126700#2050715, @ori wrote: >> @Anomie, when I snoop Redis GETs I see something even...
[23:11:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[23:11:46] (CR) Dzahn: "confirmed noop on: quarry-main-01.quarry, quarry-runner-01, quarry-runner-02" [puppet] - https://gerrit.wikimedia.org/r/270097 (owner: Tim Landscheidt)
[23:12:03] (CR) Dzahn: "ok, thanks i merged yours instead https://gerrit.wikimedia.org/r/#/c/270097/" [puppet] - https://gerrit.wikimedia.org/r/260187 (owner: Dzahn)
[23:12:25] (Abandoned) Dzahn: quarry: use one file per class, autoloader layout [puppet] - https://gerrit.wikimedia.org/r/260187 (owner: Dzahn)
[23:13:20] (CR) Dzahn: "on which instances does this run please" [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[23:50:46] (PS7) Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680)
[23:52:16] (CR) jenkins-bot: [V: -1] Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) (owner: Andrew Bogott)
[23:52:38] (CR) DarTar: [C: 1] "using correct hyphenated value, per discussion at https://phabricator.wikimedia.org/T87276#2053177" [mediawiki-config] - https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: BBlack)