[00:10:27] (03PS3) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [00:11:31] (03CR) 10Tim Starling: "%{msec_frac}t is also 2.2.30+ :( At least that one is forwards-compatible." [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [00:14:26] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1992430 (10RobH) A spinning disk can be securely wiped, where SSDs aren't really securely wiped in most instances. (Trim support is varied and not regularly applied across differing ssd vendors/models.) I'm not sure a h... [00:20:39] (03CR) 10Ori.livneh: [C: 031] "Looks good. I tested it on trusty and precise. Be aware that after Puppet applies the change, it will send Apache a USR1 (graceful) signal" [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [00:50:44] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:57:13] (03PS1) 10EBernhardson: Create pool counter for CirrusSearch completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268029 (https://phabricator.wikimedia.org/T125547) [01:15:45] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:29:04] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [02:30:15] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [02:50:56] (03PS1) 10Yuvipanda: tools: Make flannel use proper etcd hosts on proxy too [puppet] - 10https://gerrit.wikimedia.org/r/268041 [02:51:21] (03PS2) 10Yuvipanda: tools: Make flannel use proper etcd hosts on proxy too [puppet] - 10https://gerrit.wikimedia.org/r/268041 [02:51:30] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make flannel use proper etcd hosts on proxy too [puppet] - 10https://gerrit.wikimedia.org/r/268041 (owner: 10Yuvipanda) [03:03:11] (03PS1) 10Tim Landscheidt: Tools: Deploy 
root web automatically again [puppet] - 10https://gerrit.wikimedia.org/r/268043 [03:04:51] (03PS2) 10Yuvipanda: dynamicproxy: Remove obsolete code [puppet] - 10https://gerrit.wikimedia.org/r/267523 (owner: 10Tim Landscheidt) [03:05:01] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Remove obsolete code [puppet] - 10https://gerrit.wikimedia.org/r/267523 (owner: 10Tim Landscheidt) [03:05:59] (03PS3) 10Yuvipanda: shinken: Add role::labs::instance as hostgroup to all instances [puppet] - 10https://gerrit.wikimedia.org/r/267039 (https://phabricator.wikimedia.org/T123271) (owner: 10Tim Landscheidt) [03:06:05] (03CR) 10Yuvipanda: [C: 032 V: 032] shinken: Add role::labs::instance as hostgroup to all instances [puppet] - 10https://gerrit.wikimedia.org/r/267039 (https://phabricator.wikimedia.org/T123271) (owner: 10Tim Landscheidt) [03:07:09] (03PS2) 10Yuvipanda: shinken: Indent and align generated configuration [puppet] - 10https://gerrit.wikimedia.org/r/267426 (owner: 10Tim Landscheidt) [03:08:57] (03CR) 10Yuvipanda: [C: 032 V: 032] shinken: Indent and align generated configuration [puppet] - 10https://gerrit.wikimedia.org/r/267426 (owner: 10Tim Landscheidt) [03:09:15] (03PS2) 10Yuvipanda: Tools: Deploy root web automatically again [puppet] - 10https://gerrit.wikimedia.org/r/268043 (owner: 10Tim Landscheidt) [03:09:22] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Deploy root web automatically again [puppet] - 10https://gerrit.wikimedia.org/r/268043 (owner: 10Tim Landscheidt) [03:25:04] 6operations, 6Services, 10Traffic, 5Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1992753 (10BBlack) [03:25:08] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1992751 (10BBlack) 5Open>3stalled Re: timing - per https://lists.wikimedia.org/pipermail/wikitech-l/2016-Jan... 
[03:33:36] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: puppet fail [03:35:31] (03PS2) 10Yuvipanda: toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 [03:35:51] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 (owner: 10Yuvipanda) [03:36:21] (03PS3) 10Yuvipanda: toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 [03:42:23] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [04:00:25] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:09:25] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:18:15] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 13.04% of data above the critical threshold [100000000.0] [04:23:10] (03CR) 10Tim Landscheidt: toollabs: Do not hardcode Host header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/267402 (owner: 10Yuvipanda) [04:43:04] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:06:43] (03PS4) 10Yuvipanda: toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 [05:07:03] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 (owner: 10Yuvipanda) [05:08:33] (03PS5) 10Yuvipanda: toollabs: Do not hardcode Host header [puppet] - 10https://gerrit.wikimedia.org/r/267402 [05:09:24] (03CR) 10Yuvipanda: toollabs: Do not hardcode Host header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/267402 (owner: 10Yuvipanda) [05:56:01] (03CR) 10Dzahn: [C: 031] "yes!" 
[puppet] - 10https://gerrit.wikimedia.org/r/267929 (owner: 10Chad) [05:57:49] (03CR) 10Dzahn: "from puppet/modules/git/manifests/clone.pp" [puppet] - 10https://gerrit.wikimedia.org/r/267929 (owner: 10Chad) [06:00:54] (03PS1) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [06:01:19] (03CR) 10jenkins-bot: [V: 04-1] A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [06:02:50] (03PS2) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [06:31:06] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail [06:32:06] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:35] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:56] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:36] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:58:35] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:55] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:22:12] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1993024 (10jcrespo) Let me create a new user. And give me your ruthenium login. 
I will setup mysql-client install and a .my.cnf on puppet. [07:34:43] (03PS1) 10Yuvipanda: tools: Add kube-system service account to abac [puppet] - 10https://gerrit.wikimedia.org/r/268050 [07:35:02] (03CR) 10jenkins-bot: [V: 04-1] tools: Add kube-system service account to abac [puppet] - 10https://gerrit.wikimedia.org/r/268050 (owner: 10Yuvipanda) [07:37:15] (03PS2) 10Yuvipanda: tools: Add kube-system service account to abac [puppet] - 10https://gerrit.wikimedia.org/r/268050 [07:38:09] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add kube-system service account to abac [puppet] - 10https://gerrit.wikimedia.org/r/268050 (owner: 10Yuvipanda) [07:42:37] 6operations, 6Release-Engineering-Team, 3Scap3: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#1993054 (10jcrespo) 3NEW [08:10:10] (03PS1) 10Jcrespo: Repool db1063 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268053 [08:12:43] (03CR) 10Jcrespo: [C: 032] Repool db1063 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268053 (owner: 10Jcrespo) [08:24:19] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool db1063 with low weight (duration: 01m 20s) [08:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:11] (03PS2) 10Giuseppe Lavagetto: scap::master: add commit-msg hook [puppet] - 10https://gerrit.wikimedia.org/r/267894 [08:31:23] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap::master: add commit-msg hook [puppet] - 10https://gerrit.wikimedia.org/r/267894 (owner: 10Giuseppe Lavagetto) [08:33:09] (03PS2) 10Giuseppe Lavagetto: deploy master: recurse submodules on clone [puppet] - 10https://gerrit.wikimedia.org/r/267929 (owner: 10Chad) [08:34:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry) [08:34:56] (03CR) 10Giuseppe Lavagetto: [C: 
032 V: 032] deploy master: recurse submodules on clone [puppet] - 10https://gerrit.wikimedia.org/r/267929 (owner: 10Chad) [08:39:23] !log disabling puppet on iodine, mendelevium, OTRS migration [08:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:40:46] !log stop exim4, cron, apache2 on iodine, mendelevium [08:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:41:26] (03PS1) 10Giuseppe Lavagetto: scap::master: fix commit hook reference [puppet] - 10https://gerrit.wikimedia.org/r/268054 [08:41:59] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::master: fix commit hook reference [puppet] - 10https://gerrit.wikimedia.org/r/268054 (owner: 10Giuseppe Lavagetto) [08:42:06] (03CR) 10Giuseppe Lavagetto: [V: 032] scap::master: fix commit hook reference [puppet] - 10https://gerrit.wikimedia.org/r/268054 (owner: 10Giuseppe Lavagetto) [08:42:55] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [08:44:11] !log stop slave on db2011, db1020's (m2-master) slave, for OTRS migration. DO NOT ENABLE [08:44:13] jynus_: ^ [08:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:38] I can't find an alert on icinga for this btw [08:44:53] seems like dbstores have an alert for m2 slave lag, but not db2011 [08:45:14] there is not [08:45:37] ok, cool [08:45:57] one less change somebody apart from you finds out and runs an ill informed start slave [08:46:04] s/change/chance/ [08:46:57] <_joe_> akosiaris: otrs upgrade today? [08:47:05] yes [08:47:10] do you want me to create a binary backup first? [08:47:14] \o/ [08:47:23] jynus: if you can, that would be awesome [08:47:36] I am not sure if I can, only if there is space [08:47:53] errr [08:47:56] sudo vgs [08:47:56] /dev/tank/snap09110815: read failed after 0 of 4096 at 1521311744000: Input/output error [08:47:57] ... 
[08:48:00] that does not sound good [08:48:05] on db1020 [08:48:11] that should be normal [08:48:15] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:48:16] if it is very old [08:48:49] ok then [08:49:11] I checked it and was going to delete it, but got distracted [08:49:31] I see 11g free at the vg btw [08:49:40] let me run a binary backup and leave it on the host itself [08:49:44] should I create a snapshot ? or that does not make sense ? [08:49:55] if my backup works, no need [08:49:59] ok [08:50:04] waiting for you then [08:51:34] (03PS1) 10Alexandros Kosiaris: Remove mendelevium's OTRS config override [puppet] - 10https://gerrit.wikimedia.org/r/268055 (https://phabricator.wikimedia.org/T74109) [08:51:35] !log starting mysql backup on db1020 (/srv/backups) [08:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:51:50] I can give you an ETA [08:52:12] but reads and writes should be continuing as usual for now [08:52:44] !log stop otrs-daemon on mendelevium [08:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:04] btw, migration procedure is in https://etherpad.wikimedia.org/p/otrs-migration if anyone is interested [08:53:51] (03CR) 10Alexandros Kosiaris: [C: 032] Remove mendelevium's OTRS config override [puppet] - 10https://gerrit.wikimedia.org/r/268055 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [08:55:30] (03PS2) 10Alexandros Kosiaris: Remove mendelevium's OTRS config override [puppet] - 10https://gerrit.wikimedia.org/r/268055 (https://phabricator.wikimedia.org/T74109) [08:55:43] (03PS3) 10Alexandros Kosiaris: Remove mendelevium's OTRS config override [puppet] - 10https://gerrit.wikimedia.org/r/268055 (https://phabricator.wikimedia.org/T74109) [08:55:48] (03CR) 10Alexandros Kosiaris: [V: 032] Remove mendelevium's OTRS config override [puppet] - 
10https://gerrit.wikimedia.org/r/268055 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [08:58:11] akosiaris, the ETA is a bit over an hour. If that is not acceptable, we can create a snapshot [08:58:43] in other words, 10:20 UTC [08:58:51] I think I can live with that. [09:00:02] as it is not a dedicated host, writes will continue for other databases, which makes snapshots more risky [09:00:16] in perf and running out of space [09:02:52] good news about binary backups is that they take 0 seconds to recover [09:03:04] yay! [09:04:03] (03PS1) 10Alexandros Kosiaris: Update mendelevium's OTRS config [puppet] - 10https://gerrit.wikimedia.org/r/268056 (https://phabricator.wikimedia.org/T74109) [09:04:22] (03PS2) 10Alexandros Kosiaris: Update mendelevium's OTRS config [puppet] - 10https://gerrit.wikimedia.org/r/268056 (https://phabricator.wikimedia.org/T74109) [09:04:35] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update mendelevium's OTRS config [puppet] - 10https://gerrit.wikimedia.org/r/268056 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [09:11:48] 6operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1993217 (10elukey) >>! In T123711#1972487, @Dzahn wrote: > list of servers: Lock Manager Redis, need a change in operations/mediawiki-config beforehand > mc1001.eqiad.wmnet: True > mc1002.eqiad.wmnet: True > mc... 
[09:14:31] <_joe_> win 36 [09:20:54] (03PS7) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 [09:22:29] (03CR) 10Alexandros Kosiaris: [C: 032] ticket.wikimedia.org: Lower TTL down to 5M [dns] - 10https://gerrit.wikimedia.org/r/267871 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [09:22:34] (03PS2) 10Alexandros Kosiaris: ticket.wikimedia.org: Lower TTL down to 5M [dns] - 10https://gerrit.wikimedia.org/r/267871 (https://phabricator.wikimedia.org/T74109) [09:22:41] <_joe_> akosiaris: whoo [09:22:49] (03CR) 10Alexandros Kosiaris: [V: 032] ticket.wikimedia.org: Lower TTL down to 5M [dns] - 10https://gerrit.wikimedia.org/r/267871 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [09:28:22] akosiaris, 224G/379G [09:28:49] nice [09:36:53] (03PS1) 10Jcrespo: Add echo tables to the list of private tables [puppet] - 10https://gerrit.wikimedia.org/r/268060 (https://phabricator.wikimedia.org/T125591) [09:41:36] I would like to do at some point some maintenance on otrs db, maybe we can reduce the size of that a bit [09:44:16] gehel: hi! how far in the onboarding were you able to get yesterday? [09:45:05] 6operations, 10DBA, 6Labs, 5Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1993287 (10jcrespo) Related: T119154 [09:45:19] godog: it went surprisingly well. 
I managed to log in to my first wikimedia server in lab [09:49:05] <_joe_> !log doing some basic load test on appservers in eqiad [09:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:09] expect a small glitch in gerrit/otrs in some minutes while backups finish (hopefully only for 1 second) [09:50:34] !log restarting neodymium for kernel update [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:12] it was less than 1 second [09:51:37] akosiaris, 160203 09:51:01 innobackupex: completed OK! [09:51:45] you have green light [09:53:00] !log m2 backup finished on /srv/backups/2016-02-03_08-51-06, filename 'db1020-bin.000842', position 220103947 [09:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:04] (03PS1) 10ArielGlenn: dumps: add labtestwiki to list of wikis we don't dump [puppet] - 10https://gerrit.wikimedia.org/r/268064 [09:54:34] (03PS2) 10ArielGlenn: dumps: add labtestwiki to list of wikis we don't dump [puppet] - 10https://gerrit.wikimedia.org/r/268064 [09:56:19] (03CR) 10ArielGlenn: [C: 032] dumps: add labtestwiki to list of wikis we don't dump [puppet] - 10https://gerrit.wikimedia.org/r/268064 (owner: 10ArielGlenn) [09:57:06] PROBLEM - salt-minion processes on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:57:59] gehel: also you may want to subscribe to ops@ mailing list and engineering@ if not already, once you have an irc cloak and nick registered we can add you to additional channels [10:01:11] (03CR) 10Joal: [C: 031] "LGTM (but I'm no restbase/puppet expert)" [puppet] - 10https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: 10Eevans) [10:02:32] godog: engineering, already done [10:03:05] godog: ops@, can't find it in the list. It's hidden somewhere ? 
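An aside on the backup progress reported above (224G of 379G at 09:28:22, for the backup directory stamped 2016-02-03_08-51-06): a finish time can be extrapolated from those figures. This is only a minimal sketch that assumes the copy rate stays constant; the function name `estimate_finish` is hypothetical and is not part of innobackupex or any tool used in the channel.

```python
from datetime import datetime

def estimate_finish(start, now, done, total):
    """Linear extrapolation: assume the remaining data copies at the
    same average rate as the data copied so far."""
    elapsed = now - start
    # Remaining time scales with the ratio of remaining to completed work.
    remaining = elapsed * ((total - done) / done)
    return now + remaining

# Figures taken from the log: backup directory timestamp and the
# "224G/379G" progress report.
start = datetime(2016, 2, 3, 8, 51, 6)
now = datetime(2016, 2, 3, 9, 28, 22)
eta = estimate_finish(start, now, done=224, total=379)
print(eta.strftime("%H:%M"))
```

Under that assumption the estimate lands a few minutes after the 09:51:01 completion that innobackupex actually reported, so the real copy finished slightly faster than the average rate up to that point would suggest.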
[10:03:42] gehel: yeah maybe it isn't listed, https://lists.wikimedia.org/mailman/listinfo/Ops [10:03:47] 6operations, 6Analytics-Kanban, 10hardware-requests, 5Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1993367 (10JAllemandou) I +1ed the CR for changing restbase read consistency to one (it'll be good even with SSDs). The change on code for replication factor set to... [10:03:51] godog: got it ... [10:04:39] godog: seems I need approval to join ops@. Should I just wait ? Or do I need to ping someone ? [10:05:53] gehel: yeah should be happening in <1d, also you'll need a task similar to this https://phabricator.wikimedia.org/T122925 to request access, some steps might be done already tho [10:10:15] godog: I'll have a look and create a ticket. How can I find which groups I need access to? [10:10:33] godog: I should probably ask ebernhardson [10:11:45] RECOVERY - salt-minion processes on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:11:52] !log reboot francium for kernel update [10:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:22] (03PS3) 10ArielGlenn: Support bzip2 compression format [puppet] - 10https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [10:16:42] gehel: yep that makes sense [10:18:03] (03CR) 10ArielGlenn: [C: 032] Support bzip2 compression format [puppet] - 10https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [10:19:57] jynus: database schema migration ongoing... it is going to take a few hours IIRC [10:20:34] do you want me to do something, except the usual db monitoring? [10:20:40] nope [10:20:43] I am monitoring as well btw [10:20:54] ALTER TABLE article ADD a_message_id_md5 ... 
this should take a while [10:21:21] I think that may actually end up doing most of the maintenance I wanted to do [10:21:33] as table should be recreated [10:21:41] even better! what more could you ask ? :P [10:21:49] (03PS2) 10ArielGlenn: salt: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266981 (owner: 10Dzahn) [10:22:06] sadly, there is a huge ibdata1, so it will not help with saving space [10:22:32] both s1 and s2 are very old installs, and probably among the first misc servers to replace [10:22:52] m1 and m2 you mean [10:22:55] right ? [10:22:58] yes, sorry [10:24:09] at some point i will need your assistance and mut*nte's to do some cleanup of old, unused/test dbs [10:25:17] (03PS1) 10Filippo Giunchedi: swift: point swift to local datacenter imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/268080 [10:27:35] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1993406 (10ArielGlenn) Small/big wiki dumps were hung on start of dump run for labtestwiki which can't be dumped from snapshots. I added that to the list of dbs to skip and those jobs are... [10:27:51] (03CR) 10ArielGlenn: [C: 032] salt: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266981 (owner: 10Dzahn) [10:35:04] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1993413 (10Addshore) >>! In T124418#1985370, @ori wrote: > Distribution of (indirect) callers of `HTMLCacheU... [10:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 721 [10:35:18] (03CR) 10Merlijn van Deen: "For future reference: the automatic deployment was removed when NFS was removed from the webproxy hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/268043 (owner: 10Tim Landscheidt) [10:37:28] <_joe_> !log ending the load test on the eqiad apaches [10:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:05] 6operations, 7Monitoring, 7Tracking: consolidate graphite metrics monitoring frontends into grafana - https://phabricator.wikimedia.org/T125644#1993425 (10fgiunchedi) 3NEW a:3fgiunchedi [10:40:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1278115 Threads: 2 Questions: 7343492 Slow queries: 8640 Opens: 3064 Flush tables: 2 Open tables: 400 Queries per second avg: 5.745 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:40:16] gehel: also you'll need a registered irc nick and an irc cloak if you haven't one already, https://meta.wikimedia.org/wiki/IRC/Cloaks [10:41:23] godog: nick is registered, I requested a cloak, so it should be on the way [10:43:22] (03PS1) 10Filippo Giunchedi: grafana: add dashboard import tool [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) [10:43:34] <_joe_> well, the cloak tends to take time [10:44:13] <_joe_> so you might set ENFORCE [10:44:16] <_joe_> /msg NickServ SET ENFORCE ON [10:48:28] !log depooling restbase1001 for kernel/Java update [10:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:02] gehel: o/ - let me know if I can help (analytics ops) [10:53:08] (03PS1) 10Jcrespo: Repool db1063 at 100% load; depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268083 [10:55:49] (03CR) 10Jcrespo: [C: 032] Repool db1063 at 100% load; depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268083 (owner: 10Jcrespo) [10:57:21] jynus: schema upgrade done, now running the "migration script" [10:57:29] https://github.com/OTRS/otrs/blob/rel-3_3/UPGRADING.md#database-migration-script [10:57:41] still in 3.3... 
need to do that for 4.0 and 5.0 as well [10:57:42] sigh [10:58:02] Step 3 of 13: Generate MessageID md5sums... [10:58:19] * jynus crosses fingers [10:58:20] (03CR) 10Addshore: "Cool!" [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [11:00:21] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool db1063 at 100% load; depool db1067 for maintenance (duration: 01m 16s) [11:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:57] !log OTRS database upgraded to 3.3, moving on with 4.0 [11:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:31] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1993526 (10daniel) >>! In T124418#1993413, @Addshore wrote: > In the last 20 minutes we had roughly 7000 edi... [11:11:06] !log repooling restbase1001 [11:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:11] (03CR) 10Aude: [C: 04-1] Enable data type mathematical expression on wikidata.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267405 (https://phabricator.wikimedia.org/T124931) (owner: 10Llyrian) [11:15:15] (03PS1) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [11:16:16] (03PS1) 10Aude: Enable math data type on test wikidata + test wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 [11:18:01] addshore: cool! 
re: wmde/grafana-dashboards, we could import them in puppet too if that's easier [11:18:24] (03CR) 10Filippo Giunchedi: "sample dashboard add: https://gerrit.wikimedia.org/r/#/c/268085/" [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [11:18:25] godog: yeh that would be cool! although the issue would be if we want to change one we have to wait for someone from ops to merge it [11:18:35] which would not be ideal :/ [11:20:06] addshore: yeah that's a good point, I don't know what grafana does when changing a dashboard that lives on the filesystem too, ideally the puppet merge would be needed just to 'freeze' it [11:20:20] dont we have a command to perform hiera() lookups ? [11:20:39] the built-in "hiera" is not properly configured [11:20:49] * aude waves [11:20:52] hashar: yeah it is in utils/ iirc [11:21:09] did we resolve the problems with tin [11:21:10] ? [11:21:11] godog: in puppet.git ? [11:21:12] godog: I mean, another option would be to freeze the dashboard into a different repo that more people have access to? and make puppet pull it from therE? [11:21:27] godog: got it danke [11:24:02] addshore: well, these days with puppetswat, these kinds of trivial changes get merged pretty quickly [11:24:16] addshore: yup that's an option too, I'm also wondering what grafana would do when changing an existing dashboard from the ui that's also in git [11:24:42] when did you start gehel ? [11:24:53] Krenair: this Monday [11:24:59] godog: they 're readonly, home is in git [11:25:04] gehel, you're going to be learning for a while :) [11:25:11] as in, the "Home" dashboard [11:25:26] only if they are marked as readonly! ;) [11:25:34] Krenair: I love learning! But at some point I'm going to need to actually do something... [11:25:44] Krenair: I know, I still have time for that... 
[11:26:02] and akosiaris I agree, but if it is possible to have them in a separate repo / not have to get ops to merge things that must be a win for everyone? ;) [11:26:38] addshore: depends on how convoluted the setup becomes. If it is not, then yes [11:28:22] akosiaris: shouldn't be too convoluted. puppet can pull the repo to somewhere and copy the dashboards from that location to the actual grafana store (rather than just the copy in puppet) [11:29:05] godog: ^^ cloning the repo to a location that is not the actual grafana store and doing a copy would also get rid of the issue of people maybe editing things that are stored in git? [11:29:12] (03CR) 10Filippo Giunchedi: [C: 04-1] "pending: figure out how to change a dashboard that's in git from the ui (without puppet changing it back) and submit/review the changes to" [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [11:29:44] ahhh [11:30:36] addshore: wouldn't fix it fully, unless perhaps the editing is done on another grafana instance (?) [11:35:39] (03PS1) 10ArielGlenn: fix variable name in dump cron job for hugewikis [puppet] - 10https://gerrit.wikimedia.org/r/268088 [11:35:54] godog: the poor deployment-ms-be01 and deployment-ms-be02 on beta cluster are complaining and want me to run 'dpkg --configure -a'. I guess the instances are still a work in progress and I can ignore ? [11:37:05] hashar: no it shouldn't, looks like it was the kernel upgrade, I just ran dpkg [11:38:07] (03CR) 10ArielGlenn: [C: 032] fix variable name in dump cron job for hugewikis [puppet] - 10https://gerrit.wikimedia.org/r/268088 (owner: 10ArielGlenn) [11:38:13] godog: looks better, thanks. 
I am not doing any further command on those hosts, I dont want to break them while you polish them up [11:40:22] hashar: nah that's fine it should work already as is [11:40:30] upgrading :-} [11:40:40] (03CR) 10JanZerebecki: [C: 031] Enable math data type on test wikidata + test wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 (owner: 10Aude) [11:44:02] akosiaris: good luck with the OTRS upgrade :) */me bring ak a coffee* [11:44:37] godog: hmmmm [11:45:02] (03CR) 10Filippo Giunchedi: "did the statsd client change in T121231 (or upstream https://github.com/sivy/node-statsd/pull/61) made it in too? generally +1 for this!" [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [11:45:16] !log restarting and reconfiguring mysql at db1067 [11:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:00] Steinsplitter: thanks :-) [11:46:50] (03PS1) 10Ema: Build-depend on pkg-config to please autoreconf [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/268089 (https://phabricator.wikimedia.org/T124281) [11:48:04] (03CR) 10Ema: [C: 032 V: 032] Build-depend on pkg-config to please autoreconf [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/268089 (https://phabricator.wikimedia.org/T124281) (owner: 10Ema) [11:48:43] (03PS1) 10Jcrespo: Reconfigure db1067 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268090 [11:49:21] (03CR) 10Jcrespo: [C: 032] Reconfigure db1067 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268090 (owner: 10Jcrespo) [11:54:06] I think there may be some issues with cirrussearch autocomplete, it spikes at times [11:54:26] jynus: I noticed that yesterday as well [11:55:11] (03CR) 10Filippo Giunchedi: "note: pending related phab access ticket" [puppet] - 10https://gerrit.wikimedia.org/r/267919 (owner: 10Gehel) [11:55:22] ReduceSearchPhaseException[Failed to execute phase [merge], [reduce] ]; nested: 
EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction$AsyncAction [11:55:40] guess the queue is full and rejects [12:00:08] (03PS4) 10Filippo Giunchedi: Raise file upload limit to 2047 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ) [12:03:21] (03CR) 10Gehel: "Phabricator task T125651 created, some more details needed." [puppet] - 10https://gerrit.wikimedia.org/r/267919 (owner: 10Gehel) [12:06:04] hashar: we are rebooting elastic machines in eqiad I think it's the cause, the cluster is already very busy without reboots so... [12:06:36] <_joe_> yeah I see depools happening [12:10:15] PROBLEM - Host elastic1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:10:40] ^ that's me, I marked it as downtime in icinga? [12:10:43] * aude noticed that extensions have wmf/1.27.0-wmf.12 branch and wmf/1.27.0-wmf12 branch [12:11:07] (03PS6) 10ArielGlenn: new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [12:11:32] core only has wmf/1.27.0-wmf.12 branch and appears to be using wmf/1.27.0-wmf.12 of extensions [12:11:34] fixed, maybe I set the downtime too short and it expired [12:11:43] (so probably ok) [12:12:36] RECOVERY - Host elastic1006 is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [12:13:50] aude: yeah we screwed the branching yesterday [12:13:54] the branch with a dot should be the proper one [12:18:14] (03PS1) 10Jcrespo: Repool db1067 at low weight; depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268093 [12:19:46] hashar: ok [12:28:39] (03PS7) 10ArielGlenn: new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [12:32:53] can someone tell what i am doing wrong with indentation of => in 
https://gerrit.wikimedia.org/r/#/c/267864/ ? [12:33:16] WARNING: indentation of => is not properly aligned [12:33:41] or where these rules come from? [12:40:28] aude, perhaps you added too many spaces? [12:40:51] or there's a tab in there somewhere? [12:41:00] hm, no, gerrit would show that [12:42:44] Krenair: i just don't know how many spaces is correct and seems to vary [12:42:58] * aude guess 4 spaces [12:45:06] (03CR) 10Physikerwelt: [C: 031] Enable math data type on test wikidata + test wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 (owner: 10Aude) [12:45:19] 7Blocked-on-Operations, 6operations, 6Services, 6WMDE-Analytics-Engineering, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1993645 (10Addshore) [12:45:59] aude, I think you only need to have one space for the longest one, and then align the rest with that [12:46:28] (03PS1) 10DCausse: Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268097 [12:47:00] Krenair: yeah, think i just figured that out [12:47:05] and will try it [12:47:13] 6operations, 10Datasets-General-or-Unknown, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1993650 (10Addshore) Any ETA on when we could start the rsync @ArielGlenn ? 
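[editor's note] On the `=>` alignment warning discussed at 12:32-12:47: the sketch below illustrates the rule as Krenair describes it (one space after the longest key, every other arrow padded to the same column). It is only an illustration under that assumption, not puppet-lint's actual implementation. The helper name `align_arrows`, the 4-space indent, and the sample keys are hypothetical.

```python
def align_arrows(pairs, indent=4):
    """Format (key, value) pairs so every '=>' sits in the same column.

    Assumption taken from the chat: the longest key gets exactly one
    space before '=>', and shorter keys are padded to match it.
    """
    width = max(len(key) for key, _ in pairs)
    return [
        f"{' ' * indent}{key.ljust(width)} => {value},"
        for key, value in pairs
    ]


# Hypothetical resource attributes, used only to show the padding.
lines = align_arrows([("ensure", "present"), ("owner", "'root'"), ("mode", "'0444'")])
print("\n".join(lines))
```

With these sample keys, `ensure` is the longest, so `owner` gets two spaces and `mode` gets three before the arrow.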
[12:49:20] (03PS3) 10Alexandros Kosiaris: misc-web: Route ticket.wikimedia.org to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267868 (https://bugzilla.wikimedia.org/74109) [12:49:22] (03PS3) 10Alexandros Kosiaris: otrs: OTRS search slow, increase between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/267867 (https://bugzilla.wikimedia.org/74109) [12:49:24] (03PS2) 10Alexandros Kosiaris: otrs: Route OTRS email to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267888 (https://phabricator.wikimedia.org/T74109) [12:49:26] (03PS1) 10Alexandros Kosiaris: otrs: redirect iodine to the test database [puppet] - 10https://gerrit.wikimedia.org/r/268098 [12:49:38] jenkins is happy with my patch now [12:52:29] (03CR) 10Aude: "we will deploy this in a few hours during swat, assuming problems yesterday with deployments are resolved. (which is why we didn't do this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 (owner: 10Aude) [12:53:16] (03CR) 10Gehel: [C: 031] "Looks good to me (but what do I know). I'm pretty sure that this single line change has impacts that I cannot even imagine ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268097 (owner: 10DCausse) [12:57:30] (03PS8) 10ArielGlenn: new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [12:58:45] (03CR) 10jenkins-bot: [V: 04-1] new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [13:06:25] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1993677 (10akosiaris) 5Open>3Resolved I 've updated the ACL for codfw as well. I 've tested it already and seems to... 
[13:21:37] 6operations, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1993728 (10elukey) Actual status: kafka1014.eqiad.wmnet: Filesystem Size Used Avail Use% Mounted on /dev/sda3 1.8T 618G 1.2T 35% /var/spool/kafka/a /dev/sdb3 1.8T... [13:25:03] (03CR) 10Jcrespo: [C: 032] Repool db1067 at low weight; depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268093 (owner: 10Jcrespo) [13:27:48] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool db1067 at low weight; depool db1054 (duration: 01m 16s) [13:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:28] (03CR) 10Filippo Giunchedi: "LGTM, will it auto-create containers when doing async replication too?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz) [13:36:50] (03PS9) 10ArielGlenn: new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [13:38:19] 6operations, 10Traffic, 5Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1993797 (10ema) The initial WMF packaging of [[https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/libvmod-netmapper | libvmod-netmapper]] and [[https://... 
[13:39:54] !log restarting and reconfiguring mysql at db1054 [13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:05] (03CR) 10Filippo Giunchedi: [C: 031] "moved to less than 2^31 -1 as suggested on the ticket" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ) [13:41:51] !log powercycle ms-be2015 [13:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:17] 6operations, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1993811 (10JAllemandou) @ottomata: It indeed seems that message keys are actually input to the partitioner :) [13:45:05] (03PS1) 10Jcrespo: Reconfigure db1054 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268105 [13:45:18] (03PS2) 10ArielGlenn: git deploy cleanup to toss minion from redis [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 [13:46:04] (03CR) 10BBlack: [C: 031] Raise file upload limit to 2047 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ) [13:47:03] (03CR) 10Jcrespo: [C: 032] Reconfigure db1054 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268105 (owner: 10Jcrespo) [13:48:03] 6operations, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1993827 (10JAllemandou) Having 'per-schema' topics might also impact. [13:55:10] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1993842 (10fgiunchedi) indeed I'm seeing this for a long time now ``` Scanning for devices. Please wait, this may take several minutes... ``` @papaul could you check the hardware/controller for iss... 
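[editor's note] On the 13:40:05 review comment ("moved to less than 2^31 -1"): a quick arithmetic check of why 2047 MB fits under that bound while 2048 MB would not. This assumes "MB" in the patch title means 2**20 bytes, which the log does not state; the variable names are illustrative only.

```python
MIB = 1024 * 1024        # assumption: 1 "MB" = 2**20 bytes
INT32_MAX = 2**31 - 1    # largest signed 32-bit integer

limit_2047 = 2047 * MIB
limit_2048 = 2048 * MIB

# 2047 MB stays below the signed 32-bit maximum; 2048 MB exceeds it by one byte.
print(limit_2047, limit_2047 < INT32_MAX)
print(limit_2048, limit_2048 - INT32_MAX)
```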
[13:56:59] (03PS3) 10Filippo Giunchedi: Adding user gehel (Guillaume Lederrey) to user list [puppet] - 10https://gerrit.wikimedia.org/r/267919 (https://phabricator.wikimedia.org/T125651) (owner: 10Gehel) [13:57:42] gehel: hi! FYI I've amended the commit message on the above to reference the ticket so the two are linked [13:58:03] godog: yep, just saw that. Thanks ! [13:58:23] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1993851 (10jcrespo) bblack resolved recently an issue by doing a hard reset on the management interface. Probably completely unrelated, but worth a try. [14:00:03] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1993865 (10BBlack) Well then apparently the 10/s edits to all projects number I found before is complete bun... [14:02:33] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1993869 (10BBlack) So, current thinking is that at least one of (maybe two of?) the bumps are from moving wh... [14:09:06] 7Puppet, 6operations, 10Salt, 5Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1993885 (10ArielGlenn) I've got that going now, have a look at the code; you can ask for status of a key, accept or delete, and you g... 
[14:10:54] !log rebooting graphite1001 for kernel update [14:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:32] (03PS1) 10Jcrespo: Depool db1060, repool db1054, repool db1067 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268115 [14:28:39] !log investigating uwsgi processes for graphite-web not coming up after reboot [14:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:51] (03PS7) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [14:46:12] (03PS3) 10ArielGlenn: git deploy cleanup to toss minion from redis [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 [14:46:43] hashar: yes sadly we have two sets of hardware in the elasticsearch cluster, when the old hardware maxes out the new hardware is at ~30% capacity...but elasticsearch has no way to distribute the load more evenly :( those old machines are being replaced in the next few months though [14:47:05] (03CR) 10ArielGlenn: "tested on local salt/trebuchet cluster and wfm, anyone want to play with it in deployment-prep?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 (owner: 10ArielGlenn) [14:49:04] ebernhardson: ah thank you for the follow up. And I guess we can shard the load by wiki ? [14:49:25] hashar: only sortof, the shard distribution algorithm is randomized [14:49:53] ebernhardson: and can't be weighted I guess :D such a shame [14:49:55] hashar: it doesn't take into account hardware or really anything other than the # of shards on each box. 
manybubbles started up a patch to fix that, but it was rejected [14:50:14] hashar: even if they could be weighted, our main load generater is enwiki and it's shards are on 28 out of 31 boxes [14:50:25] !log rebooting cp1008 for kernel [14:50:27] would be nice to weight things on the query end at least .. but alas :( [14:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:50] ebernhardson: though if you had enwiki shards solely on new hardware that would ease the task of the old ones? [14:50:50] (03PS2) 10Jcrespo: Repool db1054 with low weight, repool db1067 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268115 [14:52:27] (03CR) 10Jcrespo: [C: 032] Repool db1054 with low weight, repool db1067 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268115 (owner: 10Jcrespo) [14:52:43] ebernhardson: I am asking out of curiosity. At least the warning is known ;} [14:53:04] hashar: its sadly known :( as to putting the load on only the new boxes, I suppose i hadn't actually considered that [14:53:30] it's possible, would probably take some munging around to make disk space work out [14:54:26] (03PS8) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [14:55:58] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool db1054 with low weight, repool db1067 with original weight (duration: 01m 22s) [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:15] (03CR) 10Krinkle: "@Tim: Replied on Patch Set 1." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [15:08:10] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1993992 (10Gehel) [15:09:47] 7Blocked-on-Operations, 6operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1994024 (10ArielGlenn) Who can I coordinate on this for testing, now that we are on the nice ha... [15:10:35] (03PS1) 10Jcrespo: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268122 [15:11:19] !log depooling cp1060 temporarily from cache_mobile varnish backends [15:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:59] 6operations, 10Salt: take steps outlined at techops offiste to (try to) address salt reliability - https://phabricator.wikimedia.org/T115292#1994052 (10ArielGlenn) 5Open>3Resolved Huh, well it looks like salt is reliable after just moving to its own server with some patches to the master and minion. So wh... [15:15:07] !log rebooting cp1060 (depooled/downtimed) [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:09] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1994057 (10EBernhardson) also: analytics-search-users - for access to deploy and debug search jobs on the analytics cluster Ma... [15:19:00] 6operations, 10Salt: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1994059 (10ArielGlenn) Here's an old comment (I wonder when I wrote this as a draft): I'm still working on this. 
So far: converted many instances to use the regular name instead of the ec2id name; andrew has convert... [15:21:29] !log mira: cloning 1.27.0-wmf.12 (no link updates) [15:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:55] (03CR) 10Jcrespo: [C: 032] Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268122 (owner: 10Jcrespo) [15:22:52] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1994066 (10ssastry) >>! In T125435#1993024, @jcrespo wrote: > Let me create a new user. And give me your ruthenium login. I will setup mysql-client i... [15:23:49] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1060 (duration: 00m 43s) [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:30] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1994072 (10Gehel) [15:25:59] "file has vanished: "/php-1.27.0-wmf.12/.git/modules/extensions/CentralAuth/objects/pack/tmp_pack_3VfhZu" (in common)" [15:26:20] just read hashar's log [15:26:34] should I retry? [15:26:37] doh [15:26:50] why does a sync-file sync everything anyway;..... 
[15:26:55] (mw updated ok) [15:27:04] it was the masters [15:27:05] or maybe that is the sync of mira staging to tin [15:27:11] yes [15:27:14] I am doing the massive git clone on mira [15:27:20] so yeah [15:27:22] would make sense [15:27:27] ok, I wait [15:27:34] then I resync again to be sure [15:28:01] not resync [15:28:05] it is just rsync that build the list and complains some files disappeared in between [15:28:12] the next sync-file will catch it up [15:28:24] in progress still [15:28:24] I will have to do it anyway for the repool [15:28:30] so I can wait [15:28:50] I am not convince the way we keep in the two co master is a good solution [15:28:56] seems racy [15:29:08] and this is why I have to avoid mediawiki and embrace conftool [15:29:13] it is at extensions/WikimediaMessages [15:29:17] soon(TM) [15:29:26] doing skins [15:30:23] !log MediaWiki 1.27.0-wmf.12, from 1.27.0-wmf.12, successfully checked out. [15:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:35] godog, _joe_: I got my unaffiliated IRC cloak. I remember that the next step was to ask access to a few private channels ? [15:31:20] jynus: clone completed [15:32:24] thanks, I will wait for the repool, I am confident that mediawikis are ok [15:32:35] (03PS1) 10Filippo Giunchedi: uwsgi: create /run/uwsgi, including at boot [puppet] - 10https://gerrit.wikimedia.org/r/268126 [15:32:55] !log restart and reconfigure mysql in db1060 [15:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:51] (03CR) 10jenkins-bot: [V: 04-1] uwsgi: create /run/uwsgi, including at boot [puppet] - 10https://gerrit.wikimedia.org/r/268126 (owner: 10Filippo Giunchedi) [15:35:57] godog: thx for the invite ! [15:36:01] (03PS1) 10Alexandros Kosiaris: conftool: Remove tilerator from conftool data [puppet] - 10https://gerrit.wikimedia.org/r/268127 [15:36:32] stupid scripts [15:37:20] PLEASE DO NOT SCAP / DEPLOY [15:37:20] jynus: ! 
[15:37:24] the stupid script failed [15:37:32] Updating current branch pointer... [15:37:45] add the flag, as we were told [15:37:54] that will prevent accidents [15:39:22] sync.flag [15:39:49] (but do nor worry, I will not sync until told otherwise) [15:41:05] (03PS1) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [15:41:05] should be fixed [15:41:50] !log mira symlink pointing to current version got changed to wmf.11 by the checkoutMediaWiki script. Manually changed to proper wmf.10 https://phabricator.wikimedia.org/T125475#1994078 [15:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:52] (03PS2) 10Filippo Giunchedi: uwsgi: create /run/uwsgi, including at boot [puppet] - 10https://gerrit.wikimedia.org/r/268126 [15:43:17] (03CR) 10jenkins-bot: [V: 04-1] send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [15:43:21] jynus: should be good now :} [15:43:45] if at any time there is any issue, just write a message here [15:43:51] yeah [15:44:03] I will stop the preparation work now [15:44:10] there is a swat starting in a few [15:44:31] yeah, I usually avoid duing anything at those hours [15:44:50] phuedx: I am doing some preparation work for the train later. So far just cloned the repositories [15:45:19] hashar: cool :) [15:45:20] phuedx: and I am stopping now, thank you to have pointed about the swat [15:45:40] that's what i'm here for [15:45:56] I should fix checkoutMediaWiki to clone from the previous branch to save time. [15:46:07] * ostriches continues to lurk and wake up [15:46:31] quick q: what does a deployer need to do for beta-cluster-only config changes? [15:46:40] do they need to be sync'd or does the beta cluster pick them up? 
[15:46:50] bribe Jenkins [15:46:56] they need to be merged in [15:47:00] THEN NOTHING WOULD EVER BE DEPLOYED [15:47:14] does jenkins actually ever do anything without being bribed? [15:47:51] and the long story is: Gerrit emits an event "change-merged" which is consumed by Zuul which in turns trigger a Gearman function which is associated to the beta-mediawiki-config-update-eqiad job [15:48:09] the job runs, report back to gearman -> zuul -> gerrit [15:48:19] AND trigger a run of scap on beta [15:48:33] so in short: get it merged on operations/mediawiki-config and it will be magically deployed on beta [15:49:10] (03PS2) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [15:49:31] good for beta :) [15:49:34] hashar: does the deployment host need to be updated too? [15:49:43] so no one sees a commit that they weren't expecting [15:50:13] 6operations, 7Monitoring: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#1994105 (10fgiunchedi) on second thought we should be fine with collectors that send counters too, what happens is that graphite will store the number as is and there shouldn't be a change in sem... [15:51:25] long story very short: i'll get https://gerrit.wikimedia.org/r/#/c/267812/ merged at the end of the swat window [15:51:35] (beta-only config change) [15:53:26] (03PS1) 10Jcrespo: Reconfigure db1060 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268135 [15:53:35] 7Blocked-on-Operations, 6operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1994112 (10ssastry) We have a deploy today. So I can give this a spin. 
[15:53:44] (03CR) 10Alexandros Kosiaris: [C: 032] conftool: Remove tilerator from conftool data [puppet] - 10https://gerrit.wikimedia.org/r/268127 (owner: 10Alexandros Kosiaris) [15:54:37] (03PS2) 10Jcrespo: Reconfigure db1060 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268135 [15:55:05] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [15:55:27] phuedx: even if it is a beta only change, you would want to pull it on the prod deployment server so people are not confused :D [15:55:37] ta [15:56:10] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1994115 (10mark) Are we able to use all backends across both eqiad/codfw simultaneously? Perhaps 2x 3 backend machines would be reasonable in that case? [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160203T1600). [16:00:04] Kelson bd808 bmansurov bblack phuedx aude dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:09] 6operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#1994119 (10Papaul) p:5Triage>3Normal [16:00:12] (03CR) 10Jcrespo: [C: 032] Reconfigure db1060 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/268135 (owner: 10Jcrespo) [16:00:34] * aude waves [16:00:35] o/ [16:00:56] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [16:00:59] 6operations, 7Monitoring: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#1994121 (10fgiunchedi) I've temporarily switched diamond to graphite on `filippo-test-trusty.eqiad.wmflabs` to see the net effect it has (if any) on the metrics [16:02:33] o/ [16:03:17] I can SWAT today. bd808 or bma around? 
[16:03:24] yup [16:03:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267807 (owner: 10Jdlrobson) [16:03:45] thcipriani: I just snuck in a second beta cluster only config change too [16:03:58] bd808: HEY HEY [16:04:01] bd808: kk, np. [16:04:21] Hi. [16:04:24] thcipriani: kelson asks if you could consider trivial enough the patch to merge it without its presence or see yourself it it looks good [16:04:30] (03Merged) 10jenkins-bot: Just use the default MobileFrontend specified page actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267807 (owner: 10Jdlrobson) [16:04:57] thcipriani: ^ i'm on that [16:05:33] phuedx: on https://gerrit.wikimedia.org/r/#/q/262893,n,z you mean? [16:05:38] awesome. I don't even have to pretend I now how to test it [16:06:13] thcipriani: i mean 267807, i just saw the merge [16:06:24] bd808: warnings shouldn't be showing up in the logs [16:06:25] sec [16:06:30] thcipriani: phuedx knows the mobile fe stuff better than I do. I was just a proxy to keep things moving forward [16:06:39] bd808: Warning: array_diff() expects parameter 1 to be an array or collection in /srv/mediawiki/wmf-config/mobile.php on line 95 [16:06:46] we're seeing a whole bunch of those apparently [16:06:47] phuedx: ah, gotcha :) [16:07:05] phuedx: yeah that's what it will fix once it hits beta cluster [16:07:34] (03PS8) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [16:07:39] bd808: so do you just merge beta-only changes and let the deployer know? 
(genuine question, no snark, i'm unsure of the etiquette) [16:07:44] phuedx: we're waiting on this jenkins job I think -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ [16:08:12] phuedx: no, they need to be handled just like prod changes (let the deployer merge) [16:08:22] (03PS1) 10Subramanya Sastry: ruthenium services: tweaks based on changes to Parsoid & testreduce [puppet] - 10https://gerrit.wikimedia.org/r/268141 [16:08:24] ah [16:08:26] ta [16:08:36] but they are easier to get clearance for on a normal day [16:08:55] 6operations, 10RESTBase, 5Patch-For-Review: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1994135 (10Eevans) If there are no objections, I'll move forward with applying this in production (https://gerrit.wikimedia.org/r/#/c/266297/) first thi... [16:09:20] (03CR) 10Subramanya Sastry: [C: 04-1] "Will clear my -1 once https://gerrit.wikimedia.org/r/#/c/267786/ is merged." [puppet] - 10https://gerrit.wikimedia.org/r/268141 (owner: 10Subramanya Sastry) [16:09:53] jenkins is starting the beta cluster scap [16:09:55] RECOVERY - Disk space on analytics1027 is OK: DISK OK [16:10:11] _joe_, ema: I don't suppose a DNS change could be squeezed into puppet swat, could it? [16:10:47] <_joe_> Krenair: it depends :) [16:10:59] https://gerrit.wikimedia.org/r/#/c/267886/ is what I had in mind [16:11:00] blerg...sync-masters sure is hanging a while... [16:11:02] 6operations, 6Analytics-Kanban, 10hardware-requests, 5Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1994142 (10Eevans) >>! In T124947#1991215, @Ottomata wrote: > Hold on this, it seems will be replacing the aqs1xxx nodes since they are out of warranty. Moar memory... [16:11:14] thcipriani: hmmm [16:11:21] well there it goes [16:11:35] !log thcipriani@mira Synchronized wmf-config/mobile.php: SWAT: Just use the default MobileFrontend specified page actions. 
Part I [[gerrit:267807]] (duration: 02m 14s) [16:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:51] yuck. back to 2m for master sync? [16:12:12] evidently :( [16:12:12] (03PS1) 10Ottomata: Remove webrequest_mobile from list of webrequest topics that camus imports [puppet] - 10https://gerrit.wikimedia.org/r/268142 (https://phabricator.wikimedia.org/T122651) [16:12:22] <_joe_> Krenair: simple enough to go in PSwat [16:13:00] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Just use the default MobileFrontend specified page actions. Part II [[gerrit:267807]] (duration: 01m 18s) [16:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:06] ^ phuedx bd808 sync'd [16:13:17] (03CR) 10Ottomata: [C: 032 V: 032] Remove webrequest_mobile from list of webrequest topics that camus imports [puppet] - 10https://gerrit.wikimedia.org/r/268142 (https://phabricator.wikimedia.org/T122651) (owner: 10Ottomata) [16:13:49] (03CR) 10Subramanya Sastry: "https://github.com/wikimedia/mediawiki-services-parsoid-testreduce/commit/77c1951c77891309244edb889353029357c23859 is the change in the te" [puppet] - 10https://gerrit.wikimedia.org/r/268141 (owner: 10Subramanya Sastry) [16:13:57] jenkins is still chewing on the beta cluster sync that will tell us if the undef var problem is gone for master/.12 [16:14:07] Dereckson: I'd like to get https://gerrit.wikimedia.org/r/#/c/262893/ out the door, I know Kelson has been waiting a while, could you check that patch after I do the needful on that patch? [16:15:12] Okay. [16:15:20] Dereckson: thank you [16:15:21] phuedx: do you know where to look on en.m to ensure that upload buttons aren't popping up with that config change? 
[16:15:32] bd808: i'm on it [16:15:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [16:16:13] bd808: upload isn't appearing in the mobile menu on en.m (as i'd expect!) [16:16:21] (03PS1) 10Ottomata: Remove old oozie logs in /var/log/oozie after they are rotated [puppet] - 10https://gerrit.wikimedia.org/r/268143 [16:16:23] sweet. [16:16:23] (03Merged) 10jenkins-bot: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [16:17:36] ok, menu items are as expected on stable/beta [16:17:38] (03PS3) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935 [16:17:44] bd808, thcipriani ^ [16:17:46] thanks y'all [16:17:55] phuedx: thank you for checking :) [16:17:57] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/268126 (owner: 10Filippo Giunchedi) [16:18:24] (03PS3) 10Tim Landscheidt: puppetmaster: Fix git-sync-upstream for unclean rebases [puppet] - 10https://gerrit.wikimedia.org/r/264692 [16:18:31] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Add 2 sites to $wgCopyUploadsDomains [[gerrit:262893]] (duration: 01m 18s) [16:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:42] ^ Dereckson check please [16:19:14] Testing. [16:19:37] 6operations, 10Traffic, 5Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1994184 (10ema) Varnish 4 has [[https://github.com/lkarsten/libvmod-ipcast/blob/master/README.rst | std.ip() as a built-in function]] . We probably can skip porting libvmod-ipcas... [16:20:28] oh, I am here for SWAT, just haven't been watching this channel! 
[16:20:43] bblack: :) [16:20:46] thcipriani: tested [16:20:47] I was just about to ping you [16:20:51] Dereckson: thank you! [16:20:52] works fine [16:21:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [16:21:21] bd808: also: https://logstash-beta.wmflabs.org/#/dashboard/temp/AVKn5gA5IJ_1wf3JTz2a [16:21:40] Dereckson: I appreciate you taking the time. That patch has been shuffled around a few times :\ [16:21:59] (03Merged) 10jenkins-bot: MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [16:22:25] You're welcome. [16:22:40] committing back home [16:22:44] (03PS2) 10Tim Landscheidt: shinken: Only regenerate configuration when there are changes [puppet] - 10https://gerrit.wikimedia.org/r/267423 [16:23:03] phuedx: looks like ours is gone, but there is a wikibase one still trending on the jobrunners [16:23:12] hrrrm [16:23:25] bblack: syncing now [16:23:33] aude: Can you look into "Warning: in_array() expects parameter 2 to be an array or collection in /srv/mediawiki/wmf-config/Wikibase.php on line 120" in beta cluster? [16:23:42] looking [16:23:50] 6operations, 10Traffic, 5Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1994206 (10BBlack) @ema +1 [16:24:30] !log thcipriani@mira Synchronized wmf-config/CommonSettings.php: SWAT: MW parsoid URLs: s/parsoidcache/parsoid/ [[gerrit:267234]] (duration: 01m 18s) [16:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:34] aude: Oh. $wgMFQueryPropModules comes from extension.json now. Not visible in that scope [16:24:40] phuedx: ^ [16:24:45] bblack: sync'd! check please. 
[16:25:14] so we need to rearrange things so that config isn't done until after the json files are loaded [16:25:38] * aude is somewhat confused what's been deployed [16:26:03] aude: this change will be in .13 on group0 & group1 today [16:26:11] so we just need to get it fixed for the train [16:26:13] thcipriani: check will take a little while, but risk is low [16:26:18] the error is on beta? [16:26:20] yes [16:26:27] ok makes sense [16:26:32] bblack: ack, thanks. [16:26:36] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:27:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220) (owner: 10Bmansurov) [16:27:52] (03Merged) 10jenkins-bot: Remove section collapsing config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220) (owner: 10Bmansurov) [16:28:37] !log OTRS migration to 4.0 completed, starting upgrade to 5.0 [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:23] Weird spike in error logs briefly: User::loadFromDatabase x.x.x.x 1205 Lock wait timeout exceeded [16:32:10] !log thcipriani@mira Synchronized wmf-config/mobile.php: SWAT: Remove section collapsing config [[gerrit:267776]] (duration: 01m 18s) [16:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:15] ^ phuedx check please [16:33:28] thcipriani: should be a nop -- en.m looks clean with no js errors [16:33:39] are the error logs clean? (which tools should i use?) 
[16:34:24] phuedx: I mainly use https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors and https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [16:34:31] ah [16:34:35] i didn't know about the former [16:34:58] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1994249 (10Papaul) @fgiunchedi I have the same message "error: unknow filesystem" [16:35:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 (owner: 10Aude) [16:35:28] thanks thcipriani [16:35:36] phuedx: thank you! [16:36:02] (03Merged) 10jenkins-bot: Enable math data type on test wikidata + test wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268086 (owner: 10Aude) [16:36:11] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1994252 (10BBlack) heh so: T113192 -> https://gerrit.wikimedia.org/r/#/c/258365/5 is probably the Jan 20 bump. [16:38:46] thcipriani: checked reasonably, but hard to get a solid confirmation other than "nobody is screaming"... [16:39:14] bblack: :D sounds good. Thank you. 
[16:39:38] (03CR) 10Ottomata: [C: 032] Remove old oozie logs in /var/log/oozie after they are rotated [puppet] - 10https://gerrit.wikimedia.org/r/268143 (owner: 10Ottomata) [16:39:45] !log thcipriani@mira Synchronized wmf-config: SWAT: Enable math data type on test wikidata + test wikipedias [[gerrit:268086]] (duration: 01m 18s) [16:39:48] ^ aude check please [16:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:05] checking [16:40:16] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#1994278 (10Krinkle) [16:40:26] (03PS1) 10BryanDavis: Wikibase: Defer modifying wgMFQueryPropModules until wfLoadExtensions runs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 [16:41:12] thcipriani: looks ok enough [16:41:16] (03CR) 10Mobrovac: "@Filippo, we chose to go with a different statsd library that had batching support out of the box. It's been running in production for wee" [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [16:41:20] aude: thank you [16:41:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447 (owner: 10Aude) [16:41:29] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 7Monitoring: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#1994282 (10Addshore) It should be noted we just had an partial outage for 6/7 hours without us noticed ;) wdqs1002 seemed to totally d... 
[16:41:41] (03CR) 10GWicke: [C: 031] RESTBase: enable metrics batching [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [16:42:13] aude: I *think* https://gerrit.wikimedia.org/r/#/c/268147 will fix the HHVM warning we are seeing in beta cluster that I expect to show up in prod today with .12 as well [16:42:23] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1994284 (10Krinkle) [16:42:25] phuedx: ^ [16:42:36] thanks bd808 [16:43:16] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1962060 (10Krinkle) Might be a duplicate of {T114398}. [16:43:22] (03Merged) 10jenkins-bot: Remove unused/no longer existing item-create oauth grant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447 (owner: 10Aude) [16:44:09] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1994305 (10BBlack) [16:44:36] (03PS2) 10BBlack: eqiad: remove last cache_mobile frontend [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) [16:45:33] (03CR) 10Phuedx: [C: 031] Wikibase: Defer modifying wgMFQueryPropModules until wfLoadExtensions runs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (owner: 10BryanDavis) [16:45:47] (03CR) 10BryanDavis: "This will be needed to go with 1.27.0-wmf.12 which is scheduled to roll to group0 and group1 today. 
Somebody needs to double check that I'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (owner: 10BryanDavis) [16:45:51] !log thcipriani@mira Synchronized wmf-config/CommonSettings.php: SWAT: Remove unused/no longer existing item-create oauth grant [[gerrit:265447]] (duration: 01m 18s) [16:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:14] ^ aude guessing it should be a no-op, but if you can check, check please [16:46:23] blerg. Just left :( [16:47:08] aude: guessing it should be a no-op, but if you can check, check please https://gerrit.wikimedia.org/r/#/c/265447/ [16:47:25] ok [16:47:29] 7Blocked-on-Operations, 6operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1994310 (10ArielGlenn) Great, let me know if you see this or any other errors so I can track th... [16:47:48] (03CR) 10BBlack: [C: 032] eqiad: remove last cache_mobile frontend [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) (owner: 10BBlack) [16:47:50] i still see it on https://meta.wikimedia.org/wiki/Special:OAuth/grants but might be caching [16:48:18] or it's set in yet another place now [16:48:58] yep CommonSettings.php:$wgGrantPermissions['createeditmovepage']['item-create'] = true; [16:49:16] can make another patch later once the authentication stuff is sorted out [16:50:19] aude: kk, just confirmed that the patch did in fact sync out, FWIW [16:50:24] dcausse: ping for SWAT [16:50:30] thcipriani: hi! [16:50:32] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1994318 (10greg) a:3demon Other than the incident report, we are now "back to normal" and deploying SWAT deploys and plan to do the train (1.27-wmf.12... 
[16:50:42] thcipriani: item-create was/is set in 2 places [16:50:57] second place probably added since i made my patch [16:51:02] but it's ok for now [16:51:21] aude: ok. thanks for checking. [16:51:30] "This can be removed after 1.27.0-wmf.11 is everywhere" [16:51:45] applies to the whole set of wgMWOAuthGrantPermissions [16:51:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268097 (owner: 10DCausse) [16:51:50] so think we are ok [16:52:24] (03CR) 10Mobrovac: [C: 031] RESTBase and Labs DNS configuration for ady.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [16:52:31] (03Merged) 10jenkins-bot: Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268097 (owner: 10DCausse) [16:55:49] !log thcipriani@mira Synchronized wmf-config/CirrusSearch-production.php: SWAT: Return more like search queries to codfw [[gerrit:268097]] (duration: 01m 17s) [16:55:49] 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1994342 (10BBlack) Just triggered another one directly and immediately. The sequence in this case is: 1. Merged https://gerrit.wikimedia.org/r/#/c/267230/ (removes 1/9 servers from dc=eqiad,cluster=cach... [16:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:07] dcausse: patch sync'd check please [16:56:23] thcipriani: looks good I can see queries in codfw, thanks! [16:56:30] dcausse: awesome. Thanks! 
[16:57:29] (03PS1) 10Papaul: Remove caesium from DNS Bug:T125165 [dns] - 10https://gerrit.wikimedia.org/r/268149 (https://phabricator.wikimedia.org/T125165) [16:57:34] !log mira: updating /srv/mediawiki-staging/php-1.27.0-wmf.12 (prep deployment train) [16:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:04] _joe_ ema: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160203T1700). Please do the needful. [17:00:04] mobrovac mdholloway bearND mobrovac Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:01:40] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1994378 (10Krinkle) [17:02:00] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1966220 (10Krinkle) [17:02:09] !log restarting pybal on lvs1003 T125397 [17:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:18] ugh [17:02:23] !log restarting pybal on lvs1004 (not 1003!) 
T125397 [17:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:35] _joe_: here they are - mdholloway deployed the code just now [17:03:38] ema: ^^ [17:03:42] !log mobileapps deployed 68e38ec [17:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:53] <_joe_> ok [17:05:00] (03PS2) 10Ema: MobileApps: Change RESTBase URI [puppet] - 10https://gerrit.wikimedia.org/r/267392 (https://phabricator.wikimedia.org/T125252) (owner: 10Mobrovac) [17:05:10] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1994403 (10mark) Let's do this - Approved. (I'll get to the other tasks tomorrow as soon as I get out of budget frenzy ;) [17:05:30] (03CR) 10Ema: [C: 032 V: 032] MobileApps: Change RESTBase URI [puppet] - 10https://gerrit.wikimedia.org/r/267392 (https://phabricator.wikimedia.org/T125252) (owner: 10Mobrovac) [17:06:37] (03CR) 10Krinkle: [C: 04-1] "-1 pending Aaron's review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [17:07:22] (03PS1) 10Elukey: Add subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268156 (https://phabricator.wikimedia.org/T125166) [17:09:11] (03CR) 10jenkins-bot: [V: 04-1] Add subbu to a temporary parsoid-root group. 
Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268156 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:09:23] ema: _joe_: just run puppet on one of the scb100x hosts and restart, so we can test [17:09:49] mobrovac: running puppet on scb1001 [17:09:53] kk [17:10:09] (03CR) 10Dzahn: [C: 031] "+1 for intention and technical part, except nitpick i would say "parsoid-test-roots" because all other groups end in -admin(s) or -roots" [puppet] - 10https://gerrit.wikimedia.org/r/268156 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:10:29] (03CR) 10Jcrespo: "> as masters as that creates a splitbrain scenario" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [17:10:58] (03CR) 10Jcrespo: "BTW, this patch is not yet ready for review, I have to fix a couple of things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [17:11:21] ema: ok, the change is in, have you restarted mobileapps there perhaps? [17:11:37] <_joe_> mobrovac: scb1001 has the new version, can you tell us if everything is ok there? [17:11:47] <_joe_> mobrovac: puppet does [17:11:54] <_joe_> config change => service restart [17:11:56] (03PS4) 10BryanDavis: git deploy cleanup to toss minion from redis [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 (https://phabricator.wikimedia.org/T74319) (owner: 10ArielGlenn) [17:12:04] mobrovac: it looks like it restarted already [17:12:11] _joe_: touche', testing now [17:12:13] ema: ^^ [17:13:02] ema: _joe_: ok, life's good again [17:13:13] we can proceed to scb1002 [17:13:15] <_joe_> mobrovac: going on scb1002 [17:13:55] godog: still around? [17:14:17] mobrovac: yup I'm here [17:14:26] * mobrovac testing on scb1002 [17:14:35] <_joe_> mobrovac: it's restarted now [17:15:05] _joe_: ema: tested, works! thnx [17:15:12] <_joe_> ok, next! [17:15:16] (03PS1) 10Elukey: Add subbu to a temporary parsoid-root group. 
Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268163 (https://phabricator.wikimedia.org/T125166) [17:15:41] godog: i'd like you here for the next patch (rb batching) [17:15:46] _joe_: next one is mobrovac again :) [17:15:50] (03CR) 10BryanDavis: "Not tested but the code reads well. Thanks for working on this Ariel." (031 comment) [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 (https://phabricator.wikimedia.org/T74319) (owner: 10ArielGlenn) [17:16:07] mobrovac: yep, let me +1 too [17:16:14] <_joe_> godog: thanks [17:16:21] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [17:16:35] <_joe_> ema: this is restbase, so a config change will /not/ restart the service [17:16:47] <_joe_> (this can all be figured from puppet btw) [17:17:00] (03CR) 10jenkins-bot: [V: 04-1] Add subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268163 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:17:18] <_joe_> ema: so first we puppet-merge, then we can do a small salt trick to make puppet + restarts go [17:17:34] (03PS2) 10Ema: RESTBase: enable metrics batching [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [17:18:19] (03CR) 10Ema: [C: 032 V: 032] RESTBase: enable metrics batching [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [17:18:20] _joe_: godog: ema: i suggest to apply the patch and restart only one node, then wait 5 mins or so so that godog can verify the packets on the receiving end are ok [17:18:36] <_joe_> mobrovac: that was the idea, yes [17:19:20] _joe_, mobrovac: puppet-merged [17:19:51] ok i'll apply in on rb1001 [17:19:56] <_joe_> mobrovac: hold on [17:19:59] <_joe_> can I apply it? 
[17:20:01] sure [17:20:07] i was about to press enter [17:20:08] :) [17:20:09] (03CR) 10Dzahn: [C: 04-1] "+1 for the new group name, the user name is not "subbu" though, it's ssastry. that's why jenkins fails with this:" [puppet] - 10https://gerrit.wikimedia.org/r/268163 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:20:16] saved by the bell [17:21:53] <_joe_> mobrovac: done [17:22:00] <_joe_> btw for cases like the mobileapps one [17:22:07] ok, i'll restart [17:22:09] (03PS1) 10Elukey: Add Subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268167 (https://phabricator.wikimedia.org/T125166) [17:22:09] <_joe_> we have scripts named mobileapps-deploy [17:22:19] <_joe_> which you can run and will do everything for you [17:22:23] (03PS1) 10Alexandros Kosiaris: otrs: Collapse all VirtualHosts into one [puppet] - 10https://gerrit.wikimedia.org/r/268168 [17:22:29] !log restbase restarting rb1001 [17:22:30] <_joe_> deploy code, run puppet, restart the service, check the health [17:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:41] <_joe_> mobrovac: I already restarted it... [17:22:45] (03PS2) 10Alexandros Kosiaris: otrs: Collapse all VirtualHosts into one [puppet] - 10https://gerrit.wikimedia.org/r/268168 [17:22:48] _joe_: yup, that's super nice, the problem is i'm not root on scb100x [17:22:52] _joe_: ah ok [17:22:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Collapse all VirtualHosts into one [puppet] - 10https://gerrit.wikimedia.org/r/268168 (owner: 10Alexandros Kosiaris) [17:22:59] <_joe_> mobrovac: oh well.... [17:23:14] (03CR) 10Dzahn: [C: 031] Add Subbu to a temporary parsoid-root group. 
Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268167 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:23:35] mobrovac: metrics look batched to me now, though it looks like the statsd client is clamping partial metrics once the mtu is met [17:23:54] hm [17:25:00] <_joe_> godog: meeh [17:25:07] <_joe_> mobrovac: should we revert? [17:25:44] the impact is likely minimal btw, it isn't always happening but not my call [17:26:29] _joe_: godog: let's revert, losing metrics is not what cool kids do [17:26:41] <_joe_> cool kids make metrics up [17:26:46] haha [17:26:47] (03PS1) 10Jcrespo: Repool db1060 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268170 [17:26:56] <_joe_> ok fine by me. [17:26:59] i'll prepare a revert godog _joe_ [17:27:05] Sorry for the gerrit spam, I am not able to use git [17:27:08] :) [17:27:12] (03CR) 10Dzahn: [C: 04-2] "use https://gerrit.wikimedia.org/r/#/c/268167/ instead" [puppet] - 10https://gerrit.wikimedia.org/r/268156 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:27:19] <_joe_> elukey: np [17:27:41] (03CR) 10Dzahn: [C: 04-2] "superseeded by https://gerrit.wikimedia.org/r/#/c/268167/" [puppet] - 10https://gerrit.wikimedia.org/r/268163 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:28:21] (03PS1) 10Mobrovac: Revert "RESTBase: enable metrics batching" [puppet] - 10https://gerrit.wikimedia.org/r/268171 [17:28:31] _joe_: ema: ^^ [17:28:31] (03Abandoned) 10Elukey: Add subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268156 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:28:46] (03Abandoned) 10Elukey: Add subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268163 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:28:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "It seemed to be losing metrics, even if marginally." 
[puppet] - 10https://gerrit.wikimedia.org/r/268171 (owner: 10Mobrovac) [17:29:47] thnx _joe_, i'll run puppet [17:29:54] hashar, ok to deploy a quick repool, right? [17:29:57] (03PS2) 10Dzahn: Add Subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268167 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:29:59] <_joe_> mobrovac: ok [17:30:04] jynus: I guess [17:30:08] <_joe_> just rb1001 needs restarting [17:30:13] yup [17:30:24] <_joe_> Krenair: you're up next I guess [17:30:26] I am disconnecting for a couple hours for dinner / kids etc... [17:30:32] _joe_: ema: note that Krenair's patch contains an rb config change [17:30:33] (03CR) 10Jcrespo: [C: 032] Repool db1060 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268170 (owner: 10Jcrespo) [17:30:51] so wait until i do the puppet run and restart of rb1001 [17:31:00] <_joe_> yeah that's the second one I guess [17:31:03] will yells loud here whenever I start the mess. 9pm CET / noon PST [17:32:44] (03CR) 10Elukey: [C: 032] Add Subbu to a temporary parsoid-root group. Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268167 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:32:56] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool db1060 after maintenance (duration: 01m 20s) [17:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:22] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1994541 (10fgiunchedi) a similar change has been merged today (and reverted) in https://gerrit.wikimedia.org/r/#/c/267917/ though the batching seems to truncate partial m... [17:33:25] (03CR) 10Elukey: [V: 032] Add Subbu to a temporary parsoid-root group. 
Bug: T125166 [puppet] - 10https://gerrit.wikimedia.org/r/268167 (https://phabricator.wikimedia.org/T125166) (owner: 10Elukey) [17:33:52] mobrovac: ^ I guess we can continue the discussion on that ticket [17:34:01] yup godog [17:36:49] (03PS4) 10Alexandros Kosiaris: otrs: OTRS search slow, increase between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/267867 (https://bugzilla.wikimedia.org/74109) [17:36:56] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: OTRS search slow, increase between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/267867 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [17:37:03] (03CR) 10Alexandros Kosiaris: [V: 032] otrs: OTRS search slow, increase between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/267867 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [17:37:27] (03CR) 10BBlack: [C: 031] Add ady language to DNS [dns] - 10https://gerrit.wikimedia.org/r/267886 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [17:37:36] (03PS4) 10Alexandros Kosiaris: misc-web: Route ticket.wikimedia.org to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267868 (https://bugzilla.wikimedia.org/74109) [17:37:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] misc-web: Route ticket.wikimedia.org to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267868 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [17:38:25] mutante: merged 8e7e19c as well [17:38:34] akosiaris: thank :) [17:38:40] akosiaris: we _just_ talked about that :) [17:38:40] thanks [17:38:55] sorry, not reading backlog right now [17:39:15] (03CR) 10Mobrovac: [C: 04-1] "Retracting my earlier +1. Why is DNS being updated for Labs, but the config change for RESTBase is for prod?" 
[puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [17:39:26] <_joe_> oh ok [17:39:32] oh [17:39:38] _joe_: ema: Krenair: just -1'ed https://gerrit.wikimedia.org/r/#/c/268016/ [17:39:38] <_joe_> ema: we should stop puppetswat here then :) [17:39:47] fair enough [17:39:50] <_joe_> Krenair: I'd tie your dns change to that patch? [17:39:55] <_joe_> or I can send it out now [17:41:06] (03PS2) 10Alexandros Kosiaris: ticket.wikimedia.org: Move over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/267872 (https://bugzilla.wikimedia.org/74109) [17:41:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ticket.wikimedia.org: Move over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/267872 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [17:41:29] <_joe_> Krenair: actually, there is an issue with the dns change as well [17:42:29] _joe_: just go ahead and +2 the DNS change like normal, and I'll go clean up in the wake of it [17:42:38] <_joe_> bblack: uhm ok [17:42:49] sorry, best answer I have! 
[17:43:18] assume it works, it should work, there's a bug that needs documenting, that I can work around for us for now [17:43:57] (03PS2) 10Giuseppe Lavagetto: Add ady language to DNS [dns] - 10https://gerrit.wikimedia.org/r/267886 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [17:44:03] so, as usual, things were not as easy as they looked [17:44:22] (03PS2) 10Alexandros Kosiaris: otrs: disable SessionCheckRemoteIP [puppet] - 10https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: 10Faidon Liambotis) [17:44:34] <_joe_> jynus: well actually, things were quite straightforward [17:44:39] (03PS3) 10Alexandros Kosiaris: otrs: disable SessionCheckRemoteIP [puppet] - 10https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: 10Faidon Liambotis) [17:44:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: disable SessionCheckRemoteIP [puppet] - 10https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: 10Faidon Liambotis) [17:45:33] (03CR) 10Giuseppe Lavagetto: [C: 032] Add ady language to DNS [dns] - 10https://gerrit.wikimedia.org/r/267886 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [17:46:11] <_joe_> bblack: merged [17:48:23] 6operations, 10Analytics, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1994605 (10Nuria) [17:50:17] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1994618 (10jcrespo) All slaves (except dbstore1001) have been prepared. [17:51:09] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1994622 (10jcrespo) Disk is at 2% right now. 
[17:51:18] (03CR) 10Faidon Liambotis: [C: 031] otrs: Route OTRS email to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267888 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [17:52:48] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: Route OTRS email to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267888 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [17:52:50] (03PS3) 10Alexandros Kosiaris: otrs: Route OTRS email to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267888 (https://phabricator.wikimedia.org/T74109) [17:52:52] (03CR) 10Alexandros Kosiaris: [V: 032] otrs: Route OTRS email to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267888 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [17:53:07] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1994652 (10elukey) As discussed with Ops, we don't want to block Subbu and I just made the change to add a new group (containing only Subbu) to ruthenium: https://gerrit.wikimedia.org/r/#/c/268167/ elu... [17:53:27] _joe_: ady change should be live in DNS now [17:57:17] akosiaris, I think we should leave the slave off for now, and only activate it tomorrow to discard regressions [17:57:28] jynus: agreed [17:59:15] (03CR) 10Alex Monk: "They're both changes that need making for the new site to be added... nothing wrong with them being one commit" [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [17:59:59] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1994679 (10BBlack) For now, one way to work around this is: Run authdns-update as normal, which takes care of some important st... 
[18:00:37] I will start the IO log now, however, because max_log_days is set to 2 [18:00:39] 6operations, 10Analytics, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1994684 (10Ottomata) p:5Normal>3Lowest [18:00:45] and that is risky [18:02:21] !log starting slave IO thread on db2010 [18:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:54] (that does not update the database, but prevents having to clone the entire server) [18:02:56] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1994689 (10mobrovac) Instead of going with our own patched client, we decided to go with [hot-shots](https://github.com/brightcove/hot-shots), which has out-of-the-box su... [18:04:31] !log previous announcement was for db2011, not db2010 [18:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:04] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1994730 (10Dzahn) a:3elukey @ssastry you got root ``` @ruthenium:~# id ssastry uid=2316(ssastry) gid=500(wikidev) groups=500(wikidev),702(parsoid-admin),772(parsoid-test-roots) @ruthenium:~# cat /et... 
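The metrics-batching issue raised at 17:23 and followed up in T121231 (partial metrics appearing once the MTU is reached) is the kind of thing that happens when a client fills a packet by bytes rather than by whole metric lines. A minimal sketch of line-aware batching, in Python for illustration only: it assumes newline-separated statsd lines, and the function name `batch_metrics` and the 1432-byte default are hypothetical choices here, not taken from hot-shots or the RESTBase config.

```python
def batch_metrics(lines, max_payload=1432):
    """Group statsd lines into newline-joined payloads without splitting a line."""
    payloads = []
    current = []  # lines collected for the payload being built
    size = 0      # byte size of "\n".join(current)
    for line in lines:
        encoded = len(line.encode("utf-8"))
        if encoded > max_payload:
            # A single oversized metric cannot be sent intact; fail loudly
            # rather than truncate it silently.
            raise ValueError("metric line exceeds max_payload: %r" % line)
        # One extra byte for the newline separator when the payload is non-empty.
        extra = encoded + (1 if current else 0)
        if current and size + extra > max_payload:
            # Flush before the limit is crossed, then start a new payload.
            payloads.append("\n".join(current))
            current, size = [line], encoded
        else:
            current.append(line)
            size += extra
    if current:
        payloads.append("\n".join(current))
    return payloads


sample = ["a.b:1|c", "c.d:2|ms", "e.f:3|g"]
packets = batch_metrics(sample, max_payload=16)
```

Each element of `packets` would go out as one UDP datagram; a receiver that splits on newlines then sees only complete metrics.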
[18:10:10] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1994732 (10Dzahn) 5Open>3Resolved [18:16:23] !log restarting pybal on lvs1001 [18:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:46] (03PS1) 10Aude: Put $wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) [18:23:07] (03PS2) 10Aude: Put $wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) [18:23:50] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1994829 (10RobH) [18:25:22] (03PS3) 10Andrew Bogott: openstack: fix typo, "spandby-server" for glance [puppet] - 10https://gerrit.wikimedia.org/r/266979 (owner: 10Dzahn) [18:26:46] (03PS2) 10BryanDavis: Wikibase: Defer modifying wgMFQueryPropModules until wfLoadExtensions runs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (https://phabricator.wikimedia.org/T125672) [18:27:53] (03CR) 10Andrew Bogott: [C: 032] openstack: fix typo, "spandby-server" for glance [puppet] - 10https://gerrit.wikimedia.org/r/266979 (owner: 10Dzahn) [18:29:43] (03CR) 10BryanDavis: "Aude has a different approach in Ic08d1f83031f3324040bd395d64d624c71c0b8b9 that also catches something I've missed here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (https://phabricator.wikimedia.org/T125672) (owner: 10BryanDavis) [18:30:45] (03PS3) 10BryanDavis: Wikibase: Defer modifying wgMFQueryPropModules until wfLoadExtensions runs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (https://phabricator.wikimedia.org/T125672) [18:30:50] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1994852 (10Pchelolo) Created a PR for `hot-shots` lib https://github.com/brightcove/hot-shots/pull/14 , and the maintainer of that package is really fast: the PR is alrea... [18:32:01] 6operations, 6Release-Engineering-Team, 3Scap3: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#1994855 (10demon) p:5Triage>3Normal [18:34:04] greg-g: can I slip in a beta only config change that has been waiting for a couple of days due to the crazy? [18:34:17] bd808: yeah, now's fine [18:34:26] swat went without hitch, so, I feel better :) [18:34:34] !log rebooting californium for kernel update [18:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:11] (03PS3) 10BryanDavis: Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson) [18:35:18] (03CR) 10BryanDavis: [C: 032] Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson) [18:35:50] (03Merged) 10jenkins-bot: Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson) [18:35:51] jdlrobson: getting ready to sync your beta cluster config change [18:35:57] bd808: \o/ [18:36:20] (03PS3) 10Aude: Put 
$wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) [18:37:54] (03CR) 10ArielGlenn: [C: 031] dataset: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266966 (owner: 10Dzahn) [18:38:02] PROBLEM - OTRS SMTP on iodine is CRITICAL: Connection refused [18:38:10] !log bd808@mira Synchronized wmf-config/InitialiseSettings-labs.php: Experiment one: Labs stripping HTML in beta (360e5af) (duration: 01m 19s) [18:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:15] (03CR) 10Aude: "now for enabling on wikidata when we are ready, we can just remove the setting entirely (from here + InitialiseSettings.php)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267405 (https://phabricator.wikimedia.org/T124931) (owner: 10Llyrian) [18:38:18] that's old OTRS, known [18:40:33] ACKNOWLEDGEMENT - OTRS SMTP on iodine is CRITICAL: Connection refused daniel_zahn https://phabricator.wikimedia.org/T74109 [18:40:44] (03Abandoned) 10BryanDavis: Wikibase: Defer modifying wgMFQueryPropModules until wfLoadExtensions runs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268147 (https://phabricator.wikimedia.org/T125672) (owner: 10BryanDavis) [18:41:30] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1994873 (10Dzahn) [18:41:33] 6operations, 10OTRS, 5Patch-For-Review, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1994872 (10Dzahn) [18:42:58] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1436914 (10Dzahn) OTRS switched to me mendelevium.eqiad.wmnet today , see T105125 i think that means Alex also resolved this ticket here [18:46:17] (03CR) 10ArielGlenn: [C: 031] 
mediawiki/jobrunner: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266967 (owner: Dzahn)
[18:49:49] <_joe_> bblack: cool, it works
[18:50:12] (PS1) Andrew Bogott: Fixes to the glance image backup cron [puppet] - https://gerrit.wikimedia.org/r/268187
[18:50:53] (PS3) Dzahn: dataset: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266966
[18:51:28] (CR) Dzahn: [C: 2] dataset: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266966 (owner: Dzahn)
[18:53:07] (PS2) Giuseppe Lavagetto: Revert "tin: disable l10nupdate until we figure out if it works with HHVM" [puppet] - https://gerrit.wikimedia.org/r/268020 (owner: Chad)
[18:56:31] (CR) Giuseppe Lavagetto: [C: 2] Revert "tin: disable l10nupdate until we figure out if it works with HHVM" [puppet] - https://gerrit.wikimedia.org/r/268020 (owner: Chad)
[19:00:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[19:01:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:01:47] (PS2) Andrew Bogott: Fixes to the glance image backup cron [puppet] - https://gerrit.wikimedia.org/r/268187
[19:01:53] thanks for syncing that bd808
[19:07:32] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:08:25] !log depooling restbase1002 for kernel/Java update
[19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:08:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:09:04] !log deployed patch for T125684
[19:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:13:41] (CR) ArielGlenn: "I can't see the grafana::dashboard class used anywhere, Ori
set that up in Ia558d44249db98cedf9d2187709c0de6e8ed595a so you want to ask hi" [puppet] - https://gerrit.wikimedia.org/r/266978 (owner: Dzahn)
[19:21:26] (PS3) Andrew Bogott: Fixes to the glance image backup cron [puppet] - https://gerrit.wikimedia.org/r/268187
[19:24:07] !log starting train deployment of 1.27.0-wmf.12
[19:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:24:57] !log Applying security patches on mira
[19:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:25:47] is this group 0, group 1?
[19:27:27] group 0
[19:27:31] settle for half an hour
[19:27:33] then group 1
[19:27:38] cool, thanks hash
[19:27:40] *hashar
[19:27:42] wish us luck
[19:27:49] and godspeed!
[19:29:35] !log Create patches to update wikiversions.json
[19:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:29:49] ah we have a ton of scripts that is nice
[19:30:38] !log repooling restbase1002
[19:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:31:33] updateWikiversions group0 php-1.27.0-wmf.12 points to .11 grmblblb
[19:32:16] (CR) Andrew Bogott: [C: 2] Fixes to the glance image backup cron [puppet] - https://gerrit.wikimedia.org/r/268187 (owner: Andrew Bogott)
[19:32:20] (PS6) Yuvipanda: toollabs: Do not hardcode Host header [puppet] - https://gerrit.wikimedia.org/r/267402
[19:35:07] (CR) JanZerebecki: [C: 1] Put $wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) (owner: Aude)
[19:35:31] !log mira: manually fixed /php and /w/static/current symlinks to point back to .10 (wikiversions migrated them to .11 which we skip)
[19:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:38:21] hmmm, that's annoying re
wikiversions
[19:39:52] !log hot patch OTRS installation with https://github.com/OTRS/otrs/commit/c7ea6d64e02518e166fbac02f42f25dacad54342
[19:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:40:03] mutante: see ? I told you it was too early to tell ^
[19:40:56] akosiaris: :/
[19:41:30] (PS1) Hashar: Group0 to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268196
[19:42:22] greg-g: Imma file a task.
[19:42:33] It's pretty naive and just assumes +/- 1 for the most part
[19:43:09] (CR) Hashar: "I followed the instruction. Note the updateWikiversions group0 php-1.27.0-wmf.12 script did set the symlink to .11 I manually reverted t" [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[19:44:32] at least it got the wikiversions.json properly
[19:44:37] hashar: https://phabricator.wikimedia.org/T125672
[19:45:01] eek
[19:45:11] we had a related swat for some other MobileFrontend unset variables
[19:45:54] !log https://phabricator.wikimedia.org/T125672 blocking wmf.12 "Notice: Undefined variable: wgMFQueryPropModules in /srv/mediawiki/wmf-config/Wikibase.php on line 120"
[19:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:45:59] (PS1) Aude: Re-enable math data type on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/268199
[19:46:10] There's a patch for this.
[19:46:13] Was supposed to merge.
[19:46:19] aude has https://gerrit.wikimedia.org/r/268182
[19:46:42] (CR) Chad: [C: 2] Put $wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) (owner: Aude)
[19:46:56] * aude also has https://gerrit.wikimedia.org/r/#/c/268199/ (beta only) to fix accidentally disabling this on beta
[19:47:12] (CR) Yuvipanda: [C: 2] toollabs: Do not hardcode Host header [puppet] - https://gerrit.wikimedia.org/r/267402 (owner: Yuvipanda)
[19:47:37] (Merged) jenkins-bot: Put $wgMFQueryPropModules and $wgMFSearchApiModules in InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/268182 (https://phabricator.wikimedia.org/T120197) (owner: Aude)
[19:47:50] (PS2) Hashar: Re-enable math data type on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/268199 (owner: Aude)
[19:48:04] (CR) Hashar: [C: 2] Re-enable math data type on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/268199 (owner: Aude)
[19:48:30] (Merged) jenkins-bot: Re-enable math data type on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/268199 (owner: Aude)
[19:48:38] thanks
[19:48:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0]
[19:48:50] the beta -labs.php ones are the easy ones :D
[19:48:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[19:48:51] !log halting puppet on carbon for a few minutes to livehack a partition recipe change in netboot.cfg
[19:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:48:56] yeah
[19:49:08] ostriches: are you deploying the https://gerrit.wikimedia.org/r/#/c/268182/
[19:49:12] Yes
[19:49:16] It's syncing now
[19:49:33] with multiple files we use sync-dir don't we?
[19:49:43] !log demon@mira Synchronized wmf-config/: fix wikibase/mobilefrontend config (duration: 01m 19s)
[19:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:49:50] looks like
[19:50:06] (CR) ArielGlenn: "I don't know where they think this variable is set. But $shapeline_exists doesn't seem to be checked anywhere, fwiw. Maybe that line can " [puppet] - https://gerrit.wikimedia.org/r/266971 (owner: Dzahn)
[19:50:28] hashar: Easiest, if you're expecting them to be in a certain order.
[19:51:23] those "Could not connect to server "rdb1005..." are spammy
[19:53:42] Um, this isn't going away....
[19:54:24] they have been around for ages
[19:54:37] so I am now at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Restore_symlinks_on_deployment_server
[19:54:46] No I'm not talking about that.
[19:54:59] Undefined variable?
[19:55:01] oh
[19:55:13] It's defined in InitialiseSettings...
[19:55:43] operations, Discovery, Maps, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1995251 (Yurik)
[19:55:55] initialisesettings needs to be synced
[19:57:09] I sync'd the whole directory
[19:57:13] * ostriches touches and resyncs
[19:58:22] !log demon@mira Synchronized wmf-config/InitialiseSettings.php: touch (duration: 01m 19s)
[19:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:58:38] Thereee we go
[19:58:48] looks better
[19:59:01] ostriches: are deployments unfrozen?
[19:59:09] Yeah
[19:59:14] We unfroze last night.
[19:59:21] Er, this morning?
[19:59:22] w/e
[19:59:23] :)
[19:59:59] though it is frozen for the wmf.12 deployment going on hehe
[20:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160203T2000).
[20:00:09] ori: thanks for your help yesterday with the pooling/versionning etc
[20:00:09] nod
[20:00:28] hashar: no problem! thank you!
[20:01:23] (PS1) Jdlrobson: Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959)
[20:01:37] ah
[20:01:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:01:47] ostriches: looks like they are gone for good no
[20:02:02] Yep
[20:02:38] ori: On that subject, if you've got anything you'd like to add (especially under actionables), please feel free https://wikitech.wikimedia.org/wiki/Incident_documentation/20160202-deployment-server-loss
[20:02:46] hashar, ostriches: holy mountain of fatals! that's just group0, right?
[20:02:47] (CR) jenkins-bot: [V: -1] Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959) (owner: Jdlrobson)
[20:03:00] (I'll e-mail in a bit when I'm done wordsmithing and such)
[20:03:02] havent pushed .12 yet
[20:03:09] (CR) Ori.livneh: [C: -1] grafana: fix top-scope var without namespace (1 comment) [puppet] - https://gerrit.wikimedia.org/r/266978 (owner: Dzahn)
[20:03:11] It was everything. Config.
[20:03:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:03:26] I can't see fatals though
[20:03:45] Notice: Undefined variable: wmgMFQueryPropModules in /srv/mediawiki/wmf-config/mobile.php on line 102
[20:03:46] Notice: Undefined variable: wmgMFSearchAPIParams in /srv/mediawiki/wmf-config/mobile.php on line 103
[20:03:46] oh scheiße, ok. looks to have subsided
[20:04:07] I guess you call them fatal because they show up in the fatal monitor ?
[20:05:13] so next would be the wikiversion patch https://gerrit.wikimedia.org/r/#/c/268196/
[20:05:20] no symlink update
[20:07:38] (CR) Hashar: "Links are point to .10" [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[20:09:38] ostriches: marxarelli if you could double check i havent screwed up the current symlinks ^^
[20:09:41] on mira
[20:10:27] w/static/1.27.0-wmf.12 is untracked
[20:11:10] Doing sync-common on mira itself would fix those
[20:11:18] Well, fix the broken symlinks
[20:11:36] ah
[20:11:40] I forgot to add them to the patch
[20:12:24] (PS2) Jdlrobson: Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959)
[20:12:31] (PS3) Jdlrobson: Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959)
[20:13:19] (PS2) Hashar: Group0 to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268196
[20:13:34] (PS3) Hashar: Group0 to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268196
[20:14:04] (CR) Hashar: "I forgot to git add the w/static/1.27.0-wmf.12/{extensions,resources,skins} symlinks" [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[20:15:31] ostriches: should be good now
https://gerrit.wikimedia.org/r/#/c/268196/
[20:16:07] (CR) Chad: [C: 1] "fire at will commander" [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[20:16:22] commander: fire please
[20:16:35] (CR) Hashar: [C: 2] Group0 to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[20:16:59] (Merged) jenkins-bot: Group0 to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268196 (owner: Hashar)
[20:17:54] and now the scary part sync to cluster and verify on testwiki
[20:17:55] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[20:18:47] /srv/mediawiki-staging/wikiversions.json and set testwiki to php-VERSION
[20:18:56] but I already pulled :D
[20:19:31] looks like we need two patches: one for symlinks, one to switch group0 wikis in wikiversion.json
[20:20:24] (CR) Alexandros Kosiaris: [C: -1] "Ariel is correct. Removal of the entire line is the correct way to go here." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/266971 (owner: Dzahn)
[20:20:28] hashar: Why?
[20:20:30] hashar: that's an extra step for extra verification on testwiki before syncing to group0
[20:20:34] !log Hacked wikiversions.json to only have testwiki on .12
[20:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:20:49] but my patch https://gerrit.wikimedia.org/r/#/c/268196/ switches all group0 to .12
[20:20:52] hashar: not sure it's strictly necessary, but it's the process i've been following as well
[20:21:04] anyway live hacked
[20:21:12] That's a silly extra step.
[20:21:48] running scap
[20:21:50] !log hashar@mira Started scap: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache
[20:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:22:20] oh
[20:22:35] from yesterday, maybe we should have depooled the servers acting as rsync proxies
[20:22:54] fwiw, I agree with ostriches -- it's not obvious to me that it actually helps with anything, and by being a convention rather than a hard requirement, plus the fact that it requires live-hacking, I suspect it introduces unreliability to the process rather than eliminate it.
[20:23:08] we can rework the process
[20:23:14] and have a gerrit patch solely switching testwiki
[20:23:19] * ostriches removes those 2 steps
[20:23:24] then a second patch to migrate rest
[20:23:31] No, extra steps are badddddd
[20:23:37] ostriches++
[20:23:39] !log hashar@mira scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2188303825" --threads=10 --lang en --quiet' returned non-zero exit status 255 (duration: 01m 49s)
[20:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:23:46] :-/
[20:23:51] what was the error, hashar?
[20:23:59] Warning: require_once(/srv/mediawiki-staging/php-1.27.0-wmf.12/extensions/RandomRootPage/Randomrootpage.php): failed to open stream: No such file or directory in /srv/mediawiki/wmf-config/CommonSettings.php on line 2402
[20:24:03] Fatal error: require_once(): Failed opening required '/srv/mediawiki-staging/php-1.27.0-wmf.12/extensions/RandomRootPage/Randomrootpage.php' (include_path='/srv/mediawiki-staging/php-1.27.0-wmf.12:/usr/local/lib/php:/usr/share/php') in /srv/mediawiki/wmf-config/CommonSettings.php on line 2402
[20:24:12] iirc that one got merged in core
[20:24:18] so the extension message should no more reference it
[20:24:19] Yes.
[20:24:26] But it's still deployed on old branches.
[20:24:32] operations, Discovery, Maps, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1995337 (EBernhardson) Talked with yuri and max about this today. Yuri is going to try to get some numbers around how many tiles we can generate with the current hardwar...
[20:24:35] Rather, it's deployed on all branches.
[20:24:36] extra steps aren't bad if they provide valuable verification, but this one only provides the illusion of it
[20:24:38] I thought we fixed that earlier.
[20:24:41] * ostriches fixes
[20:25:25] and CommonSettings.php still has require_once( "$IP/extensions/RandomRootPage/Randomrootpage.php" );
[20:25:29] marxarelli: yep
[20:25:30] Yes.
[20:25:39] The problem was it was merged to core and then not branched.
[20:25:49] Easiest fix is to just re-branch
[20:25:53] er, add to branch
[20:27:20] if we require_once RandomRootPage in .12 I guess we will have some class naming conflict
[20:27:51] Then it was improperly deprecated.
[20:27:52] hashar: RandomRootPage was removed from the branch cut due to the repo being archived
[20:27:54] * ostriches sighs
[20:29:10] core has +class SpecialRandomrootpage extends RandomPage {
[20:29:35] oh boy
[20:29:49] and the extension had Randomrootpage_body.php:class SpecialRandomrootpage extends RandomPage {
[20:29:51] Ahahah
[20:29:56] {{fixed}}
[20:30:25] A *.php entry stub with only a comment is there.
[20:30:41] So .12 will just load an empty file and .10 and below will load the real extension
[20:30:46] Committing shortly, syncing now
[20:31:04] sounds neat
[20:31:19] ostriches: but what about the derived class?
it will still blow up
[20:31:32] !log demon@mira Synchronized php-1.27.0-wmf.12/extensions/RandomRootPage/: unbreak (duration: 01m 19s)
[20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:31:48] will want that in the mediawiki/extensions/RandomRootPage wmf branch
[20:32:11] (PS1) Aude: Don't request pageprops for mobile search/nearby on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/268208 (https://phabricator.wikimedia.org/T120197)
[20:32:17] !log hashar@mira Started scap: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache (after RandomRootPage had a dummy entry point added)
[20:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:33:14] marxarelli: Derived class? There's nothing left in the extension
[20:33:15] my patch above is not urgent and can wait until swat tomorrow
[20:33:19] Everything's in core now
[20:33:30] !log hashar@mira scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.bTBpxD6CuI" ' returned non-zero exit status 1 (duration: 01m 13s)
[20:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:33:37] ostriches: oh, gotcha
[20:33:39] just realized i didn't completley solve T120197
[20:33:45] [cee836ce] [no req] Exception from line 31 of /srv/mediawiki-staging/php-1.27.0-wmf.12/extensions/Validator/Validator.php: Validator depends on the ParamProcessor library.
[20:33:50] i didn't realize that's why the extension was deprecated
[20:34:15] hashar: wrong version of smw?
[20:34:18] for wikitech?
[20:34:24] (like we had the wrong version of wikibase?)
[20:34:41] think it's supposed to be pinned to a specific version
[20:35:02] looks like we screwed up the branch cut yesterday
[20:35:15] :/
[20:37:39] Validator has 1a49880 (origin/0.5.x)
[20:38:32] -2257c7f2552c2a9202c4cf8f195f7aa1afca0790 extensions/Validator
[20:38:40] so we point to the master branch
[20:40:07] hashar: :/ checking integration-make-wmf-branch.eqiad.wmflabs ...
[20:40:32] oh
[20:40:39] preparing a patch
[20:40:50] registering the extensions / checking out proper branches manually
[20:42:51] 404.00 KiB/s gerrit is slow today
[20:42:51] most of the time i'd probably not be paying attention, re: the wikidata branch (if we were not bumping this week)
[20:43:11] and might be surprised what got deployed
[20:45:40] operations, Deployment-Systems, Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1995411 (phuedx)
[20:46:48] hashar, oh, there's a related bug or two about this
[20:46:54] hashar, is CentralNotice also on master?
[20:46:55] hashar: we'll need to check the other special case extensions as well
[20:47:24] https://phabricator.wikimedia.org/T113428 is very relevant
[20:47:28] everything in make-wmf-branch/config.json
[20:47:51] "If we had lost /srv/mediawiki-staging somehow, we would have been able to rebuild everything correctly from git (+any security patches), but not CentralNotice." - RoanKattouw predicted this back in September
[20:47:54] Krenair: yes, CentralNotice is most likely on master as well
[20:48:31] and jzerebecki also filed a related bug
[20:48:36] so in theory https://gerrit.wikimedia.org/r/268214
[20:48:39] should fix the branches
[20:49:12] hashar: needs Wikidata, too
[20:49:25] yeah
[20:49:25] target wmf/1.27.0-wmf.10
[20:49:25] Wikidata has already been bumped earlier today.
[20:49:31] kk
[20:49:40] some unit tests were falling iirc
[20:49:47] and that is how we noticed Wikidata got cut from master
[20:49:56] but forgot to check the other specials
[20:50:10] will want to compare the commits I am proposing for .12 with the ones from .10
[20:53:25] hashar: i've verify what's on integration-make-wmf-branch as well to double check other extensions, vendor, etc.
[20:53:31] any double checker for https://gerrit.wikimedia.org/r/#/c/268214/ ?
[20:53:32] *i'll* verify
[20:53:38] rechecked it twice from fresh clone
[20:53:49] what I did is get core @ wmf/1.27.0-wmf.12
[20:54:12] submodule update --init extensions/{CentralNotice,SemanticMediaWiki,SemanticResultFormats,Validator}
[20:54:23] then in each did git reset --hard 'the branch we want'
[20:54:38] git add && commit / push
[20:55:52] hashar: you should modify .gitmodules as well i think
[20:56:09] (PS1) EBernhardson: Don't create new log files for cirrus-suggest with logrotate [puppet] - https://gerrit.wikimedia.org/r/268215
[20:56:18] i.e. set the branch for each of the special extensions
[20:56:23] oh
[20:56:25] good point
[20:57:06] hashar: that's the part of make-wmf-branch that f'd up, because of my local hack apparently :(
[20:57:21] though i still don't see why exactly
[20:57:55] maybe ostriches or thcipriani can explain that to me later :)
[21:00:00] https://gerrit.wikimedia.org/r/#/c/268214/ fix the .gitmodules
[21:00:04] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160203T2100). Please do the needful.
[21:00:21] and https://gerrit.wikimedia.org/r/268218 fix the branch= entry for Wikidata
[21:01:00] hashar: looks right. and you modified .gitmodules for Wikidata earlier?
[21:01:10] nop
[21:01:26] I guess aude did a reset --hard to wikidata .12 / add dir / review
[21:01:31] just like I did
[21:01:41] so https://gerrit.wikimedia.org/r/268218 adds the branch=
[21:02:37] greg-g, i guess the mediawiki train is going out now .. when is the right time to do a parsoid deploy?
[21:02:51] oohhh!
[21:02:53] * apergos lurks
[21:03:35] subbu: you're scheduled at 2pm pacific
[21:03:41] mm
[21:03:48] probably off line by then :-(
[21:04:06] hashar: +2d and -1d (wrong branch for Wikidata)
[21:04:15] greg-g, ah, ok .. used to be 1pm pt .. is this for this week only or a permanent change?
[21:04:49] just for today
[21:04:51] i assume this week because of the compressed multi-group m/w deployment schedule for today.
[21:04:53] apergos, ok. we can test on monday.
[21:04:56] greg-g, ok. thanks.
[21:05:01] well I don't mind
[21:05:25] marxarelli: I think we roll Wikidata wmf/1.27.0-wmf.12
[21:05:25] I mean I can find out after the fact, unless you would rather I be here
[21:05:37] if you want me here, then Monday indeed
[21:05:40] .10 does not work with core .12 iirc
[21:05:43] subbu:
[21:05:54] no, you don't need to be here.
[21:06:17] i have the bash script that i can use if that command fails.
[21:06:24] apergos, and will file an update on the ticket.
[21:06:46] marxarelli: yeah https://gerrit.wikimedia.org/r/#/c/268100/
[21:07:10] ah thanks! fingers crossed
[21:08:45] !log waiting for the submodule patch https://gerrit.wikimedia.org/r/#/c/268214/ to land and will scap again
[21:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:09:10] (CR) Mobrovac: "Ah, ok. My interpretation of the commit message was that you wanted to add it to labs only for now, but I see the wiki is already active." [puppet] - https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: Dereckson)
[21:09:58] hashar: kk.
we'll have to update tools/release
[21:10:09] will want to sync with wmde
[21:10:39] * marxarelli goes to his 1:1 with greg-g
[21:10:44] ;-)
[21:13:56] operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1995482 (RobH) I'm having a hell of a time getting the partitioning to work properly on oresrdb1001. I've tried a few things, and everything fails with the mounting of / during the OS installer partitioning steps. S...
[21:14:22] !log depooling restbase1008 for kernel/Java update
[21:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:15:05] operations, Patch-For-Review: jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago - https://phabricator.wikimedia.org/T125256#1982775 (RobH) Please note I had this exact same (frustrating) issue today for my installation on oresrdb1001. I also tried m...
[21:17:35] (CR) Alex Monk: "Cool, thanks @Mobrovac. Is there a task tracking those kernel upgrades?" [puppet] - https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: Dereckson)
[21:17:55] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995495 (jcrespo)
[21:18:31] well
[21:20:02] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995497 (jcrespo) CCing @MoritzMuehlenhoff as the "package expert". I may be wrong, but this smells like upstream bug on jessi...
[21:25:13] !log mira had to hard reset CentralNotice / SemanticMediaWiki / SemanticResultFormats / Validator after we pointed them from master to their proper branch, submodule attempted a rebase automatically..
That is a no no
[21:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:25:34] stupid qunit job fails
[21:26:48] !log hashar@mira Started scap: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache (with proper branches for special_extensions)
[21:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:27:34] !log repooling restbase1008
[21:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:29:21] Ops-Access-Reviews: Get Alex Monk access to wikitech-static - https://phabricator.wikimedia.org/T125715#1995535 (Andrew) NEW a:Andrew
[21:29:22] tis rebuilding
[21:29:41] operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1995545 (RobH) I didn't ask @akosiaris if these should be Jessie or Trusty. If they are trusty, we can install them right away. If they are jessie, it will be blocked by the issue detailed on T125256.
[21:29:48] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995548 (RobH)
[21:29:51] operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1995547 (RobH)
[21:32:21] Started sync-masters
[21:32:27] !log depooling restbase1003 for kernel/Java update
[21:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:34:29] !log hashar@mira scap aborted: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache (with proper branches for special_extensions) (duration: 07m 41s)
[21:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:34:51] ah sync-masters screwed up with srv/mediawiki-staging/php-1.27.0-wmf.12/cache/l10n/l10n_cache-pnt.cdb.tmp
[21:35:04] (PS1) BBlack: cache_mobile LVS decom: 1/2 remove LVS service [puppet] -
https://gerrit.wikimedia.org/r/268226 (https://phabricator.wikimedia.org/T109286)
[21:35:06] (PS1) BBlack: cache_mobile LVS decom: 2/2 remove conftool data [puppet] - https://gerrit.wikimedia.org/r/268227 (https://phabricator.wikimedia.org/T109286)
[21:35:08] (PS1) BBlack: cache_mobile decom: 1/2 remove realserver IPs [puppet] - https://gerrit.wikimedia.org/r/268228 (https://phabricator.wikimedia.org/T109286)
[21:35:10] (PS1) BBlack: cache_mobile decom: 2/2 Remove most cache config [puppet] - https://gerrit.wikimedia.org/r/268229 (https://phabricator.wikimedia.org/T109286)
[21:35:12] (PS1) Subramanya Sastry: parsoid-rt-client: Have testreduce clients use global parsoid service [puppet] - https://gerrit.wikimedia.org/r/268230
[21:35:26] (PS1) BBlack: cache_maps: define tier-2 backending [puppet] - https://gerrit.wikimedia.org/r/268233 (https://phabricator.wikimedia.org/T109162)
[21:35:28] (PS1) BBlack: cache_maps: define global service IPs [puppet] - https://gerrit.wikimedia.org/r/268234 (https://phabricator.wikimedia.org/T109162)
[21:35:30] (PS1) BBlack: cache_maps: add all sites to ganglia [puppet] - https://gerrit.wikimedia.org/r/268235 (https://phabricator.wikimedia.org/T109162)
[21:35:32] (PS1) BBlack: cache_maps: re-role old mobile servers [puppet] - https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162)
[21:35:34] (PS1) BBlack: cache_maps: remove cp104[34] test caches [puppet] - https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162)
[21:35:36] (PS1) BBlack: cache_maps: add all sites in LVS [puppet] - https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162)
[21:35:44] !log mismatching uid for l10nupdate user between mira and tin
[21:35:45] (PS1) BBlack: maps DNS 1/2: define at all DCs [dns] - https://gerrit.wikimedia.org/r/268239 (https://phabricator.wikimedia.org/T109162)
[21:35:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:35:47] (PS1) BBlack: maps DNS 2/2: enable geodns routing [dns] - https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162)
[21:36:04] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995578 (RobH) p:Triage>Unbreak! I'm assigning directly to @MoritzMuehlenhoff, since it is easy to miss a CC addition to...
[21:36:04] ugh, I thought we fixed that uid issue
[21:36:13] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995581 (RobH) a:MoritzMuehlenhoff
[21:36:36] mira has l10nupdate uid == 10002 tin has l10nupdate uid = 1001
[21:36:46] https://phabricator.wikimedia.org/T119165
[21:37:03] supposed to be 02
[21:37:12] https://phabricator.wikimedia.org/T119165#1831929
[21:37:36] (CR) jenkins-bot: [V: -1] cache_mobile decom: 1/2 remove realserver IPs [puppet] - https://gerrit.wikimedia.org/r/268228 (https://phabricator.wikimedia.org/T109286) (owner: BBlack)
[21:37:40] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995591 (greg) Resolved>Open This is back: 21:36 < hashar> mira has l10nupdate uid == 10002 tin has l10nupdate uid = 1001
[21:38:02] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995594 (hashar) Following reinstallation of tin on Feb 2nd: | mira | uid=10002(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate) | tin | uid=1001(l1...
[21:38:30] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995597 (hashar) [21:38:34] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995598 (RobH) p:Unbreak!>High Moritz chatted with us about this in IRC and plans to work on it tomorrow, so lowering to... [21:39:21] ah [21:39:31] obviously we don't have root access on our deployment servers [21:39:34] nope [21:40:20] * hashar blames puppet [21:41:05] (CR) jenkins-bot: [V: -1] maps DNS 2/2: enable geodns routing [dns] - https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) (owner: BBlack) [21:43:12] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995631 (hashar) So @ori had a patch https://gerrit.wikimedia.org/r/#/c/256026/4/modules/scap/manifests/l10nupdate.pp,cm but that does not show up anymore in... 
[21:44:11] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1995645 (Dzahn) attaching log files saved earlier from `oresrdb1001` {F3308098} {F3308099} {F3308100} [21:45:25] !log repooling restbase1003 [21:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:46] !log depooling restbase1004 for kernel/Java update [21:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:44] !log tin - fixing UID of l10nupdate user (T119165) [21:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:48] (CR) Alex Monk: "This reveals an interesting issue: https://phabricator.wikimedia.org/P2558" [puppet] - https://gerrit.wikimedia.org/r/267816 (owner: EBernhardson) [21:51:45] !log tin - find / -uid 1001 -exec chown 10002 {} \; [21:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:44] !log hashar@mira Started scap: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache (with proper branches for special_extensions) [21:53:34] !log reopened https://phabricator.wikimedia.org/T119165 ''l10nupdate user uid mismatch between tin and mira'' [21:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:03] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995688 (Dzahn) I went to tin and adjusted the UID of the l10nupdate user, after confirming 10002 is correct on https://wikitech.wikimedia.org/wiki/UID `vi /e... [22:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160203T2200). 
Please do the needful. [22:00:12] RFC meeting now in #wikimedia-office: T124752 Expiring watch list entries [22:00:37] we already did a mobileapps deploy earlier today [22:02:06] jynus: ^ you may want to join this [22:02:19] starting parsoid deploy shortly [22:02:50] (CR) Mobrovac: "In-lined question / comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/268230 (owner: Subramanya Sastry) [22:03:24] !log mira, tin: find /srv/mediawiki-staging/ -uid 1001 -exec chown 10002 {} \; [22:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:05] note wmf.12 is not complete yet [22:05:23] scap-rebuild-cdbs: 0% (ok: 0; fail: 0; left: 478) [22:05:23] ... [22:05:39] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995731 (Dzahn) same on mira (and tin) for: `find /srv/mediawiki-staging/ -uid 1001 -exec chown 10002 {} \;` [22:06:06] !log starting parsoid deploy [22:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:25] ah guess who is still online [22:07:15] subbu: lemme know how it's going, if you run into troubles (since I'm still here :-/) I can help look [22:07:17] !log repooling restbase1004 , depooling restbase1005 for kernel/Java update [22:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:39] (PS2) Tim Landscheidt: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - https://gerrit.wikimedia.org/r/266448 [22:07:41] (PS1) Tim Landscheidt: Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - https://gerrit.wikimedia.org/r/268279 [22:07:48] will do. :) verified test on beta .. now to continue .. [22:09:30] git pull is slow today .. [22:09:52] some days are those days yeah [22:10:37] really slow [22:10:38] there you go. finished. 
[22:11:00] that was over a minute [22:11:42] (CR) Tim Landscheidt: "Needs to have I2d643fc902208eafaaa0d7814e586f0c326f16b5 deployed on tools-proxy-*. Test script:" [puppet] - https://gerrit.wikimedia.org/r/268279 (owner: Tim Landscheidt) [22:12:01] syncing. [22:12:17] doo dee doo dee doo [22:13:07] !log hashar@mira Finished scap: testwiki to php-1.27.0-wmf.12 and rebuild l10n cache (with proper branches for special_extensions) (duration: 20m 23s) [22:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:27] !log restarted parsoid on wtp1002 as a canary [22:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:36] !log hashar@mira Started scap: to properly sync other master tin due to l10nupdate uid mismatch [22:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:07] chirp chirp [22:14:07] will monitor vitals for a minute or two before restarting on all nodes [22:14:14] oh did the sync complete ok? [22:14:19] !log https://test.wikipedia.org/ switched to 1.27.0-wmf.12 [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:31] yes, sync went through just fine. [22:14:58] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1995783 (Dzahn) Open>Resolved [22:16:15] operations, Deployment-Systems, Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1819881 (Dzahn) the immediate blocker is fixed. for the puppetization issue see comments on https://gerrit.wikimedia.org/r/#/c/255421/ [22:18:02] ok, looking good .. apergos now using the git deploy service restart command .. let us see how that goes. 
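The ownership fix logged earlier (`find / -uid 1001 -exec chown 10002 {} \;`) can be rehearsed before anything is changed. What follows is a minimal Python sketch of the same idea, not the procedure that was actually run: the function names `paths_owned_by` and `reassign_owner` are hypothetical, and the default is a dry run that only reports which paths would be touched. Like the logged command, it changes the owner only and leaves the group alone.

```python
import os
from typing import Iterator, List


def paths_owned_by(root: str, uid: int) -> Iterator[str]:
    """Yield root and every path below it whose owner is uid (symlinks are not followed)."""
    if os.lstat(root).st_uid == uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                yield path


def reassign_owner(root: str, old_uid: int, new_uid: int, dry_run: bool = True) -> List[str]:
    """Return the paths owned by old_uid; chown them to new_uid unless dry_run is set."""
    affected = []
    for path in paths_owned_by(root, old_uid):
        if not dry_run:
            # A group of -1 leaves the group unchanged; lchown acts on the link itself.
            os.lchown(path, new_uid, -1)
        affected.append(path)
    return affected


if __name__ == "__main__":
    # UIDs and path taken from the log; this call only lists, it changes nothing.
    for path in reassign_owner("/srv/mediawiki-staging", 1001, 10002):
        print(path)
```

Passing `dry_run=False` performs the change and needs root, so reviewing the dry-run list first is the cheaper way to catch surprises.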
[22:18:11] yeah here's where I wonder what will happen [22:18:28] because there's also the phab ticket about the timeout issue, which is separate [22:18:42] from what i remember, i don't get any feedback till it is done. [22:18:49] that's right [22:20:40] wtp1022 doesn't look happy .. might be a stuck process .. that one might need a manual intervention after this command finishes. [22:21:03] ok, I'm here to give it a kick if needed [22:21:09] !log repooling restbase1005 , depooling restbase1006 for kernel/Java update [22:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:21:16] (CR) GWicke: [C: 1] Change default consistency to localOne [puppet] - https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: Eevans) [22:21:39] (PS2) BBlack: cache_mobile decom: 1/2 remove realserver IPs [puppet] - https://gerrit.wikimedia.org/r/268228 (https://phabricator.wikimedia.org/T109286) [22:21:41] (PS2) BBlack: cache_mobile decom: 2/2 Remove most cache config [puppet] - https://gerrit.wikimedia.org/r/268229 (https://phabricator.wikimedia.org/T109286) [22:21:44] ssastry@mira:/srv/deployment/parsoid/deploy$ git deploy service restart [22:21:44] Error received from salt; raw output: [22:21:44] 'deploy.restart' runner publish timed out [22:21:44] ssastry@mira:/srv/deployment/parsoid/deploy$ [22:21:45] apergos, ^ [22:21:51] (PS2) BBlack: cache_maps: add all sites in LVS [puppet] - https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) [22:21:52] that's the timeout bug then [22:21:53] (PS2) BBlack: cache_maps: re-role old mobile servers [puppet] - https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) [22:21:54] which is um [22:21:55] (PS2) BBlack: cache_maps: remove cp104[34] test caches [puppet] - https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) [22:21:57] (PS2) BBlack: 
cache_maps: define global service IPs [puppet] - https://gerrit.wikimedia.org/r/268234 (https://phabricator.wikimedia.org/T109162) [22:21:59] (PS2) BBlack: cache_maps: add all sites to ganglia [puppet] - https://gerrit.wikimedia.org/r/268235 (https://phabricator.wikimedia.org/T109162) [22:22:01] (PS2) BBlack: cache_maps: define tier-2 backending [puppet] - https://gerrit.wikimedia.org/r/268233 (https://phabricator.wikimedia.org/T109162) [22:22:03] so, i guess i'll run through my bash script then. [22:22:10] https://phabricator.wikimedia.org/T63882 [22:22:29] ok .. i am going to run my bash script unless you tell me to do something else. [22:22:52] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:58] the timeout might be because of wtp1022 stuck process potentially .. [22:23:13] well hm [22:23:19] should I go kick wtp1022 then? [22:23:20] (PS2) BBlack: maps DNS 2/2: enable geodns routing [dns] - https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) [22:23:31] apergos, pid 2337 [22:23:32] on wtp1022 [22:23:59] and pid 3825 on wtp1020 [22:24:41] PROBLEM - Parsoid on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:43] apergos, if you kill -9 those processes, they will recover. [22:26:21] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.022 second response time [22:26:21] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.038 second response time [22:26:23] hashar: Please merge https://gerrit.wikimedia.org/r/#/c/268306/ [22:26:31] no way [22:26:44] hashar: Oh. [22:26:45] paladox: I am doing the mw deployment still [22:26:48] :D [22:26:54] hashar: Ok. [22:27:05] apergos, ok .. so, do you want me to retry that command or should i use my bash script? 
(since we don't know which still need restarting) [22:27:15] paladox: but yeah that looks like a potential hack solution [22:27:29] retry once more, if you get timeout then I'll have to go look at that task tomorrow [22:27:35] and you can fall back to the bash script [22:27:37] subbu: [22:27:43] retrying. [22:27:51] ok. [22:27:54] I was just checking to see which hosts had gotten the restart completed but it's a real mixed bag [22:27:55] (CR) Ottomata: "Joal, let's merge this tomorrow. Help me babysit it?" [puppet] - https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: Eevans) [22:27:58] hashar: Ok thanks for looking into it. I will see if someone else could merge. Since it is a temp solution and allows us to merge on the REL1_25 branch again. [22:28:14] paladox: most probably addshore / jzerebecki tomorrow [22:28:26] hashar: Ok. [22:28:27] *waves* [22:28:43] what exactly is the issue? I saw all the patches but don't know what the problem is [22:29:12] addshore: REL1_25 branch is failing because of wikidata. It is also causing tests to fail for flow. [22:29:12] Wikibase has been added to the shared job mwext-testextensions-* [22:29:23] but that fails on REL1_25 (and probably on REL1_26 as well) [22:29:49] but what failure? :P [22:29:53] Blocked-on-Operations, operations, Parsoid, Salt, Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1995860 (ArielGlenn) [22:29:53] apergos, the problem is that sometimes some workers end up in a large request and they would be killed by the cluster master after a timeout (3 min) .. but it also means that parsoid restarts get held up on those .. which the git deploy service restart probably doesn't like. 
[22:29:56] operations, Deployment-Systems, Salt, Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1995861 (ArielGlenn) [22:30:15] well we ought to be able to configure for that [22:30:19] I haven't seen any links etc! [22:30:25] and we have a couple of those pages being parsed right now by parsoid workers .. so there is a small chance this retry might fail too .. but fingers crossed. [22:30:33] so I've just added the timeout issue as a blocking task for the one you commented on [22:30:47] I'll slug away on that over the next few days [22:31:00] addshore: https://integration.wikimedia.org/ci/job/mwext-testextension-php53/1431/console [22:31:31] apergos, wtp1017 this time. [22:31:43] addshore: It seems scribunto also fails. https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm/1290/console [22:31:47] but, it might recover, let us see. [22:31:52] huh [22:32:03] aude: know anything about this? [22:32:10] what's the full output you get from the restart command? [22:32:36] ssastry@mira:/srv/deployment/parsoid/deploy$ git deploy service restart [22:32:36] Error received from salt; raw output: [22:32:36] 'deploy.restart' runner publish timed out [22:32:37] ssastry@mira:/srv/deployment/parsoid/deploy$ [22:33:10] apergos, pid 10739 on wtp1017 needs restarting. [22:33:11] ok that's the whole thing then [22:33:12] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [22:33:17] ok I'll shoot that too [22:33:36] addshore: don't know [22:33:46] apergos, pid 716 on wtp1003 [22:33:53] but our tests against core only passed when we made the branch right aude ? [22:34:08] !log Still looking at test.wikipedia.org being super "slow" . scap still rebuilding though [22:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:15] did those [22:34:23] anything else? 
[22:34:33] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:34:38] !log repooling restbase1006 , depooling restbase1009 for kernel/Java update [22:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:48] subbu: ? [22:35:00] apergos, 13411 on wtp1008 [22:35:00] addshore: https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm/1290/consoleFull looks vaguely familiar [22:35:02] PROBLEM - salt-minion processes on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:35:02] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.054 second response time [22:35:12] jan might remember the problem [22:35:22] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [22:35:31] done [22:35:39] are you sure it was 716? [22:35:46] looks like i missed 9050 on wtp1003. [22:36:06] done [22:36:16] (PS5) Krinkle: [WIP] Prepare db-codfw.php for a live deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/267659 (owner: Jcrespo) [22:36:22] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.009 second response time [22:36:23] aude: addshore: I've created this task https://phabricator.wikimedia.org/T125722 [22:36:42] (PS2) Dzahn: Remove caesium from DNS Bug:T125165 [dns] - https://gerrit.wikimedia.org/r/268149 (https://phabricator.wikimedia.org/T125165) (owner: Papaul) [22:36:51] so I guess it's the bash script for now (sorry) [22:36:53] (CR) Dzahn: [C: 2] Remove caesium from DNS Bug:T125165 [dns] - https://gerrit.wikimedia.org/r/268149 (https://phabricator.wikimedia.org/T125165) (owner: Papaul) [22:37:09] yes, that is fine. [22:37:12] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.018 second response time [22:37:13] hashar: any idea what all those connection timeouts to rdb*.eqiad.wmnet in the hhvm logs are about? 
[22:37:26] bd808: no idea :( [22:37:35] bd808: upstart was broken at some point and kept restarting redis [22:37:44] apergos, we separately need to figure out how to deal with these bad requests that interfere with restarts .. [22:38:03] !log hashar@mira Finished scap: to properly sync other master tin due to l10nupdate uid mismatch (duration: 24m 27s) [22:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:06] bd808: and we had some mismatch between redis-server package and configuration causing redis to refuse to start (but that should not have happened on prod) [22:38:18] bd808: the rest, I don't know. Maybe they are overloaded :( [22:38:25] at last scap complete [22:38:36] ok, running the bash script now. [22:38:55] operations, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1995892 (Dzahn) p:Triage>Normal [22:39:03] getting test.wp.o from the cluster is taking ~30s. Getting it from mw1017 takes 1.5s [22:39:11] well if we can increase the timeout to 3 minutes or whatever, then your long queries can do what they do [22:39:41] i suspect we'll get a couple more of these stuck process issues again even with the bash script .. since that bad title continues to be requested. [22:40:05] operations, Deployment-Systems, Salt, Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1995902 (ArielGlenn) (12:32:36 AM) subbu: ssastry@mira:/srv/deployment/parsoid/deploy$ git deploy service restart (12:32:36 AM) subbu: Error received fro... [22:40:14] but, at least we know for sure on which nodes we had a successful restart. 
[22:40:14] operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1980223 (Dzahn) [22:40:20] right [22:40:29] well I can shoot processes again, it's no problem [22:40:51] so [22:41:00] nerd movie "they shoot parsoids don't they" [22:41:05] we are going to try to push 1.27.0-wmf.12 on another wiki to get more logs [22:41:11] thcipriani: though test.wikipedia.org is not happy :( [22:41:23] yeah it's unusably slow [22:41:39] should we switch test2 ? [22:41:42] 15-30 seconds to load [[Main Page]] [22:41:49] cause test2 is fast right now [22:41:53] operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1995907 (Dzahn) is gone from DNS now https://gerrit.wikimedia.org/r/#/c/268149/ please follow-up with disk wiping / hardware decom or reclaim [22:41:57] !log repooling restbase1009 [22:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:02] if I pin my requests to mw1017 things are fast too [22:42:12] RECOVERY - salt-minion processes on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:42:15] so something is wonky in the main cluster [22:42:17] I don't understand. I'm not seeing slowness locally... [22:42:50] to switch solely test2 I have edited the wikiversions.json [22:42:51] the rdb1* servers that are logging all of the connection timeouts are the job queues [22:42:54] now should I scap again ? [22:43:15] just sync-wikiversions hashar [22:43:26] apergos, :) 17 done .. 7 more to go .. codfw cluster will go through quickly since it has no load. 
[22:43:33] right [22:43:46] !log sync-wikiversions "test2wiki to php-1.27.0-wmf.12" [22:43:46] !log hashar@mira rebuilt wikiversions.php and synchronized wikiversions files: test2wiki to php-1.27.0-wmf.12 [22:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:43:50] oh [22:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:43:53] what a fast command [22:44:05] only changes 1 file :) [22:44:06] and guess what became slow [22:44:10] test2 [22:44:15] do I get a cookie? [22:44:29] * bd808 sticks a tracking cookie to apergos [22:44:31] so 1.27.0-wmf.12 is borked [22:44:35] dang. shoulda known [22:44:38] I am happy I only deployed testwiki [22:44:46] hashar: It seems mediawiki 1.27 wmf12 has performance issues; test2 was loading instantly a minute ago, now it is slow. [22:45:11] paladox: yeah I pushed .12 a minute or so ago [22:45:11] paladox: agreed. we are looking into it [22:45:31] got the same slowness whether logged in or not [22:45:52] hashar: bd808: Ok. I wonder what could be causing such performance issues. [22:46:32] Maybe we should set up a wiki that downloads the latest alpha so we can check in advance if performance issues start to arise. [22:46:48] apergos, success .. all restarted. [22:46:52] ok [22:46:59] any hung processes this time, subbu ? [22:47:11] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:14] !log finished deploying parsoid sha 98619f7f [22:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:26] bd808: Could it be an extension slowing everything down? [22:47:37] apergos, looks like one on wtp1018 [22:47:40] looking which one it is. [22:47:51] * apergos gets trigger finger ready [22:48:13] pid 12691 [22:48:30] done [22:48:42] and one more on it, looks like. 
[22:48:50] pid 10335 [22:49:01] done [22:49:08] and 26406 [22:49:21] done [22:49:26] hashar: the strange bit is that whatever is slowing things down is not affecting mw1017 [22:49:53] great. should recover now. [22:50:05] filing a task [22:50:14] addshore: aude: Wikidata also fails qunit tests in master https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/29062/consoleFull [22:50:33] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.021 second response time [22:50:37] and there it is [22:50:51] (PS1) Dzahn: dhcp: switch mc1004/1005 to jessie installer [puppet] - https://gerrit.wikimedia.org/r/268311 (https://phabricator.wikimedia.org/T123711) [22:50:53] thanks very much. [22:50:56] sure [22:51:01] thanks for helping with testing [22:51:10] now I know what state we're in and what's next [22:51:12] and next for me is [22:51:33] * apergos sticks bd808 with the tracking cookie and high-tails it for the bed [22:51:34] have a good night. [22:51:35] night! [22:51:36] !log test / test2 wikis are incredibly slow . Filed https://phabricator.wikimedia.org/T125727 [22:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:51] all done with parsoid deploy [22:52:17] thcipriani: so I don't think rest of group0 / group1 will be done anytime soon [22:52:30] operations, Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1995936 (Dzahn) >>! In T123711#1985735, @elukey wrote: > 1) Work with Joe and Dzahn to figure out how to provision a new node with Jessie Here's an example change that switches mc1004 and... 
[22:52:53] test and test2 are no longer exclusively (or at all) hosted by mw1017 [22:53:08] mw1017 handles requests which bear the X-Wikimedia-Debug header, regardless of project [22:53:58] which points to this being a wmf12 problem, rather than a testwiki/test2wiki problem [22:54:09] yup [22:54:26] ori: mw1017 for wmf.12 is fast though [22:54:39] so now other apaches are handling wmf.12 requests, and they are slower? [22:54:41] which seems like a strange outcome [22:55:29] Krenair: yes, test and test2 go to the general app server pool, like other projects. [22:55:36] "wgBackendResponseTime":10767,"wgHostname":"mw1113" [22:55:44] so that floats everywhere [22:55:49] and wmf.12 is borked [22:56:11] yep [22:56:21] maybe forceprofile=1 would help ? [22:56:31] needs the magic headers now apparently [22:56:41] yeah [22:57:07] Seeing a steady trickle of "Warning: timed out after 0.2 seconds when connecting to rdb1007.eqiad.wmnet [110]: Connection timed out" in hhvm logs. the rdb* host varies [22:57:12] it's strange, a curl for testwiki main_page directly on mw1017 takes < 1sec, but on a different machine takes 13 seconds... [22:57:35] testwiki goes to all backends, not just mw1017 [22:58:10] sure, a curl for localhost testwiki, to clarify [22:58:58] the LightProcess errors from mw1019 are a known issue correct? [22:59:35] (PS1) Hashar: Only keep testwiki test2wiki 1.20.7-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268315 (https://phabricator.wikimedia.org/T125727) [22:59:48] thcipriani: ^^ that is to clarify the current situation on mira [23:00:01] (PS2) Dzahn: ipsec: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266985 [23:00:22] I just tried a random article from testwiki on curl, and got one slow load then one fast load, hmmm [23:00:31] (via appservers.svc IP) [23:00:33] I'm going to test mediawiki 1.27 wmf12 on my wiki without updating extensions. 
[23:00:53] the timing seems random [23:01:05] I've had near-instant up to ~5s page loads [23:01:18] bblack: I've had up to 30s [23:01:31] at least we know wmf11 had no such slowdown [23:01:59] (CR) Dzahn: [C: 2] ipsec: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266985 (owner: Dzahn) [23:02:09] one thing is that on mira some l10n cache had the wrong entries for wmf12 so maybe they haven't been synced properly on some/most hosts [23:02:16] thus mediawiki would run without a l10n cache [23:02:30] bblack@neodymium:~$ time curl -sH 'X-Forwarded-Proto: https' http://test.wikipedia.org/wiki/NotARedirtest --resolve test.wikipedia.org:80:10.2.2.1 >/dev/null [23:02:33] real 0m11.319s [23:02:44] bblack: the curl is properly cached isn't it ? [23:03:05] that particular curl is straight to appservers.svc (10.2.2.1), so behind varnish, but randomly to any appserver [23:03:08] ah no direct [23:04:24] thcipriani: can you review https://gerrit.wikimedia.org/r/#/c/268315/ ? that is to have solely test/test2 on wmf12 [23:04:36] hashar: bd808: I am half way through installing mediawiki and it is loading fine. So it is either resources/ or an extension. [23:04:39] wikiversions.json is live hacked on mira currently [23:05:01] paladox: yeah we would need to profile the request on our servers [23:05:02] hashar: bd808: I would say disable extensions one by one until you find the problem extension. [23:05:07] hashar: Ok. 
[23:05:22] paladox: we have other magic tricks to track such issues :D [23:05:29] "scap bisect" [23:05:31] and a shit ton of smart folks [23:05:31] bblack@neodymium:~$ time curl -sH 'X-Forwarded-Proto: https' http://test.wikipedia.org/ --resolve test.wikipedia.org:80:10.2.2.1 >/dev/null [23:05:34] ori: lol [23:05:34] real 0m13.179s [23:05:35] (CR) Thcipriani: [C: 2] Only keep testwiki test2wiki 1.20.7-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268315 (https://phabricator.wikimedia.org/T125727) (owner: Hashar) [23:05:36] hashar: Oh how. [23:05:45] ^ 13s response time on just a 301 to main page [23:05:58] \o/ [23:06:00] (Merged) jenkins-bot: Only keep testwiki test2wiki 1.20.7-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268315 (https://phabricator.wikimedia.org/T125727) (owner: Hashar) [23:06:11] thcipriani: thanks clearing up mira [23:06:29] (PS2) Dzahn: osm: fix top-scope var without namespace, rm cruft [puppet] - https://gerrit.wikimedia.org/r/266971 [23:07:14] !log rebooting wdqs1001 for kernel update [23:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:33] (CR) Dzahn: osm: fix top-scope var without namespace, rm cruft (1 comment) [puppet] - https://gerrit.wikimedia.org/r/266971 (owner: Dzahn) [23:07:33] !log hashar@mira rebuilt wikiversions.php and synchronized wikiversions files: Clarify only testwiki and test2wiki are on php-1.27.0-wmf.12 [23:07:34] can we fix the redis connection issues to the rdb1* servers? 
[23:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:03] (PS3) Dzahn: osm: fix top-scope var without namespace, rm cruft [puppet] - https://gerrit.wikimedia.org/r/266971 [23:08:35] !log Full script of my deployment session is on mira.codfw.wmnet:/home/hashar/wmf12-deploy.script [23:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:54] (PS2) Subramanya Sastry: parsoid-rt-client: Have testreduce clients use global parsoid service [puppet] - https://gerrit.wikimedia.org/r/268230 [23:09:35] hashar: bd808: Could it be visualeditor since i get client-js ve-not-available in the html class. [23:09:51] (CR) Dzahn: [C: 2] osm: fix top-scope var without namespace, rm cruft [puppet] - https://gerrit.wikimedia.org/r/266971 (owner: Dzahn) [23:10:09] bd808: is there a task for that? [23:10:16] * bd808 looks [23:10:17] (if not, can you file one?) [23:11:00] there was a parsoid deploy about 15 minutes ago [23:11:03] possible? [23:12:41] (CR) Subramanya Sastry: parsoid-rt-client: Have testreduce clients use global parsoid service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/268230 (owner: Subramanya Sastry) [23:12:43] (PS4) Dzahn: dataset: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266966 [23:13:40] hashar: Could this fix visual editor https://git.wikimedia.org/commitdiff/mediawiki%2Fextensions%2FVisualEditor.git/7099ae0eef7ce8531db724b4e73e588ed41859ce [23:14:00] ori: https://phabricator.wikimedia.org/T125735 [23:14:06] if it is just test and test2 that have been updated to wmf12, and it is only us using them, then maybe the HHVM JIT is still running the code in the interpreter and tracing it [23:14:21] repeat requests seem faster, even if i add a cache-busting query-string [23:14:55] oh on another note [23:15:06] bd808: excellent (and interesting) thanks [23:15:15] beta pages are just fine. 
and afaik no slowness reports have been mentioned [23:15:23] !log rebooting wdqs1002 for kernel update [23:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:35] so maybe I screwed up the wmf12 deployment/file generation somehow. The l10n cache comes to mind [23:17:17] hashar: If you're worried about l10n cache you could scap again [23:17:35] yeah did it a second time already [23:18:19] I think it was the JIT [23:18:22] I can't reproduce the slowness [23:18:24] can anyone? [23:18:24] scap should rebuild all l10n json files before the cluster syncs so that should be fine [23:18:51] so hhvm warming up its cache ? [23:18:57] JIT threshold + few users + many app servers = high likelihood of slow request [23:18:58] yes [23:19:22] so that would only happen when we roll a new branch? [23:19:27] there's what, a half dozen of us making requests to test and test2, plus a handful of random users [23:19:39] the requests that we're making are distributed over 200+ app servers [23:19:57] so there's a high chance that your request is hitting an app server that is translating the code in the wmf12 branch for the first time [23:19:59] it does seem to be getting better and better [23:20:02] is a single hit enough to fully populate the jit / cache whatever? [23:20:21] no, it's something like 11 IIRC [23:20:38] first hit would prime the apc cache equivalent. it takes 11+ to fully warm up the JIT [23:20:51] so by only deploying to testwiki , that hasn't attracted enough traffic [23:20:53] and I freaked out [23:21:06] Yes, I think so. I freaked out too, but I think this is what happened. [23:21:22] whereas had I deployed to mediawikiwiki that would have populated much faster [23:21:29] yep. [23:21:30] this is our first full scap with testwiki distributed to the cluster too correct? 
[23:21:54] first full branch maybe; I think there have been other scaps [23:21:57] (03PS2) 10Dzahn: grafana: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266978 [23:21:58] not sure [23:22:18] unfortunately, I think this is the first branch cut since that happened [23:22:22] (03CR) 10Dzahn: grafana: fix top-scope var without namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266978 (owner: 10Dzahn) [23:22:23] maybe they got rolled out on all of group0 [23:22:25] yeah [23:22:47] !log depooling restbase2001 for kernel/Java update [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:00] first branch for sure because the change was in part inspired by wmf.11 [23:23:07] right [23:23:25] if you could summarize on https://phabricator.wikimedia.org/T125727 [23:23:27] at any rate, I think it is fine to roll out wmf12 to all group0 wikis [23:23:32] thcipriani: I guess you can do rest of group0 [23:23:33] Special:Random on testwiki is getting more consistently fast for me [23:23:37] hashar: ack, doing. [23:23:37] I'll update the task [23:23:40] oh [23:23:41] ok [23:23:45] thcipriani: leaving it to you then [23:23:56] and continue as planned, hold half an hour then move to group1 [23:23:58] I like ori's theory [23:24:23] someday we will find a way to depool and warm the JIT before repooling [23:24:32] sorry for the waste of time / freak out etc [23:24:40] no worries hashar [23:24:48] ori: I was just talking about the group0 rollout :) [23:24:52] many of us were concerned [23:24:52] isn't the JIT stored in a sqlite file? [23:24:59] hashar: I actually really appreciate you slamming the brakes on a deployment because of a perf hit [23:24:59] or the hhvm sqlite file is unrelated [23:25:22] yeah the cache is a sqlite file [23:25:39] ori: well a few milliseconds on dom loaded I would not have noticed. 
But seconds, I do :-} [23:25:44] and if we had RepoAuthoritative it would be primed on deploy, but that's a whole other problem [23:26:55] the deploy would become: mysqldump -H deploy | mysql -H 127.0.0.1 hhvmcache :D [23:27:08] yet another challenge [23:27:19] --single-transaction [23:27:28] it seems that if i set the wiki language to, for example, ja and in my user account i set it to en, it causes the links to go all funny and error out. [23:28:02] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: Connection refused [23:28:17] hashar: the seconds were spent crawling, parsing and translating the entire wmf12 source tree [23:28:36] ^that's me, forgot to silence icinga, fixing now [23:29:04] with mediawikiwiki getting substantial organic traffic, and testwiki being on mw1017 only (until recently), the chances of your request being the one that has to do all of that work were very small [23:29:22] !log passing wmf12 responsibility to thcipriani . Crashing to bed myself. [23:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:31] (03PS2) 10Dzahn: varnish: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266977 [23:31:05] ori: yeah that totally makes sense. So I guess we will want to sync the whole group0 instead of just test/test2 [23:31:12] hashar: yep [23:32:47] (03CR) 10Dzahn: [C: 032] varnish: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266977 (owner: 10Dzahn) [23:33:27] (03PS1) 10Thcipriani: group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268321 [23:33:48] thank you all for stepping in. Much appreciated :-} [23:33:56] (03PS2) 10Dzahn: mediawiki/jobrunner: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266967 [23:34:00] have a good rest of deploy, I am heading to bed myself. 
[23:34:08] is there an easy way to get a list of all the MW servers that are handling web requests? [23:34:14] good night hashar [23:34:21] hashar: goodnight! thanks! [23:34:35] (03CR) 10Thcipriani: [C: 032] group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268321 (owner: 10Thcipriani) [23:34:57] bd808: pointed out earlier today by ops: http://config-master.wikimedia.org/conftool/eqiad/ [23:34:59] (03Merged) 10jenkins-bot: group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268321 (owner: 10Thcipriani) [23:35:21] bd808: since varnish / app pooling is handled by conftool, we can just look at conftool files :-} [23:35:24] hashar: awesome. [23:35:39] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts: PHP 5.5.9 seems to have issues parsing some argumentless command line parameters - https://phabricator.wikimedia.org/T125748#1996224 (10Krenair) 3NEW [23:35:45] someone had a better, faster command but that url is in my history. Might be worth asking on the ops list [23:36:03] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.12 [23:36:03] and ideally we would want a small web app on top of that [23:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:19] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Newphp, 7Upstream: PHP 5.5.9 seems to have issues parsing some argumentless command line parameters - https://phabricator.wikimedia.org/T125748#1996236 (10Krenair) [23:36:37] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Upstream: PHP 5.5.9 seems to have issues parsing some argumentless command line parameters - https://phabricator.wikimedia.org/T125748#1996224 (10Krenair) [23:36:51] bd808: depooled hosts have 'enabled': False [23:37:21] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.038 second response time on port 9042 
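A minimal sketch of the "just look at conftool files" idea for listing the servers that are handling web requests. The only detail taken from the discussion is that depooled hosts carry 'enabled': False. The overall shape of the data (a mapping from hostname to an attribute dict), the function name, and the example hostnames are assumptions made for illustration; the real files under the conftool URL may be laid out differently and would need to be fetched and parsed first.

```python
from typing import Dict, List


def pooled_hosts(pool: Dict[str, dict]) -> List[str]:
    """Return the hostnames whose entry is marked as enabled.

    `pool` is assumed to map hostname -> attributes, where depooled hosts
    have 'enabled': False. Hosts with no 'enabled' key are treated as
    depooled, which is a conservative choice made for this sketch.
    """
    return sorted(host for host, attrs in pool.items() if attrs.get("enabled", False))


if __name__ == "__main__":
    # Hypothetical data, not copied from any real conftool file.
    example_pool = {
        "app-example-01": {"enabled": True, "weight": 10},
        "app-example-02": {"enabled": False, "weight": 10},
        "app-example-03": {"enabled": True, "weight": 20},
    }
    for host in pooled_hosts(example_pool):
        print(host)
```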
[23:39:04] !log repooling restbase2001 , depooling restbase2002 for kernel/Java update [23:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:02] 6operations, 6Labs, 10Labs-Infrastructure, 10Tool-Labs: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996302 (10Dzahn) 3NEW [23:45:41] 6operations, 6Labs, 10Labs-Infrastructure, 10Tool-Labs: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996310 (10Dzahn) since it says "was exit-code" it looks more like a typo in the monitoring script ? [23:50:29] 6operations, 6Labs, 10Labs-Infrastructure, 10Tool-Labs: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996327 (10Dzahn) on neon i can see the check commands used: ``` @neon:/etc/icinga# grep check_replicate puppet_services.cfg check_command nrpe_check!check_re... [23:53:05] !log repooling restbase2002 , depooling restbase2003 for kernel/Java update [23:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:36] (03PS1) 10Andrew Bogott: Switch keystone to mysql assignment from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/268325 (https://phabricator.wikimedia.org/T115029) [23:59:22] (03PS1) 10Subramanya Sastry: parsoid-vd-client: Add missing PATH environment var to script [puppet] - 10https://gerrit.wikimedia.org/r/268326