[00:00:43] Teamwork makes the dream work. [00:00:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [00:01:00] mutante: Niharika's wikitech user is Niharika29 [00:01:31] bd808: thanks. looked like she has Niharika and Niharika29 [00:01:42] oh? weird [00:01:53] Niharika29 is the one that had rights on the commtech project [00:02:01] s/had/has/ [00:02:11] MaxSem: where is 5a6f2e698412e049ac32e48e28f23f111a86f2b6? [00:02:22] bd808: gotcha, ok [00:02:36] whaaa? [00:02:37] MaxSem: in /srv/mediawiki-staging/php-1.28.0-wmf.13/extensions/timeline: fatal: bad object 5a6f2e698412e049ac32e48e28f23f111a86f2b6 [00:02:42] git fetch --all [00:02:45] it's the .14 branch [00:03:09] fetch --all isn't a thing :P [00:03:33] okay, so now it works and it's the right commit Creating new wmf/1.28.0-wmf.14 branch [00:03:36] (03PS2) 10Reedy: Remove timeline old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303303 (https://phabricator.wikimedia.org/T140852) [00:03:45] reedy@ubuntu64-web-esxi:~/git/operations/mediawiki-config$ git fetch --all [00:03:45] Fetching origin [00:03:46] wfm [00:03:55] (03CR) 10Reedy: Remove timeline old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303303 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [00:04:08] Anyway ^ is the fix after the branch update is done [00:04:16] Lovingly prepared in advance [00:06:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [00:07:22] live on mw1099 [00:11:42] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/timeline: https://gerrit.wikimedia.org/r/#/c/303947/ (duration: 00m 48s) [00:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:14:17] (03CR) 10MaxSem: [C: 032] Remove timeline old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303303 
(https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [00:14:45] (03Merged) 10jenkins-bot: Remove timeline old config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303303 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [00:14:56] Thanks MaxSem and Dereckson :) [00:16:55] Reedy, https://test.wikipedia.org/wiki/User:MaxSem/sandbox [00:18:13] Looks like a typo [00:18:43] global $wgtimelineFontFile; [00:18:51] that probably doesn't help either [00:18:55] geee [00:19:50] TimelineTimeline [00:19:54] at least prod wikis aren't on fire [00:20:37] https://gerrit.wikimedia.org/r/303949 [00:20:55] (03PS3) 10Dzahn: admin: add shell account for Niharika Kohli [puppet] - 10https://gerrit.wikimedia.org/r/303543 (https://phabricator.wikimedia.org/T141593) (owner: 10Ema) [00:24:27] (03PS1) 10Yuvipanda: labspuppetbackend: Switch from json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/303954 [00:24:36] MaxSem: i guess that swat is not gonna happen now? [00:24:38] :) [00:25:22] !log maxsem@tin Synchronized php-1.28.0-wmf.14/extensions/timeline: https://gerrit.wikimedia.org/r/#/c/303950/ (duration: 00m 47s) [00:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:36] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/timeline: https://gerrit.wikimedia.org/r/#/c/303952/ (duration: 00m 50s) [00:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:13] (03PS1) 10Thcipriani: Stop icinga git remote update [puppet] - 10https://gerrit.wikimedia.org/r/303955 (https://phabricator.wikimedia.org/T127093) [00:29:22] jdlrobson: your patch can be deployed after extensions issues are solved [00:29:41] Reedy, still fubar [00:31:01] I wonder if https://gerrit.wikimedia.org/r/#/c/303303/ can't be tested cuz it's async [00:32:43] (03CR) 10Dzahn: [C: 032] admin: add shell account for Niharika Kohli [puppet] - 10https://gerrit.wikimedia.org/r/303543 
(https://phabricator.wikimedia.org/T141593) (owner: 10Ema) [00:32:53] (03PS4) 10Dzahn: admin: add shell account for Niharika Kohli [puppet] - 10https://gerrit.wikimedia.org/r/303543 (https://phabricator.wikimedia.org/T141593) (owner: 10Ema) [00:33:28] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/303303/ (duration: 00m 47s) [00:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:03] oops [00:34:07] we're broken [00:34:19] Reedy, any ideas? [00:35:15] It kinda looks like it's not writing the file out [00:35:58] still Undefined variable: wgTimelinePloticusCommand [00:36:19] caching? [00:36:40] log caching? these still appear [00:36:46] no [00:36:50] opcache caching [00:37:05] oh ffs [00:37:07] last time, 2 hours were needed to make a notice in CS disappear [00:37:11] missing global $TimelinePloticusCommand; [00:37:14] with a wg [00:37:46] (03PS1) 10Sbisson: Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) [00:41:11] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/timeline: (no message) (duration: 00m 47s) [00:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:00] !log reedy@tin Synchronized php-1.28.0-wmf.14/extensions/timeline/Timeline.body.php: (no message) (duration: 00m 48s) [00:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:16] https://test.wikipedia.org/wiki/User:MaxSem/sandbox [00:42:17] so pretty [00:45:41] 06Operations, 10Ops-Access-Requests: Requesting deployment access for Niharika - https://phabricator.wikimedia.org/T141593#2538859 (10Dzahn) ``` [bast1001:~] $ id niharika29 uid=4103(niharika29) gid=500(wikidev) groups=500(wikidev),705(deployment),707(bastiononly) ``` [tin:~] $ id niharika29 uid=4103(niharika... 
[00:46:09] 06Operations, 10Ops-Access-Requests: Requesting deployment access for Niharika - https://phabricator.wikimedia.org/T141593#2538861 (10Dzahn) 05Open>03Resolved a:03Dzahn [00:48:34] Niharika: you are a deployer now [00:49:10] mutante: Thank you! [00:49:59] Niharika: you are welcome, the username is _with_ the 29 at the end [00:50:22] mutante: Yes, that's the account I use. [00:50:33] great [01:02:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [01:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [01:11:44] no more wg issue in the logs [01:12:19] but 74 not find/open font (unifont-5.1.20080907) (width calc) 28 not find/open font (unifont-5.1.20080907) [01:12:50] that's the font requested by easytimeline [01:18:26] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1097.40 seconds [01:20:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [01:29:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [01:38:07] (03PS3) 10Yuvipanda: tools: Expose etcd /metrics end point to everywhere [puppet] - 10https://gerrit.wikimedia.org/r/303759 [01:38:13] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Expose etcd /metrics end point to everywhere [puppet] - 10https://gerrit.wikimedia.org/r/303759 (owner: 10Yuvipanda) [01:38:25] (03PS4) 10Yuvipanda: tools: Scrape etcd metrics too [puppet] - 10https://gerrit.wikimedia.org/r/303762 [01:38:30] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Scrape etcd metrics too [puppet] - 10https://gerrit.wikimedia.org/r/303762 (owner: 10Yuvipanda) [01:38:42] (03PS2) 10Yuvipanda: prometheus: Don't require scrape config files to start with node_ [puppet] - 
10https://gerrit.wikimedia.org/r/303765 [01:38:46] (03CR) 10Yuvipanda: [C: 032 V: 032] prometheus: Don't require scrape config files to start with node_ [puppet] - 10https://gerrit.wikimedia.org/r/303765 (owner: 10Yuvipanda) [01:43:36] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 58.04 seconds [02:12:26] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [02:14:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [02:40:00] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.13) (duration: 17m 50s) [02:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:15:03] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.14) (duration: 18m 02s) [03:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:22:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 10 03:22:29 UTC 2016 (duration 7m 26s) [03:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:24:48] 06Operations, 10MediaWiki-Database, 06Performance-Team: periodic spike of MW exceptions "DB connection was already closed or the connection dropped." 
- https://phabricator.wikimedia.org/T142079#2538941 (10aaron) [04:11:40] 06Operations, 10MediaWiki-extensions-BounceHandler, 13Patch-For-Review: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#2538985 (10Legoktm) 05Open>03Resolved [05:39:17] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [05:41:07] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:32:55] (03PS2) 10Muehlenhoff: openldap::labtest: Restrict to production/labs networks [puppet] - 10https://gerrit.wikimedia.org/r/303839 [06:39:26] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [06:41:16] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:43:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] openldap::labtest: Restrict to production/labs networks [puppet] - 10https://gerrit.wikimedia.org/r/303839 (owner: 10Muehlenhoff) [06:54:56] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [06:56:37] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [07:02:25] (03PS1) 10Yuvipanda: prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 [07:17:19] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail [07:40:13] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Remove rhodium.eqiad.wmnet auth rule [puppet] - 10https://gerrit.wikimedia.org/r/303842 (owner: 10Alex Monk) [07:41:30] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Split secure_private from is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303758 (owner: 10Alex Monk) [07:41:35] (03PS3) 10Alexandros Kosiaris: 
puppetmaster: Split secure_private from is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303758 (owner: 10Alex Monk) [07:41:39] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: Split secure_private from is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303758 (owner: 10Alex Monk) [07:42:07] (03PS2) 10Alexandros Kosiaris: puppetmaster: Remove rhodium.eqiad.wmnet auth rule [puppet] - 10https://gerrit.wikimedia.org/r/303842 (owner: 10Alex Monk) [07:42:09] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: Remove rhodium.eqiad.wmnet auth rule [puppet] - 10https://gerrit.wikimedia.org/r/303842 (owner: 10Alex Monk) [07:44:37] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:52:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] puppetmaster: Split extra_auth_rules from is_labs_master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/303757 (owner: 10Alex Monk) [07:52:59] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: Kill is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303761 (owner: 10Alex Monk) [07:55:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Block per the comment in https://gerrit.wikimedia.org/r/#/c/302881/1 (rhodium)" [dns] - 10https://gerrit.wikimedia.org/r/302757 (owner: 10Dzahn) [07:55:38] (03CR) 10Alexandros Kosiaris: [C: 031] alsafi: add missing IPV6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302868 (owner: 10Dzahn) [08:00:37] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave] [08:00:56] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0 xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave] xe-0/0/1: down - Core: 
cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave] [08:01:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave] [08:01:37] (03CR) 10Alexandros Kosiaris: maps - osm-initial-import fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303816 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [08:03:48] (03CR) 10Gehel: maps - osm-initial-import fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303816 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [08:14:07] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [08:14:26] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:14:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:16:47] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [08:18:55] (03CR) 10Elukey: [C: 031] "Looks good to me and pcc is also happy https://puppet-compiler.wmflabs.org/3648/" [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [08:26:38] (03PS3) 10Gehel: maps - osm-initial-import fixes [puppet] - 10https://gerrit.wikimedia.org/r/303816 (https://phabricator.wikimedia.org/T138092) [08:35:06] 06Operations, 06Discovery, 10Elasticsearch, 10netops, 03Discovery-Search-Sprint: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2539191 (10dcausse) Thanks! 
[08:38:14] (03PS1) 10Volans: Monitoring: avoid NRPE limit for RAID get status [puppet] - 10https://gerrit.wikimedia.org/r/303992 (https://phabricator.wikimedia.org/T142085) [08:40:49] (03PS2) 10Volans: Monitoring: avoid NRPE limit for RAID get status [puppet] - 10https://gerrit.wikimedia.org/r/303992 (https://phabricator.wikimedia.org/T142085) [08:44:17] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:45:37] (03PS1) 10Giuseppe Lavagetto: Add a simple TLS-terminating reverse proxy class [puppet/nginx] - 10https://gerrit.wikimedia.org/r/303998 [08:45:45] <_joe_> gehel: ^^ [08:46:00] * gehel looking... [08:46:18] 06Operations, 10Analytics, 10Traffic: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2539220 (10elukey) @BBlack thanks a lot! I checked https://grafana.wikimedia.org/dashboard/db/varnishkafka and everything looks good, plus the cache_status field looks sane from... [08:47:54] (03PS2) 10Giuseppe Lavagetto: Add a simple TLS-terminating reverse proxy class [puppet/nginx] - 10https://gerrit.wikimedia.org/r/303998 [08:53:43] _joe_: probably a stupid question, but why did you remove the server_name? [08:54:27] _joe_: in the case of elasticsearch, no server_name was simpler as we want to access that host with either the service name or with the node name, but for the generic case, I'm wondering... [08:55:09] <_joe_> gehel: that's supposed to be a simple 1:1 proxy [08:55:17] <_joe_> so it's bound to be the default site [08:55:33] simpler is better... [08:55:36] (03CR) 10Gehel: [C: 032] Add a simple TLS-terminating reverse proxy class [puppet/nginx] - 10https://gerrit.wikimedia.org/r/303998 (owner: 10Giuseppe Lavagetto) [08:55:52] _joe_: sorry, I wanted to +1, but slipped... [08:56:13] <_joe_> gehel: no problem, it's a submodule so now I need to do the usual dance anyways [08:56:44] _joe_: good dancing! 
[08:57:00] _joe_: I'll update elastic once this is merged [09:00:02] (03PS1) 10Giuseppe Lavagetto: nginx: add simple TLS proxy class [puppet] - 10https://gerrit.wikimedia.org/r/304003 [09:01:30] (03CR) 10Giuseppe Lavagetto: [C: 032] nginx: add simple TLS proxy class [puppet] - 10https://gerrit.wikimedia.org/r/304003 (owner: 10Giuseppe Lavagetto) [09:01:40] (03CR) 10Giuseppe Lavagetto: [V: 032] nginx: add simple TLS proxy class [puppet] - 10https://gerrit.wikimedia.org/r/304003 (owner: 10Giuseppe Lavagetto) [09:04:30] (03CR) 10Jcrespo: "> eh.. does this mean you want it to have that additonal manual "git pull" step on the server and you ran that after each merge?" [puppet] - 10https://gerrit.wikimedia.org/r/303719 (owner: 10Dzahn) [09:09:46] !log rebooting hafnium for kernel update to 4.4 [09:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:06] 06Operations: create puppetDB puppet role + debian package - https://phabricator.wikimedia.org/T142363#2539288 (10Joe) a:03Joe [09:28:25] 06Operations, 06WMF-Legal, 06WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2539300 (10Aklapper) @ZhouZ: Hmm, let me try to rephrase the problem: >>! In T98722#2534884, @ZhouZ wrote: > So everyone who has a @wikimedia.org account should be on a NDA and can b... 
[09:41:41] <_joe_> !log uploaded puppetdb deb packages for jessie (T142363) [09:41:42] T142363: create puppetDB puppet role + debian package - https://phabricator.wikimedia.org/T142363 [09:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:34] !log rebooting pollux for kernel update to 4.4 [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:39] !log rebooting dubnium for kernel update to 4.4 [09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:22] (03PS1) 10Gehel: Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 [10:04:24] (03CR) 10jenkins-bot: [V: 04-1] Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 (owner: 10Gehel) [10:04:25] 06Operations, 10Ops-Access-Requests, 10Fundraising-Backlog: Access request: AWight access to iridium - https://phabricator.wikimedia.org/T142446#2535387 (10akosiaris) @maxsem is correct. Using the HTTP requests from Hadoop is way better as an alternative. In fact, in previous security related cases, ops did... 
[10:04:33] 06Operations, 10Ops-Access-Requests, 10Fundraising-Backlog: Access request: AWight access to iridium - https://phabricator.wikimedia.org/T142446#2539358 (10akosiaris) p:05Triage>03Low [10:05:40] (03PS2) 10Gehel: Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 [10:07:44] (03PS3) 10Gehel: Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 [10:18:34] (03PS4) 10Gehel: Elasticsearch now uses the more generic nginx::simple_tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/304010 [10:20:47] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3651/" [puppet] - 10https://gerrit.wikimedia.org/r/304010 (owner: 10Gehel) [10:21:29] (03CR) 10Alexandros Kosiaris: "Yeah, indeed in apache 2.4 NameVirtualHost is deprecated and redundant. Sure let's go for it" [puppet] - 10https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: 10Dzahn) [10:23:39] (03PS3) 10Giuseppe Lavagetto: postgresql: support SSL connections/replication [puppet] - 10https://gerrit.wikimedia.org/r/303800 [10:23:41] (03PS4) 10Giuseppe Lavagetto: puppetmaster: add role for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) [10:25:05] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add role for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) (owner: 10Giuseppe Lavagetto) [10:42:03] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [11:07:54] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:29:03] !log T140443 uploaded to apt.wikimedia.org trusty-wikimedia: php-wikidiff2_1.4.0 [11:29:05] T140443: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443 [11:29:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:30:25] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2464963 (10akosiaris) Version 1.4.0 is on apt.wikimedia.org and has been deployed to mw1017 for testing. Let's evaluate and then move forward with the full fl... [11:31:06] (03PS1) 10ArielGlenn: make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 [11:55:59] (03CR) 10Mobrovac: [C: 04-1] "I am not sure this is the good way to go about it. Hiding the config from the world means that deployers and maintainers won't be able to " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [11:56:00] 06Operations: Disable unprivileged user namespaces on trusty kernels - https://phabricator.wikimedia.org/T142567#2539543 (10MoritzMuehlenhoff) [12:04:10] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2537219 (10akosiaris) Hello @ovasileva, I 've see you 've already read and signed L3 per https://wikitech.wikimedia.org/wiki/Production_shell_access#Requesting_ac... 
[12:15:19] (03PS2) 10ArielGlenn: make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 [12:19:05] !log T140443 uploaded to apt.wikimedia.org trusty-wikimedia: php-wikidiff2_1.4.1 [12:19:06] T140443: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443 [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:46] (03PS3) 10ArielGlenn: make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 [12:28:35] (03PS1) 10Alexandros Kosiaris: Grant jdlrobson access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/304017 (https://phabricator.wikimedia.org/T141811) [12:28:42] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2539684 (10akosiaris) p:05Triage>03Normal [12:42:52] 06Operations, 06Discovery, 10Traffic, 03Discovery-Search-Sprint: Setup load balancing for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2539708 (10Gehel) 05Open>03declined Since we don't need load balancing, let's close this. If the need comes at some point, we w... [12:43:46] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2505524 (10akosiaris) Hello, request looks good, but for technical reasons (avoiding uid/gid mess) we also need the above... [12:47:22] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2539714 (10akosiaris) @robh @mutante: I suppose you have access to do this ? 
[12:47:40] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2539716 (10mobrovac) @Ottomata any update on this? Is SIGHUP being correctly propagated now? [12:47:43] 06Operations, 10Ops-Access-Requests, 10Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522728 (10akosiaris) aqs-users was introduced in T117473. Seems to have been created with something different in mind. I 'll rather c... [12:49:44] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2539721 (10Joe) @mobrovac I already know what to do in order to fix it. At the moment all logs are registered as "firejail". [12:50:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2539722 (10mobrovac) Duh! Good point. So obvious ... [12:51:44] (03PS1) 10Yurik: Maps: Added list of cassandra servers [puppet] - 10https://gerrit.wikimedia.org/r/304019 [12:51:48] gehel, ^ [12:52:08] could you change it so that it works for each of the clusters? [12:55:45] (03PS1) 10Alexandros Kosiaris: Create deploy-aqs group [puppet] - 10https://gerrit.wikimedia.org/r/304020 (https://phabricator.wikimedia.org/T142101) [12:57:41] !log reboot rdb2001-rdb2004 for updates to Linux 4.4 [12:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:22] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2539737 (10Ottomata) The fix has been made, but not deployed to eqiad. We were waiting to make sure the kafka client issues settled before we did a depl... 
[13:17:51] (03PS1) 10BBlack: rcstream: remove internal TLS listener [puppet] - 10https://gerrit.wikimedia.org/r/304023 (https://phabricator.wikimedia.org/T134871) [13:19:50] 06Operations, 06Discovery, 06Maps, 10Maps-data, and 2 others: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2539745 (10Gehel) p:05Triage>03High [13:23:04] (03PS1) 10Gehel: Maps - categorizing new eqiad slaves [puppet] - 10https://gerrit.wikimedia.org/r/304025 (https://phabricator.wikimedia.org/T138092) [13:27:41] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2539769 (10Danny_B) @akosiaris As shown in my previous post, many people do actually. (Might need a review, but that's not a part of this task.) [13:32:25] (03PS1) 10Volans: [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [13:33:31] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [13:37:20] (03PS2) 10Volans: [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [13:38:26] !log deploying eventbus service to kafka100[12], depooling, deploying, and repooling each one at at time [13:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:35] !log otto@palladium conftool action : set/pooled=no; selector: kafka1001.eqiad.wmnet [13:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:50] (03PS2) 10Gehel: Maps: Added list of cassandra servers [puppet] - 10https://gerrit.wikimedia.org/r/304019 (https://phabricator.wikimedia.org/T138092) (owner: 10Yurik) [13:41:08] (03PS5) 10Giuseppe Lavagetto: puppetmaster: 
add role for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) [13:42:20] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add role for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/303801 (https://phabricator.wikimedia.org/T142363) (owner: 10Giuseppe Lavagetto) [13:42:26] !log otto@palladium conftool action : set/pooled=yes; selector: kafka1001.eqiad.wmnet [13:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:44] (03CR) 10Yurik: [C: 031] Maps: Added list of cassandra servers [puppet] - 10https://gerrit.wikimedia.org/r/304019 (https://phabricator.wikimedia.org/T138092) (owner: 10Yurik) [13:43:06] !log otto@palladium conftool action : set/pooled=no; selector: kafka1002.eqiad.wmnet [13:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:00] !log otto@palladium conftool action : set/pooled=yes; selector: kafka1002.eqiad.wmnet [13:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:46] (03PS1) 10Giuseppe Lavagetto: service::node: expliticly set syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/304028 (https://phabricator.wikimedia.org/T137878) [13:48:55] <_joe_> mobrovac: ^^ that's the fix [13:50:27] (03CR) 10Mobrovac: [C: 031] service::node: expliticly set syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/304028 (https://phabricator.wikimedia.org/T137878) (owner: 10Giuseppe Lavagetto) [13:51:08] (03CR) 10Elukey: "Could be another option, maybe the problem is indeed mixing password/private stuff and configuration, but $something needs to be done sinc" [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [13:51:17] <_joe_> mobrovac: makes sense to you? 
[13:51:33] yup, _joe_, +1'ed it [13:51:39] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: expliticly set syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/304028 (https://phabricator.wikimedia.org/T137878) (owner: 10Giuseppe Lavagetto) [13:52:02] <_joe_> ok I'm going to apply it, it's simple enough that should not need extra care [13:52:13] kk [13:52:24] just don't restart changeprop _joe_, i'll do it [13:52:34] <_joe_> yeah I will just run puppet [13:52:35] it's better to stop both nodes and then start them up one by one [13:52:49] <_joe_> wat? [13:52:56] (03PS3) 10Volans: [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [13:54:21] !log depooling image scaler mw1298 for some local tests with huge SVGs [13:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:45] 06Operations, 10Traffic: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2539811 (10BBlack) MS has released some new rolled-up bugfixes for their Aug 9th "Patch Tuesday", which includes this gem that disables RC4 ciphersuites: https://support.microsoft.com/en-us... [13:55:01] <_joe_> mobrovac: a rolling restart of all parsoid will be needed too [13:55:36] can you do it or should i? [13:55:40] <_joe_> but the patch works [13:55:50] <_joe_> I can do it, of course, but later today [13:57:37] <_joe_> as in, I'll let puppet run everywhere first [13:57:43] <_joe_> rb too of course [14:00:38] i'll do scb and rb now (force puppet + restart what's needed) [14:02:05] _joe_: kk, it works! [14:02:08] _joe_: thnx! 
[14:02:30] <_joe_> mobrovac: I am already forcing puppet on restbases in codfw [14:02:39] <_joe_> I wanted to check everything was allright :P [14:02:46] hehe [14:02:47] kk [14:02:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2539818 (10mobrovac) 05Open>03Resolved [14:04:10] (03PS4) 10Volans: [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [14:05:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [14:11:09] 06Operations, 06Release-Engineering-Team: Manage Appveyor account - https://phabricator.wikimedia.org/T104306#2539826 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [14:14:13] (03PS5) 10Volans: [WIP] Monitoring: add event handler for RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) [14:14:17] (03PS1) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:15:38] (03CR) 10jenkins-bot: [V: 04-1] Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:15:42] (03PS2) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:16:27] (03PS3) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 
(https://phabricator.wikimedia.org/T138265) [14:16:34] (03PS3) 10Yurik: Maps: Added list of cassandra servers [puppet] - 10https://gerrit.wikimedia.org/r/304019 (https://phabricator.wikimedia.org/T138092) [14:17:42] (03CR) 10jenkins-bot: [V: 04-1] Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:18:58] (03PS4) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:19:16] (03PS4) 10Yurik: Maps: Added list of cassandra servers [puppet] - 10https://gerrit.wikimedia.org/r/304019 (https://phabricator.wikimedia.org/T138092) [14:20:01] (03CR) 10jenkins-bot: [V: 04-1] Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:20:36] (03PS1) 10Muehlenhoff: ganglia: Limit to production networks and fundraising networks [puppet] - 10https://gerrit.wikimedia.org/r/304032 [14:20:44] (03PS5) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:21:49] PROBLEM - Host mw2086 is DOWN: PING CRITICAL - Packet loss = 100% [14:23:14] (03PS2) 10Muehlenhoff: ganglia: Limit to production networks and fundraising networks [puppet] - 10https://gerrit.wikimedia.org/r/304032 [14:23:24] ^mw2086 is expected, forgot to silence [14:23:51] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2539872 (10mobrovac) 05Open>03Resolved a:03Ottomata Deployed, resolving. 
[14:26:36] (03PS6) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:26:50] (03PS1) 10Giuseppe Lavagetto: restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) [14:27:11] <_joe_> mobrovac: ^^ [14:27:20] opa! [14:30:05] (03PS2) 10Giuseppe Lavagetto: restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) [14:30:14] (03CR) 10Alexandros Kosiaris: [C: 031] ganglia: Limit to production networks and fundraising networks [puppet] - 10https://gerrit.wikimedia.org/r/304032 (owner: 10Muehlenhoff) [14:32:15] (03PS7) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:32:41] (03CR) 10Mobrovac: [C: 031] restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) (owner: 10Giuseppe Lavagetto) [14:35:38] (03PS8) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:37:24] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2539901 (10akosiaris) 05Open>03Resolved a:03akosiaris I admit I haven't managed to recover the entire pad. I 'll resolve :-( [14:40:39] !log depooling mw1261 to install/test apache2_2.4.10-10+deb8u6+wmf1_amd64.deb (T73487). After basic checks the host will get back into service with weight 5. 
[14:40:40] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [14:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:38] (03PS3) 10Giuseppe Lavagetto: restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) [14:48:25] (03PS9) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [14:52:30] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:53:37] this is me sorry [14:59:41] 06Operations, 10hardware-requests: Site: 8 hardware access request for ORES - https://phabricator.wikimedia.org/T142578#2539946 (10akosiaris) [15:00:05] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160810T1500). [15:00:05] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:02:35] I can SWAT today. James_F ping for SWAT. [15:02:39] RECOVERY - DPKG on mw1261 is OK: All packages OK [15:02:48] thcipriani: Heya. [15:02:56] howdy [15:03:19] thcipriani: BTW, will need to sync Init then the dblist. [15:03:26] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2539964 (10akosiaris) The box is fully part of the cluster now. I 've submitted some VM rebalancing jobs and we should be in a great state. Thanks Chris! [15:03:31] (When we get there.) 
[15:03:42] James_F: ack, thanks :) [15:03:48] 06Operations, 06Discovery, 06Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2539966 (10Gehel) a:05Cmjohnson>03Gehel Thanks to @faidon, it seems the issue is conflicting IP address configuration with ores1002. ores1002 seems to have been decommissionn... [15:03:49] (03PS2) 10Thcipriani: Enable VisualEditor by default for logged-in users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303586 (https://phabricator.wikimedia.org/T93387) (owner: 10Jforrester) [15:04:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303586 (https://phabricator.wikimedia.org/T93387) (owner: 10Jforrester) [15:04:28] (03Merged) 10jenkins-bot: Enable VisualEditor by default for logged-in users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303586 (https://phabricator.wikimedia.org/T93387) (owner: 10Jforrester) [15:05:00] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures [15:06:24] James_F: patch live on mw1099, check please [15:07:09] thcipriani: Yeah, LGTM. [15:07:41] ack, going out IS first, then dblist :) [15:07:50] Thanks. [15:10:15] (03CR) 10Jforrester: [C: 04-1] "Scheduled for 17 August." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303587 (https://phabricator.wikimedia.org/T93387) (owner: 10Jforrester) [15:10:29] Hey thcipriani is there any possibility I could be squeezed into the swat window? Mine got cancelled yesterday due to the issues [15:11:15] hrm. mw2086 having troubles? Deploy seems to be hanging there. [15:11:30] jdlrobson: sure. 
[15:11:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:303586|Enable VisualEditor by default for logged-in users on Arabic-script Wikipedias (T93387)]] PART I]] (duration: 02m 49s) [15:11:52] T93387: Enable VisualEditor by default for all users of all "phase 6" Wikipedias - https://phabricator.wikimedia.org/T93387 [15:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:05] (03PS4) 10Jdlrobson: Promote new language switcher to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303003 (https://phabricator.wikimedia.org/T129505) [15:12:12] ^ thcipriani that's the patch from yesterday [15:12:20] would you like me to put it on Deployment calendar for today [15:12:58] jdlrobson: yes please [15:13:13] thcipriani: done [15:13:32] thanks a bunch! [15:14:24] !log thcipriani@tin Synchronized dblists/visualeditor-nondefault.dblist: SWAT: [[gerrit:303586|Enable VisualEditor by default for logged-in users on Arabic-script Wikipedias (T93387)]] PART II]] (duration: 01m 40s) [15:14:25] T93387: Enable VisualEditor by default for all users of all "phase 6" Wikipedias - https://phabricator.wikimedia.org/T93387 [15:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:31] ^ James_F live everywhere [15:14:36] Thanks! [15:16:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303003 (https://phabricator.wikimedia.org/T129505) (owner: 10Jdlrobson) [15:16:33] (03Merged) 10jenkins-bot: Promote new language switcher to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303003 (https://phabricator.wikimedia.org/T129505) (owner: 10Jdlrobson) [15:17:17] jdlrobson: live on mw1099, check please [15:18:16] works! 
thanks thcipriani [15:18:23] please propagate everywhere [15:18:29] jdlrobson: ack, going everywhere [15:20:54] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:303003|Promote new language switcher to all wikis (T129505)]] (duration: 01m 12s) [15:20:56] T129505: Ship mobile web readily available language button placement affordance on Wednesday immediately following Tuesday single-language deployment - https://phabricator.wikimedia.org/T129505 [15:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:00] ^ jdlrobson live everywhere [15:21:24] and confirmed. thanks thcipriani ! :D [15:21:35] jdlrobson: awesome, thanks for checking :) [15:22:39] !log mw2086 ssh from tin as mwdeploy user failing. Will need to run 'scap pull' when it comes back online [15:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:04] (03CR) 10Volans: "@Faidon:" [puppet] - 10https://gerrit.wikimedia.org/r/304026 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [15:28:15] thcipriani: mw2086 is down for temporary hardware analysis by Papaul, should be back up soon [15:29:04] moritzm: ack, figured on something like that :) [15:30:18] (03CR) 10Liuxinyu970226: [C: 031] Enable VisualEditor by default for all users of the Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292747 (https://phabricator.wikimedia.org/T136996) (owner: 10Jforrester) [15:30:21] !log correct version installed on mw1261 is 2.4.10-10+deb8u6+wmf2 [15:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:31] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:43] (03PS4) 10Gehel: maps - osm-initial-import fixes [puppet] - 10https://gerrit.wikimedia.org/r/303816 (https://phabricator.wikimedia.org/T138092) [15:32:26] (03PS1) 10Hoo man: Enable the ArticlePlaceholder 
on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304035 (https://phabricator.wikimedia.org/T140725) [15:32:28] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304036 (https://phabricator.wikimedia.org/T142468) [15:36:59] (03CR) 10Gehel: [C: 032] maps - osm-initial-import fixes [puppet] - 10https://gerrit.wikimedia.org/r/303816 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [15:41:32] !log rebooting labvirt1014 to 3.16.0-77-generic for testing secgroup issues [15:41:34] (03PS4) 10Giuseppe Lavagetto: restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) [15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:14] (03Abandoned) 10Paladox: [Timeline] Update path to extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 (owner: 10Paladox) [15:45:10] PROBLEM - Host labvirt1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:10] RECOVERY - Host labvirt1014 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [15:46:58] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: allow pooling/depooling programmatically [puppet] - 10https://gerrit.wikimedia.org/r/304033 (https://phabricator.wikimedia.org/T140895) (owner: 10Giuseppe Lavagetto) [15:48:30] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3658/" [puppet] - 10https://gerrit.wikimedia.org/r/304019 (https://phabricator.wikimedia.org/T138092) (owner: 10Yurik) [15:50:32] <_joe_> !log restarting parsoid for T137878 [15:50:34] T137878: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878 [15:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:35] (03CR) 10Elukey: "Couple of comments but it looks good! I am not sure about the bit in which we define /srv/log/etc.. 
but I guess that it is the best compro" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [15:55:50] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw2087.codfw.wmnet because of too many down! [15:56:08] ^ ? [15:56:38] it's codfw, but still [16:00:02] (03Abandoned) 10Paladox: Block access to jice.ddns.net instead of ip [puppet] - 10https://gerrit.wikimedia.org/r/277904 (owner: 10Paladox) [16:00:05] hoo and frimelle: Dear anthropoid, the time has come. Please deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160810T1600). [16:03:10] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304035 (https://phabricator.wikimedia.org/T140725) (owner: 10Hoo man) [16:03:12] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2540089 (10Gehel) [16:03:36] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304035 (https://phabricator.wikimedia.org/T140725) (owner: 10Hoo man) [16:05:46] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2540108 (10Gehel) @dcausse, @EBernhardson can you confirm that nobelium isn't used anymore and that everything that was using it has moved to relforge? 
[16:05:51] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [16:07:41] (03CR) 10Gehel: mwgrep: fails gracefully when an invalid regex is provided (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) (owner: 10DCausse) [16:08:03] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on cywiki (T140725) (duration: 02m 49s) [16:08:04] T140725: Enable article placeholder on cyWiki - https://phabricator.wikimedia.org/T140725 [16:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:05] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304036 (https://phabricator.wikimedia.org/T142468) (owner: 10Hoo man) [16:09:31] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304036 (https://phabricator.wikimedia.org/T142468) (owner: 10Hoo man) [16:11:41] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [16:13:04] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on knwiki (T142468) (duration: 02m 55s) [16:13:05] T142468: Enable ArticlePlaceholder on knwiki - https://phabricator.wikimedia.org/T142468 [16:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:40] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:13:53] (03CR) 10Eevans: "> (1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [16:16:31] (03CR) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [16:22:39] (03CR) 
10Eevans: "https://gerrit.wikimedia.org/r/301425 took a slightly different approach to this by letting the kernel perform this function for all insta" [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [16:25:52] bblack, so there is ongoing flow-related maintenance, mlitn and matt_flaschen are involved [16:26:07] we are going to check 100% sure it is not a db issue [16:26:15] if not, probably it is a cache thing [16:26:55] and purging seams not to fix it; it is not urgent, but in case you had any advice [16:27:22] I will add you to the ticket as an observer [16:27:23] Not sure if Varnish is involved, though. May just be memcached. mlitn? [16:27:28] oh [16:27:38] I thought cache == varnish [16:28:01] then if it is another layer, we should wait [16:28:11] maybe even parsercache [16:28:27] (03PS2) 10Jforrester: Enable VisualEditor by default for logged-out users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303587 (https://phabricator.wikimedia.org/T142587) [16:29:14] matt_flaschen, feel free to give me more info and I will both check the database and the disk-app-cache to discard those [16:30:17] I work nicely offline (on a ticket), so just centralize there the requests [16:32:59] (03PS4) 10ArielGlenn: make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 [16:35:44] (03PS5) 10ArielGlenn: make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 [16:46:47] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2540437 (10JKatzWMF) Thanks @akosiaris I made developer access a more explicit requirement for obtaining shell access here: https://wikitech.wikimedia.org/wiki/Pro... 
[16:48:38] (03PS10) 10Ottomata: Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) [16:50:29] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: Allow RB to be programmatically pooled/depooled during restarts - https://phabricator.wikimedia.org/T140895#2540458 (10mobrovac) 05Open>03Resolved a:03mobrovac With the above patch merged and [PR #99](https://github.com/wikimedia/ansible-deploy/p... [16:51:25] (03CR) 10ArielGlenn: [C: 032] make the 'force' (lock stealing) option for dumps useful [dumps] - 10https://gerrit.wikimedia.org/r/304013 (owner: 10ArielGlenn) [16:55:36] 06Operations, 06Services, 06Services-next, 05Security, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2540504 (10mobrovac) [17:04:21] (03CR) 10Ottomata: [C: 032] Add error_output to eventlogging service and make eventbus write EventErrors to log file [puppet] - 10https://gerrit.wikimedia.org/r/304029 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [17:12:22] !log restarting eventbus in eqiad to apply error_output change [17:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:50] !log otto@palladium conftool action : set/pooled=no; selector: kafka1001.eqiad.wmnet [17:13:00] (03PS2) 10Sbisson: Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) [17:14:07] !log otto@palladium conftool action : set/pooled=yes; selector: kafka1001.eqiad.wmnet [17:14:27] !log otto@palladium conftool action : set/pooled=no; selector: kafka1002.eqiad.wmnet [17:15:41] !log otto@palladium conftool action : set/pooled=yes; selector: kafka1002.eqiad.wmnet [17:15:55] (03PS8) 10ArielGlenn: Make scheduler hupable. 
[dumps] - 10https://gerrit.wikimedia.org/r/302831 (https://phabricator.wikimedia.org/T142488) [17:17:26] AaronSchulz: hiiii [17:21:02] looking for review on this: https://gerrit.wikimedia.org/r/#/c/299008/ [17:26:07] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2540632 (10RobH) 05Open>03Resolved a:03RobH As @Danny_B points out, anyone with the full 'ARefiorstv' flags includes the ability to modify the flags list for the chann... [17:29:08] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2540666 (10RobH) a:05RobH>03None [17:29:44] jynus, shouldn't be parser cache, but I think it might be memcached. I think mlitn said he was going to follow up with you. [17:30:21] sure [17:30:36] jynus, BTW, is there any update on External Store? [17:31:06] I have yet to setup the backups, I have the hardware [17:31:26] jynus, OK, let me know if you want to discuss anything or anything is blocked on our end. [17:31:30] but I need to organize it- unlike this wiki [17:31:37] it is a ton of space [17:31:42] (03PS3) 10Addshore: Add new ssh key for Addshore [puppet] - 10https://gerrit.wikimedia.org/r/303863 [17:31:51] and some of it it is blocked by check scripts [17:32:05] It would be great if someone could +2 that one for me ^^ [17:32:27] and some hardware movements in the latest months [17:33:41] 06Operations, 06WMF-Legal, 06WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2540685 (10ZhouZ) Ok I see - I was mistaken then. It sounds like maybe the process should be someone should email an address requesting to be added by giving their Phabricator userna... 
[17:37:18] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2540707 (10DarTar) >>! In T141634#2536199, @MoritzMuehlenhoff wrote: > For clarification: While there's a defined offboar... [17:47:20] (03PS2) 10Yuvipanda: prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 [17:52:49] (03PS3) 10Yuvipanda: prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 [17:59:25] (03CR) 10BBlack: [C: 032] labs dnsrecursor metaldns: Change hook to ensure SOA records get passed properly but with NOERROR instead of NXDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/303833 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [17:59:33] (03PS2) 10BBlack: labs dnsrecursor metaldns: Change hook to ensure SOA records get passed properly but with NOERROR instead of NXDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/303833 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [17:59:46] (03PS4) 10Yuvipanda: prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 [17:59:59] (03CR) 10BBlack: [V: 032] labs dnsrecursor metaldns: Change hook to ensure SOA records get passed properly but with NOERROR instead of NXDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/303833 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [18:02:20] 06Operations, 06Services, 06Services-next, 05Security, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2540859 (10GWicke) To get information on the relative frequency of single-page vs. multi-page collection requests, I... 
[18:03:45] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2540879 (10akosiaris) >>! In T142270#2540632, @RobH wrote: > As @Danny_B points out, anyone with the +f flag includes the ability to modify the flags list for the channel.... [18:04:58] (03CR) 10Mobrovac: "I was thinking about that, but we have three groups of people: deployers, those with ssh access and finally service admins. Which group(s)" [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [18:07:08] (03PS4) 10Alex Monk: puppetmaster: Kill is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303761 [18:07:10] (03PS5) 10Alex Monk: puppetmaster: Split extra_auth_rules from is_labs_master [puppet] - 10https://gerrit.wikimedia.org/r/303757 [18:18:47] bblack, hey, thanks. did you run puppet on labservices100x? [18:19:09] (iirc that's where labs-recursor* run, I think?) [18:22:18] Krenair: nope [18:22:43] but by now it may have already run [18:22:55] maybe, guess we'll see [18:24:43] 06Operations, 06Services, 06Services-next, 05Security, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2541055 (10faidon) I'm with Subbu here, and specifically: >>! In T142226#2538081, @ssastry wrote: > All this is fin... [18:30:03] (03PS1) 10Andrew Bogott: nova: Increase rpc_response_timeout to 180 [puppet] - 10https://gerrit.wikimedia.org/r/304047 (https://phabricator.wikimedia.org/T142165) [18:34:04] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Incorrect text positioning in SVG rasterization (any extreme down scale) (fixed in upstream 2.40.13) - https://phabricator.wikimedia.org/T65703#2541077 (10kaldari) 05Open>03Resolved a:03kaldari Looks like this is fixed now from my tests. 
[18:46:56] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2541110 (10Eevans) To summarize a conversation with @mark on IRC, there are 10 systems total, at least 3 of which satisfy the following: * 2 ea. Intel(R) Xe... [18:48:34] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2541125 (10mark) a:05mark>03RobH Small correction: JBOD is not an option on this controller. :) You have to make RAID arrays, although single-disk RAID0... [18:50:01] (03PS1) 10Rush: Revert "labs dnsrecursor metaldns: Change hook to ensure SOA records get passed properly but with NOERROR instead of NXDOMAIN" [puppet] - 10https://gerrit.wikimedia.org/r/304049 [18:50:23] yuvipanda: https://gerrit.wikimedia.org/r/#/c/304049/ ping [18:50:36] 07Puppet, 10MediaWiki-extensions-ORES, 06Revision-Scoring-As-A-Service: Move vagrant role to use ores in production - https://phabricator.wikimedia.org/T142618#2541130 (10Ladsgroup) [18:51:04] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "labs dnsrecursor metaldns: Change hook to ensure SOA records get passed properly but with NOERROR instead of NXDOMAIN" [puppet] - 10https://gerrit.wikimedia.org/r/304049 (owner: 10Rush) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160810T1900). 
[19:00:59] wait a sec [19:02:05] I thought it's services deployment window :D [19:02:10] I should wait for an hour [19:03:22] (03CR) 10Andrew Bogott: [C: 032] nova: Increase rpc_response_timeout to 180 [puppet] - 10https://gerrit.wikimedia.org/r/304047 (https://phabricator.wikimedia.org/T142165) (owner: 10Andrew Bogott) [19:03:29] (03PS2) 10Andrew Bogott: nova: Increase rpc_response_timeout to 180 [puppet] - 10https://gerrit.wikimedia.org/r/304047 (https://phabricator.wikimedia.org/T142165) [19:10:50] (03PS9) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 (https://phabricator.wikimedia.org/T142488) [19:12:36] (03PS1) 10BBlack: CVE-2016-5696 mitigation [puppet] - 10https://gerrit.wikimedia.org/r/304050 [19:22:51] 06Operations, 06Editing-Department, 06Parsing-Team, 06Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2541236 (10mobrovac) 05Open>03Resolved a:03mobrovac [19:22:54] 06Operations, 10Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2532287 (10mmodell) I'm somewhat indifferent to having multiple bot accounts vs. a single account. I can think only two concerns to consider here: 1. Security of the api tokens. We might want... [19:23:26] Amir1: the train shouldn't take long. I'm deploying now. 
[19:24:21] thanks :) It was just this stupid time zone converting (specially for me it's UTC +4:30, complicating everything by an order of magnitude) [19:24:32] ewwe [19:24:40] that does seem complicated [19:24:53] and I thought daylight savings was bad [19:26:01] (03PS1) 1020after4: group1 wikis to 1.28.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304054 [19:26:48] (03CR) 1020after4: [C: 032] group1 wikis to 1.28.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304054 (owner: 1020after4) [19:27:14] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304054 (owner: 1020after4) [19:28:02] 06Operations, 10ops-eqiad: investigate ores1002 - not in racktables but shows up on switch - https://phabricator.wikimedia.org/T142621#2541264 (10RobH) [19:30:50] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:39] mw2086.codfw.wmnet failed to sync [19:31:56] !log sync-wikiversions failed for mw2086.codfw.wmnet [19:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:49] I have a super trivial change in mediawiki/vagrant. Anyone to merge it? https://gerrit.wikimedia.org/r/#/c/304052/1 [19:34:13] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:35:40] (03PS1) 10Dzahn: lists: renamed wikidata-l list, redirect archive URL [puppet] - 10https://gerrit.wikimedia.org/r/304055 (https://phabricator.wikimedia.org/T136798) [19:36:30] Amir1: +2 [19:36:46] twentyafterfour: awesome! 
Thanks [19:37:30] (03PS2) 10Dzahn: lists: renamed wikidata-l list, redirect archive URL [puppet] - 10https://gerrit.wikimedia.org/r/304055 (https://phabricator.wikimedia.org/T136798) [19:38:02] I'm seeing quite a few errors like this: "Model contains an error for 363175094: RevisionNotFound: Could not find revision ({revision}:363175094)" [19:38:50] (03CR) 10Paladox: [C: 031] lists: renamed wikidata-l list, redirect archive URL [puppet] - 10https://gerrit.wikimedia.org/r/304055 (https://phabricator.wikimedia.org/T136798) (owner: 10Dzahn) [19:38:51] twentyafterfour: It's old, it seems API acts crazy for the ores service [19:38:51] we were checking it today [19:39:15] sometimes = 1% [19:39:51] (03CR) 10BBlack: [C: 032] CVE-2016-5696 mitigation [puppet] - 10https://gerrit.wikimedia.org/r/304050 (owner: 10BBlack) [19:40:38] Amir1: self-merge on stuff like that in mw-vagrant is totally ok too. [19:41:58] bd808: I don't have the rights :) [19:43:21] Amir1: apparently I don't have the rights to change that in gerrit or I would [19:44:05] Amir1: get +2 in mw-core and then you will have it in mw-vagrant :) [19:44:53] It would be nice :) let me see what is the process [19:45:11] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:48:30] Amir1: maybe mysql related? I was seeing a bunch of mysql connection failures. Those usually have cascading effects in the error log (that is, mysql errors trigger a lot of other errors which aren't always obviously related) [19:49:10] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:50:25] twentyafterfour: it can be too. 
We should check for any related mysql error too [19:50:52] but I must note that the extension hits the service, the service hits the api and api calls mysql [19:51:21] so if api returns the revision not found, probably that's an issue in api [19:52:04] (by api I mean en.wikipedia.org/w/api.php [19:52:22] well it looks like it isn't related to the train so I'm not too worried ;) [19:52:40] 06Operations, 10Cassandra, 10hardware-requests, 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2541391 (10Eevans) One option that has been made available: There are 10 Varnish machines that are coming down in //esams//, 3 of which could be made... [19:53:35] :D [19:56:12] (03PS1) 10BBlack: ciphersuite: deprioritize non-AEAD AES256 [puppet] - 10https://gerrit.wikimedia.org/r/304059 [19:56:26] FWIW starting from wmf.15 if there is an error in the service, ORES sends out warning and retries instead of throwing exception and spamming in logs [19:56:45] we can back port it if it's making too much noise in error logs [19:58:47] (03CR) 10BBlack: [C: 032] ciphersuite: deprioritize non-AEAD AES256 [puppet] - 10https://gerrit.wikimedia.org/r/304059 (owner: 10BBlack) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160810T2000). [20:00:21] We have a deployment today for ores [20:00:49] I wait for others to comment if they have deployment [20:02:17] i am going to start parsoid deployment in a couple mins. [20:04:10] subbu: okay, tell me when you're done [20:04:33] will do. [20:05:38] Should I be worried about mw2086.codfw.wmnet no syncing? 
[20:05:41] !log starting parsoid deploy [20:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:32] twentyafterfour: eqiad is still primary, isn't it? [20:06:57] twentyafterfour: It looks like moritzm depooled it for reinstall [20:07:06] 12:26 moritzm: depooling image scalers mw2086-mw2089 for reimaging with jessie [20:07:07] Yesterday [20:07:20] !log finished syncing code; restarted parsoid on wtp1001 as canary [20:07:21] Reedy: ok so not worried about it ;) [20:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:29] thanks [20:07:31] I saw the same thing during SWAT. thcipriani: mw2086 is down for temporary hardware analysis by Papaul, should be back up soon [20:07:57] twentyafterfour: Well, slightly confusing why 87-89 are ok/didn't error [20:08:11] Might be worth filing a ticket for moritzm [20:08:28] canary lgtm .. restarting on all nodes. [20:09:19] twentyafterfour: the others are up. I'd file a ticket, nothing relevant in the SAL [20:09:55] (as to why it's down) [20:10:02] Reedy: see above ^ moritzm was at least aware of it this morning [20:10:10] (03CR) 10Mattflaschen: [C: 032] Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) (owner: 10Sbisson) [20:11:04] Ah [20:11:16] !log finished deploying parsoid sha 4de49e26 [20:11:19] time to verify next [20:11:19] (03PS3) 10Mattflaschen: Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) (owner: 10Sbisson) [20:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:29] Presumably it should have an ops filed ticket :P [20:11:33] (03CR) 10Mattflaschen: Enabling thank-you-edit on beta for testing. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) (owner: 10Sbisson) [20:11:41] (03CR) 10Mattflaschen: [C: 032] Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) (owner: 10Sbisson) [20:12:07] (03Merged) 10jenkins-bot: Enabling thank-you-edit on beta for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303957 (https://phabricator.wikimedia.org/T128249) (owner: 10Sbisson) [20:14:53] Amir1, done. [20:15:23] Awesome! [20:15:49] thanks [20:17:18] halfak: o/ [20:17:20] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Re-enable thank-you-edit (milestone notifications for 1st, 10th, 100th, etc. edits) on Beta Cluster (duration: 02m 45s) [20:17:21] o/ Amir1, just reconnected. What's the status? [20:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:29] I'm deploying [20:17:35] Cool. Standing by [20:17:43] With coffee :) [20:18:53] !log deploygin 7aad8e9 for ores in canary (scb2001.codfw.wmnet) [20:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:32] !log deploygin 7aad8e9 for ores in all nodes [20:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:40] (03Draft1) 10Gehel: WIP - elasticsearch - cleanup roles [puppet] - 10https://gerrit.wikimedia.org/r/304067 [20:24:15] restarting services [20:25:00] And we're back [20:25:22] Err. Timing out again [20:25:32] Nope. Got it back [20:26:12] the restart hasn't finished yet halfak [20:26:13] 2/3 [20:26:18] (03PS1) 10Rush: WIP labstore nfs: nfs client mount manager [puppet] - 10https://gerrit.wikimedia.org/r/304070 (https://phabricator.wikimedia.org/T140483) [20:26:36] Gotcha. [20:26:47] Looks like we have some periodic timeouts and slowness while services restart. 
[20:26:49] Not surprising [20:26:50] (03PS3) 10Dzahn: lists: renamed wikidata-l list, redirect archive URL [puppet] - 10https://gerrit.wikimedia.org/r/304055 (https://phabricator.wikimedia.org/T136798) [20:26:57] (03CR) 10Dzahn: [C: 032] lists: renamed wikidata-l list, redirect archive URL [puppet] - 10https://gerrit.wikimedia.org/r/304055 (https://phabricator.wikimedia.org/T136798) (owner: 10Dzahn) [20:27:09] (03PS2) 10Gehel: WIP - elasticsearch - cleanup roles [puppet] - 10https://gerrit.wikimedia.org/r/304067 [20:27:26] (03CR) 10jenkins-bot: [V: 04-1] WIP labstore nfs: nfs client mount manager [puppet] - 10https://gerrit.wikimedia.org/r/304070 (https://phabricator.wikimedia.org/T140483) (owner: 10Rush) [20:27:54] Looks like we're back now [20:28:03] deployment is done [20:28:11] time to monitor [20:29:34] we had spike during the deployment but it's okay now [20:29:34] https://grafana.wikimedia.org/dashboard/db/ores-extension [20:30:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 634 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4680385 keys - replication_delay is 634 [20:31:18] Amir1: ^ is that related? [20:31:32] Our redis hardware is dedicated [20:31:46] oresdb something [20:31:51] what is the server name? [20:31:53] ah [20:31:57] let me find it [20:32:21] oresrdb [20:32:25] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [20:33:23] grrrit-wm: please wake up [20:33:25] number of failed jobs is back to zero now [20:33:30] great. [20:33:37] oh, nevermind [20:33:47] Looks like nothing went crazy with this deploy CPU-wise [20:33:50] halfak: one thing that worries me a little is that memory usage is not up now [20:34:08] That is surprising. 
[20:34:17] we should have an increase in available memory [20:34:30] It's not *bad* [20:34:36] * halfak logs in to check on uwsgi workers [20:36:34] 06Operations, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2541547 (10dcausse) I could reproduce the issue with a shard from logstash1004 (logstash-2016.07.16). This is 77068993msec to merge doc values... [20:36:40] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4669733 keys - replication_delay is 0 [20:37:24] Amir, our memory footprint per worker is where I want it to be. [20:37:27] (03PS2) 10Dzahn: alsafi: add missing IPV6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302868 [20:37:48] RES ~550MB [20:37:48] (03CR) 10Dzahn: [C: 032] alsafi: add missing IPV6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302868 (owner: 10Dzahn) [20:38:06] halfak: what about uwsgi memory footprint? [20:38:21] You mean celery? [20:38:43] yup [20:38:55] As expected around 1000MB per worker [20:39:49] no, uwsgi [20:39:49] oh, okay [20:40:07] It doesn't seem like memory pressure is *bad* [20:40:13] We're definitely using less per-worker [20:40:21] We're *almost* cutting uwsgi memory usage in half [20:41:52] 49 * 400MB = ~ 14GB more memory should be free [20:41:54] Per node [20:42:11] \o/ [20:42:11] I'm going to make the change propagation settings to reduce even more pressure [20:42:26] Yes. By making it requests multiple models per request? 
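A quick check of the per-node estimate quoted above, as a minimal sketch. It takes the two figures from the chat at face value (49, assumed here to be the number of uwsgi workers per node, and roughly 400 MB saved per worker); the function and constant names are illustrative only. With those inputs the product comes out nearer 19.6 GB than 14 GB, so the "~14GB" figure presumably rested on a smaller per-worker saving or a different worker count.

```python
# Figures quoted in the chat; their interpretation is an assumption:
# 49 is read as uwsgi workers per node, 400 as MB saved per worker.
WORKERS_PER_NODE = 49
SAVED_PER_WORKER_MB = 400


def freed_memory_gb(workers: int, saved_mb: float) -> float:
    """Total memory freed on one node, in decimal gigabytes (1 GB = 1000 MB)."""
    return workers * saved_mb / 1000


freed = freed_memory_gb(WORKERS_PER_NODE, SAVED_PER_WORKER_MB)
print(f"{freed:.1f} GB freed per node")
```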
[20:42:43] We should see CPU drop by a factor of 3 [20:42:44] For ORES [20:43:05] yup [20:43:16] We probably won't see this on the dashboard, but API requests should drop by a similar factor [20:43:32] (03PS3) 10Dzahn: ganglia: Limit to production networks and fundraising networks [puppet] - 10https://gerrit.wikimedia.org/r/304032 (owner: 10Muehlenhoff) [20:43:38] So, we should have fewer active uwsgi processes from precaching [20:44:13] bblack: in DNS templates, when there are multiple records for one host, which format do you prefer https://phabricator.wikimedia.org/P3807 [20:45:33] (03CR) 10Dzahn: [C: 032] ganglia: Limit to production networks and fundraising networks [puppet] - 10https://gerrit.wikimedia.org/r/304032 (owner: 10Muehlenhoff) [20:45:56] From my notes, uwsgi memory usage used to be between 964 MB and 1167 MB. It is now 550MB. Not sure what's using up the available memory. [20:46:09] which imagemagick version is installed in the scalers? [20:47:00] OK. Declaring victory. Time to speed read a paper. [20:47:02] Platonides: on the 14.04 boxes [20:47:03] 8:6.7.7.10-6ubuntu3.1 [20:47:03] Thanks Amir1! [20:47:06] And nice work :D [20:47:09] 8:6.8.9.9-5+deb8u3 [20:47:21] mw1293 [20:47:29] mw2087 has the above.. [20:47:34] lol [20:47:38] thanks. I do some stuff and call it a day [20:47:42] great [20:47:46] conflicting versions xD [20:47:54] we are seeing some odd thumbnails [20:48:13] 1293 is running jessie [20:48:45] ofc, we're not serving from codfw [20:49:27] hmm [20:49:28] odd [20:49:47] I only see 6.8.9.9-5+deb8u2 in jessie [20:49:56] as if there's no deb8u3 [20:50:20] 2165 # ROW A eqiad imagescalers [20:50:20] 2166 node /^mw129[3-8]\.eqiad\.wmnet$/ { [20:50:25] i confirmed these are all jessie [20:50:42] Platonides: We've probably rebuilt it for some specific reason [20:51:19] Hey, did you guys recently mess with imagemagick, or anythin else related to thumbnail creation for new uploads? [20:51:45] I was asking about it, Reven :) [20:51:52] Cool. 
[20:52:04] I thought you would have added a wmf suffix, not u3, Reedy [20:52:15] Revent: yet it does not have a _wmf suffix [20:52:21] eh, i meant Reedy [20:52:31] it may have been patched for not following outside links [20:52:53] plus, it got sandboxed [20:53:06] Platonides: If you did not hear me say, it’s also on other images in Special:NewFiles [20:53:12] ops [20:53:17] I didn't see it [20:53:37] there is no https://gerrit.wikimedia.org/r/#/q/project:operations/debs/imagemagick [20:55:16] ok, u3 is really there upstream [20:55:21] https://packages.debian.org/jessie/imagemagick [20:55:34] i used reprepro to check apt.wikimedia.org [20:55:39] that is not there [20:55:43] alright, yea [20:56:09] apt-cache show imagemagick [20:56:13] lists both u2 and u3 [20:56:20] my fault [20:56:27] didn't scroll anough [20:56:31] *enough [20:59:29] 06Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10Dzahn) WMF used to run a freenode server (T82958) but we are now decom;ing it (T120752). That would have been the perfect fallback, like if there are netspli... [21:04:02] (03PS1) 10Dzahn: remove nobelium from puppet and install-server [puppet] - 10https://gerrit.wikimedia.org/r/304112 (https://phabricator.wikimedia.org/T142581) [21:05:32] (03CR) 10Dzahn: "do you know about this part?" [puppet] - 10https://gerrit.wikimedia.org/r/304112 (https://phabricator.wikimedia.org/T142581) (owner: 10Dzahn) [21:05:45] (03PS10) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 (https://phabricator.wikimedia.org/T142488) [21:07:03] (03PS1) 10Dzahn: remove nobelium.eqiad.wmnet, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/304113 (https://phabricator.wikimedia.org/T142581) [21:09:02] Action failed: Could not acquire locks on server rdb2. 
[21:09:09] deployment-prep is annoying [21:17:33] (03PS3) 10Jforrester: Enable VisualEditor by default for logged-out users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303587 (https://phabricator.wikimedia.org/T142587) [21:17:35] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-in users on Indic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) [21:17:37] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-out users on Indic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304118 (https://phabricator.wikimedia.org/T142586) [21:18:14] (03PS1) 10Andrew Bogott: Revert "Temporarily disable instance creation and security group editing" [puppet] - 10https://gerrit.wikimedia.org/r/304119 [21:18:28] (03CR) 10Jforrester: "(PS3 is just a rebase; still waiting for green light for this.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303587 (https://phabricator.wikimedia.org/T142587) (owner: 10Jforrester) [21:18:43] (03CR) 10Dzahn: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/302831 (https://phabricator.wikimedia.org/T142488) (owner: 10ArielGlenn) [21:18:47] (03CR) 10Jforrester: [C: 04-2] "Not yet announced, subject to community discussion first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) (owner: 10Jforrester) [21:18:54] (03CR) 10Jforrester: [C: 04-2] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304118 (https://phabricator.wikimedia.org/T142586) (owner: 10Jforrester) [21:19:04] I saw that mutante [21:19:13] there's a bunch of older than me in the queue [21:20:26] Yeah it's kinda slow at the moment. 
[21:20:40] Nothing seems wrong, just looks like a bunch of people decided to push within the same few minute period :) [21:21:51] yeah I'm watching it [21:21:55] grind slowly throught he mw cores [21:22:04] ostriches it seems only three instances are being made avilable for nodepool [21:22:16] it's all good. we'll live :) [21:22:36] Yep, but i thought the gate had higher priority [21:22:48] since that is how we set it out in zuul? [21:22:49] 06Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2541749 (10greg) I'm inclined to just have an official Conpherence room for this. It'd need to be clear that this room (or any solution, really) is **only for backup pu... [21:23:18] I dunno. I'm just gonna go have a coffee or something while I wait. nbd :) [21:23:30] Lol [21:23:58] ostriches yeh we do but as we doint have any priority set in test it will always presume it is the high priority low [21:24:01] low = lol [21:28:26] (03CR) 10Dzahn: "btw the task about labs -> prod ganglia was https://phabricator.wikimedia.org/T115330" [puppet] - 10https://gerrit.wikimedia.org/r/304032 (owner: 10Muehlenhoff) [21:28:51] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2541764 (10Dzahn) also https://gerrit.wikimedia.org/r/#/c/304032 has been merged today now [21:34:41] (03PS2) 10Dzahn: Stop icinga git remote update [puppet] - 10https://gerrit.wikimedia.org/r/303955 (https://phabricator.wikimedia.org/T127093) (owner: 10Thcipriani) [21:35:22] (03PS1) 10Ladsgroup: changeprop: use one request for multiple models [puppet] - 10https://gerrit.wikimedia.org/r/304125 (https://phabricator.wikimedia.org/T142360) [21:40:52] mutante: I prefer the duplication actually, but I think the blank thing is more common in our files so far. Could do a patch to duplicate them all around and fix it. 
(my problem with the blanks is it kills grepping, and sorting) [21:40:58] (03CR) 10Andrew Bogott: [C: 032] Revert "Temporarily disable instance creation and security group editing" [puppet] - 10https://gerrit.wikimedia.org/r/304119 (owner: 10Andrew Bogott) [21:42:11] bblack: thanks, that confirms what i was thinking [21:42:25] !log added tbayer(HaeB) to wmf LDAP group [21:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:12] 06Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10Southparkfan) Do we prefer a fallback that cannot be impacted by a Wikimedia outage of any kind? Conpherence is an option, but it is not off-site; a network... [21:45:23] 06Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2541910 (10Dzahn) There is also the external VM that runs wikitech-static. It is outside WMF infra for this reason. [21:46:24] greg-g: hey... Any special considerations for deploying a switch from old extension format to user registration for CentralNotice [21:46:26] ? [21:46:49] AndyRussG: ask legoktm and Reedy and MaxSem that you're doing it right [21:46:51] (sorry for the extraneous ping, not sure if it's u or someone else I should bug...) [21:46:55] Ah K cool thx! [21:47:27] AndyRussG: Are you updating CommonSettings too? [21:47:50] Reedy noo.... https://gerrit.wikimedia.org/r/#/c/186899/ [21:48:08] On beta cluster it has "just worked" [21:48:20] Yeah [21:48:25] I was gonna put it on a SWAT deploy but we could do it otherwise too [21:48:25] You can just let it ride the train if you want [21:48:59] I have one other CN patch to go out, thought this evening's SWAT might be fine [21:49:35] Oh, you need to use wmf_deploy branch... [21:49:36] The other patch is clogging some logs so it's a bit more pressing... 
Not necessary to push them both out at the same time, but it's easier from the point of view of monitoring at least... [21:50:05] * AndyRussG hurls wmf_deploy branch at random unsuspecting bystanders [21:50:13] https://integration.wikimedia.org/zuul/ [21:50:27] it seems either 1- zuul is super slow 2- it's super busy [21:50:32] known [21:50:33] Amir1: Known issue [21:50:35] see -releng [21:50:41] okay [21:50:48] AndyRussG: Did you change any config type things? [21:50:56] If it just worked on beta... should be fine in prod [21:51:16] You've some post deploy cleanup of config.. [21:51:18] Reedy: well, they got moved around a lot [21:51:50] No actual changes in how to config stuff. A lot of changes in autoloading and hooks tho [21:52:12] If your tests etc are still working... Tested locally [21:52:17] You should be fine to just deploy [21:52:30] How does the post deploy cleanup go specifically? [21:52:37] You don't have to do it now [21:52:42] or anytime soon [21:52:44] Ah K, but what is it? [21:52:57] well, replace include php file with wfLoadExtension [21:53:00] Then things like $wgNoticeProject = $wmgNoticeProject; [21:53:04] you don't need that [21:53:23] Remove those lines, rename $wmg -> $wg in InitialiseSettings [21:53:31] Yeah. Browser tests are working. Stuff is tested locally. (Just a couple odd code paths I still want to smoke test.) Integration tests are working. [21:53:39] This is in the wmf-config repo? [21:53:42] Yeah [21:53:57] Looks like you've only got 2 wmg configs [21:54:04] $wgCentralNoticeLoader = $wmgCentralNoticeLoader; [21:54:24] I'm in the dark wrt wmg [21:54:33] (03CR) 10ArielGlenn: [C: 032] Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 (https://phabricator.wikimedia.org/T142488) (owner: 10ArielGlenn) [21:54:55] 3 lines in CommonSettings... 
2 in IS [21:55:07] AndyRussG: Basically, the $wg = $wmg are from the old registration [21:55:20] They'd be populated out, and then when you loaded the extension, the config would be overwritten [21:55:36] It's not needed with extension registration [21:55:53] Zuul seems to have stuck jobs: https://integration.wikimedia.org/zuul/ [21:56:02] Known problem [21:56:05] see -releng [21:56:18] ah ok [21:57:15] AndyRussG: Your cleanup should be easy :) [21:57:58] Reedy: cool beans! I'll ping you on the future gerrit change, in any case :) [21:58:09] Or I can just make it for you now :P [21:58:27] Ah sure also! thx much :) [21:58:42] Reedy: so no issues u can think of for deploying the actual CN change? [21:58:49] Nope [21:58:57] It's crappy config, and not testing that cause more problems :) [21:58:58] Cool beans!! Much appreciated :) [21:59:25] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Change CP to do several models at once. - https://phabricator.wikimedia.org/T142360#2541943 (10Ladsgroup) https://github.com/wikimedia/change-propagation/pull/78 [22:03:39] AndyRussG: I'd suggest updating all active MW branches too [22:04:09] * Reedy waits for git review [22:04:27] Reedy: just 13 and 14 right? https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap [22:04:36] Yup [22:04:50] I think 13 goes away in a couple of days [22:04:50] K yeah [22:05:07] Mmm yeah was planning to push it out everywhere [22:05:54] Yeah, tomorrow .14 is everywhere [22:06:19] (03PS1) 10Reedy: Load CentralNotice via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304126 (https://phabricator.wikimedia.org/T140852) [22:06:24] K [22:06:33] And that should be your config change [22:07:16] K cool [22:12:30] 06Operations, 06Commons: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2541691 (10Josve05a) [22:12:35] would mw2086-mw2089 scalers be active? 
[22:12:43] 06Operations, 06Commons: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542002 (10Josve05a) p:05Triage>03Unbreak! [22:12:43] or are they on a dormant cpd? [22:13:37] should be dormant AFAIK [22:15:31] so, their jessie reinstall shouldn't have produced the bad thumbnails [22:16:08] where is the list of scalers? [22:16:22] I see jessie scalers were installed already on July [22:17:08] 06Operations, 06Commons: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2541691 (10Josve05a) | {F4351817} | {F4351819} | 799px | original file [22:18:27] 06Operations, 10Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2542025 (10Aklapper) I'm fine with @ops-monitoring-bot. I cannot comment on security of API tokens, so please explicitly indicate once things have been agreed on (or just steal this task from m... [22:20:13] 06Operations, 06Commons, 07Regression: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542031 (10Josve05a) [22:21:45] There's a list in ganglia probably... else in the puppet repo [22:22:19] from site.pp [22:22:35] it seems they would be mw129[3-8]\.eqiad\.wmnet [22:22:48] which are jessie since 2016-07-04 [22:22:57] according to SAL entry by moritzm [22:23:07] 06Operations, 06Commons, 07Regression: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2541691 (10Yann) Can be seen on all tumbnails of https://commons.wikimedia.org/wiki/File:Von_Bach_bis_Tango_(ZMF_2016)_jm61021.jpg [22:23:39] what host did I log into earlier... 
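The node pattern quoted from site.pp above can be checked against candidate hostnames with a short sketch like the following. The regular expression is the one from the chat with the Puppet `/.../` delimiters removed; the helper name and the list of test hostnames are made up for illustration. It only shows which names the pattern selects, not which hosts are actually pooled.

```python
import re

# Pattern quoted in the chat from site.pp (Puppet delimiters stripped).
EQIAD_SCALERS = re.compile(r"^mw129[3-8]\.eqiad\.wmnet$")


def is_eqiad_scaler(hostname: str) -> bool:
    """Return True if the hostname matches the eqiad image scaler node regex."""
    return EQIAD_SCALERS.match(hostname) is not None


# Hypothetical candidates: a range around the pattern plus one codfw host.
hosts = [f"mw{n}.eqiad.wmnet" for n in range(1292, 1300)] + ["mw2086.codfw.wmnet"]
matching = [h for h in hosts if is_eqiad_scaler(h)]
print(matching)
```

Under this pattern the codfw hosts mw2086-mw2089 do not match, which is consistent with the remark that they should be dormant rather than serving thumbnails.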
[22:24:00] They seem to be, yeah [22:24:38] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 for chelsyx - https://phabricator.wikimedia.org/T142648#2542048 (10chelsyx) [22:26:05] (03CR) 10Alex Monk: "I think I've figured out why: Although we set up multiple files with hook functions inside them, PowerDNS only supports one. So we have a " [puppet] - 10https://gerrit.wikimedia.org/r/304049 (owner: 10Rush) [22:27:12] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and fluorine for chelsyx - https://phabricator.wikimedia.org/T142648#2542080 (10chelsyx) [22:27:46] do we have a ETA for T142638? or should we add a site notice? [22:27:47] T142638: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638 [22:33:33] Hi yannf. Looking. [22:34:08] 06Operations, 10Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2542103 (10mmodell) @aklapper: The security token thing is a non-issue. The semantic naming is perhaps important to someone but I say go ahead and create `@ops-monitoring-bot` [22:34:37] yannf: you can help, could you find the time the problem appeared for the first time? [22:36:03] 06Operations, 06Commons, 07Regression: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542106 (10Josve05a) # Add new image (with this issue) to random Wikipedia article # A "new thumb size" of that file is generated # Note lines in that version of the file... [22:36:35] !log restarting rabbitmq-server on labvirt1001 [22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:48] Dereckson, it was seen at least 2 hours ago [22:39:17] Platonides: so, bringing the discussion of this channel in here. you mentoined a shared ban list? is that somethign that this channel is already using or can we enable? 
[22:39:34] seems sensible as we've had a recent rash of trolls [22:40:19] it is already using it [22:40:22] +b $j:#channel-where-the-bans-are-set [22:40:30] $j:#wikimedia-bans [22:40:34] bans on #wikimedia-bans [22:40:38] will affect here [22:40:42] (03PS5) 10Yuvipanda: prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 [22:40:48] caveats: [22:40:57] it doesn't block from speaking, only from joining [22:41:01] so a kick is still needed [22:41:27] and we aren't really banning the trol much there [22:41:36] where is the listing that ties to that list? (its not channel flags) [22:42:01] my irc-foo is weak. [22:42:02] the bans on #wikipedia-es-ops are kept more up-to-date [22:42:10] but we are stricter [22:42:19] Dereckson: 2 hours 30 min ago was the first report on IRC bout the thumb issue...I think... [22:42:20] like banning some server ranges [22:42:27] robh: /mode #wikimedia-operations +b [22:42:42] you will see the $j: "ban" [22:42:44] ahhh, ok [22:42:45] (03CR) 10Yuvipanda: [C: 032 V: 032] prometheus: Add blackbox exporter role/class [puppet] - 10https://gerrit.wikimedia.org/r/303986 (owner: 10Yuvipanda) [22:42:47] i do now, thanks! [22:43:42] 13:54 moritzm: depooling image scaler mw1298 for some local tests with huge SVGs [22:43:56] 12:26 moritzm: depooling image scalers mw2086-mw2089 for reimaging with jessie [22:44:23] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Regression: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542141 (10matmarex) [22:45:30] MatmaRex: Welcome to the thumb party [22:45:57] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Regression: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542157 (10Dereckson) **From server admin log** 13:54 moritzm: depooling image scaler mw1298 for some local tests with huge... 
[22:47:09] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542161 (10greg) [22:48:32] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2541691 (10greg) I've added #wikimedia-incident as this is pretty bad. @MoritzMuehlenhoff @Gilles help asap please. What can... [22:49:37] 06Operations, 10Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2529677 (10Platonides) >>! In T142270#2540632, @RobH wrote: > So I don't add folks to access lists by the nick unless protection for the nick is enabled: ChanServ would ign... [22:50:02] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542184 (10Josve05a) After a fix has been made/patch reverted, we will need to purge/reparse all thumbnails generated during... [22:50:10] Platonides: ahh, cool, thx for correction [22:50:21] i wasnt sure about the enforce thing other than the invite exemption omde flag for other channels [22:50:33] :) [22:50:36] i assumed chanserv required ID before sending commands and oping but wasnt 100% sure [22:50:54] I didn't *try* it [22:50:58] but it would be insane [22:51:33] irc is so old, it may be slightly insane in its old age. i think its more polite to say 'confused' though.... [22:51:39] heh [22:51:48] but i think you are likely correct. [22:52:39] chanserv and nickserv are usually the same process [22:52:42] in a virtual node [22:53:00] https://commons.wikimedia.org/wiki/File:Building_for_Democracy-Adeiladu_ar_gyfer_Democratiaeth_(25709638912).jpg [22:53:10] ^ is the oldest upload where I’ve found it so far. 
[22:53:22] he managed to exploit that bug in one of our bots yesterday, though [22:53:45] it had been configured to only accept commands from a given list [22:54:11] but he used allowed nicks to flood it [22:54:14] even those with enforce [22:54:53] cuz bot just cared what the nick was when it was sent [22:54:59] not what it was 30 seconds later [22:55:12] just nick text comparison yea? [22:55:16] Revent, curiously I only see lines on this thumb https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Building_for_Democracy-Adeiladu_ar_gyfer_Democratiaeth_%2825709638912%29.jpg/1024px-Building_for_Democracy-Adeiladu_ar_gyfer_Democratiaeth_%2825709638912%29.jpg [22:55:24] not the others [22:55:41] robh: here he is [22:55:48] As far as ‘how far back’, tho…. there are images where it is ‘only’ on the thumb in recent images, and not the others. [22:56:17] well, old images has already "gotten thumbs generated"..maybe... [22:57:36] lets just not deal with that when the other channels already have. [22:57:51] I'm not sure if *!eva@* matches ~eva [22:57:58] that ~ is a bit special in freenode [22:58:23] shit [22:58:28] !! [22:58:34] ok [22:58:35] oh [22:58:35] whew [22:58:42] lol [22:58:44] almost [22:58:45] hah [22:58:58] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and fluorine for chelsyx - https://phabricator.wikimedia.org/T142648#2542219 (10Deskana) Approved. [22:59:16] fun with irc! [22:59:20] xD [22:59:37] I think I will open a task requesting op [23:00:03] Platonides: i'd +1 it fwiw [23:00:09] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160810T2300). [23:00:09] AndyRussG: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:14] then at one point I will have an error in the script
[23:00:16] and next week im on ops duty, so itd fall to me to do it
[23:00:20] and ban everyone, probably
[23:00:26] xD
[23:00:32] i just almost banned everyone
[23:00:34] it happens.
[23:00:36] yeah
[23:00:47] andrewbogott: we can probably work on this
[23:00:48] so, swat might not happen as labs is having issues which is affecting CI
[23:00:56] but you have more leeway in some channels than others
[23:00:59] oh yes I forgot Jenkins issues
[23:01:02] ostriches: MaxSem Dereckson who will swat?
[23:01:07] Ah hmmm
[23:01:12] * AndyRussG reads backscroll
[23:01:35] Way too busy sorry
[23:01:37] (my comment was about the Commons thumbnail issue not related with MediaWiki)
[23:01:56] (deployment, but with image scalers)
[23:02:10] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[23:02:11] Oh…crud. I see it on the ‘file page’ thumb of a file from over a day ago.
[23:02:18] https://commons.wikimedia.org/wiki/File:Cladonia_subtenuis_-_Flickr_-_pellaea_(4).jpg
[23:02:21] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[23:03:36] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Commons bug: Thumbnail generation with horizontal lines - https://phabricator.wikimedia.org/T142638#2542238 (greg) @Bawolff or @brion can either of you help out here, per chance?
[23:03:58] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542240 (Jdforrester-WMF)
[23:08:06] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542246 (Platonides) @Dereckson in T142638#2542157, depooling a scaler shouldn't matter, and mw20...
[23:09:18] Dereckson: ? I just added the change for CentralNotice wmf deploy https://gerrit.wikimedia.org/r/#/c/304130/
[23:09:27] yuvipanda: prometheus::blackbox_exporter change should be merged on master?
[23:09:41] mutante whoops, yes. thanks!
[23:09:51] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[23:10:02] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[23:10:02] done. ^ np
[23:11:23] AndyRussG: https://integration.wikimedia.org/zuul/ is too busy to merge it now
[23:12:09] (PS1) Reedy: Revert "group1 wikis to 1.28.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132
[23:12:28] (CR) Reedy: [C: 2] "Reverting due to thumbnail corruption in T142638" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132 (owner: Reedy)
[23:12:31] AndyRussG: by the way, on a side topic, wouldn't it be a good idea to plan specific deploy windows for CentralNotice, every n weeks according to your needs?
[23:13:45] (PS2) Reedy: Revert "group1 wikis to 1.28.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132
[23:13:50] (CR) Reedy: Revert "group1 wikis to 1.28.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132 (owner: Reedy)
[23:13:53] (CR) Reedy: [C: 2] Revert "group1 wikis to 1.28.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132 (owner: Reedy)
[23:14:22] Dereckson: would you prefer that I do that and not do this deploy? It just varies
[23:14:23] (Merged) jenkins-bot: Revert "group1 wikis to 1.28.0-wmf.14" [mediawiki-config] - https://gerrit.wikimedia.org/r/304132 (owner: Reedy)
[23:14:38] swat is off
[23:14:41] we have to revert
[23:14:42] Dereckson: ^
[23:14:50] K Dereckson oh well thx anyway!!!!
[23:14:59] unless it's an emergency fix that is applicable to wmf.13
[23:15:31] (PS1) Yuvipanda: tools: Increase centralized logging retention to 14d [puppet] - https://gerrit.wikimedia.org/r/304133
[23:15:33] (PS1) Yuvipanda: labs: Depool labvirt1011 [puppet] - https://gerrit.wikimedia.org/r/304134
[23:15:38] andrewbogott ^ wanna +1?
[23:15:40] I'll merge
[23:16:08] one of the imagescalers is down
[23:16:13] but it's codfw
[23:16:20] (CR) Andrew Bogott: [C: 1] "For now, the better part of valor" [puppet] - https://gerrit.wikimedia.org/r/304134 (owner: Yuvipanda)
[23:16:30] mutante: It's still in the lists scap is using though
[23:16:38] (CR) Zppix: [C: 1] labs: Depool labvirt1011 [puppet] - https://gerrit.wikimedia.org/r/304134 (owner: Yuvipanda)
[23:16:40] Reedy: i'm going to mgmt
[23:16:50] (PS2) Yuvipanda: tools: Increase centralized logging retention to 14d [puppet] - https://gerrit.wikimedia.org/r/304133
[23:16:53] mutante: It's down for papaul to look at IIRC
[23:17:00] some hardware issue possibly
[23:17:04] (CR) Yuvipanda: [C: 2 V: 2] tools: Increase centralized logging retention to 14d [puppet] - https://gerrit.wikimedia.org/r/304133 (owner: Yuvipanda)
[23:17:08] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert to .13 to attempt to fix T142638
[23:17:10] T142638: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638
[23:17:13] (PS2) Yuvipanda: labs: Depool labvirt1011 [puppet] - https://gerrit.wikimedia.org/r/304134
[23:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:17] (CR) Yuvipanda: [C: 2 V: 2] labs: Depool labvirt1011 [puppet] - https://gerrit.wikimedia.org/r/304134 (owner: Yuvipanda)
[23:17:23] oh, ok. 2086 ? searched phab
[23:17:42] fuck you and your disappearing submit button, gerrit
[23:17:54] andrewbogott merged
[23:18:12] LOL
[23:19:14] Reedy: console shows me it is .. in the middle of booting up..
cycle ?:p
[23:19:19] lol
[23:19:28] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542292 (Jdforrester-WMF) OK, the rollback looks to have fixed this. Meh.
[23:19:28] mutante: can you depool it properly please?
[23:19:30] well, it is back
[23:19:32] but i did nothing
[23:19:39] i see the login now
[23:19:51] RECOVERY - Host mw2086 is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms
[23:19:54] ^ :p
[23:20:05] i did not powercycle it, i just looked at it
[23:20:11] lol
[23:20:23] Reedy: FYI - just looked at about 20 thumbnails of jpegs uploaded since revert, no lines.
[23:20:28] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542295 (greg) >>! In T142638#2542184, @Josve05a wrote: > After a fix has been made/patch reverte...
[23:20:36] Revent: Thanks for confirming... We were coming to that conclusion too :)
[23:20:37] Revent: thanks
[23:20:42] thanks all
[23:22:22] !log connected to mw2086.mgmt (which icinga said was down since a couple hours). i saw it booting up..it came back but i did not powercycle or reboot, just view console
[23:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:00] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: puppet fail
[23:23:14] Also seeing lines disappearing from previously broken omages.
[23:23:19] *images
[23:23:23] so something in wmf.14 was adding lines?
[23:23:58] Revent: Yeah, if you purge them, it'll fix them
[23:23:58] Revent: May regenerating image when reaccessing?
[23:23:58] Yep seems so. But could it have been an extension?
[23:24:05] legoktm: Seems almost certain it's MW, not WMF config or the scalers
[23:24:15] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542316 (greg)
[23:24:33] https://www.mediawiki.org/wiki/MediaWiki_1.28/wmf.14
[23:24:41] legoktm: Not related to the mediahandler config changes?
[23:25:06] it could be? it was in wmf.14...
[23:25:06] git #20bd328e - MediaHandlerFactory: Don't use any global state
[23:25:07] git #1b1b3cdb - Move MediaHandler defaults out of global scope (task T141305)
[23:25:07] T141305: Unable to override cores default mediahandlers - https://phabricator.wikimedia.org/T141305
[23:25:17] but what would be adding lines?
[23:25:27] not sure
[23:25:31] probably a red herring
[23:25:32] Reedy: i confirmed that mw2086 is not pooled (with "confctl" on palladium)
[23:25:46] mutante: it's in the dsh lists though it seems
[23:26:14] oh, ok, doing
[23:26:16] reedy@tin:/srv/mediawiki-staging$ grep mw2086 /etc/dsh/group/mediawiki-installation
[23:26:16] mw2086.codfw.wmnet
[23:26:18] thanks :)
[23:26:23] mutante: Might be worth opening a ticket for it
[23:26:24] Reedy was there anything in the logs?
[23:26:28] If it’s at all helpful, on ‘some’ but not most images the ‘lines’ were not solid black, but had white segments….tho, dubious it’s useful info.
[23:26:39] Revent: Is it just jpg? All image types?
[23:26:55] I ONLY say it on thumbnails of jpegs.
[23:27:00] *saw
[23:27:16] * Josve05a too
[23:27:21] legoktm: VIPS?
[23:27:39] jpgs is 80%+ of uploads, did anyone see a clean non-jpg uploaded in that timeframe?
[23:27:52] https://gerrit.wikimedia.org/r/#/c/263028/
[23:27:53] hmm
[23:27:55] Reedy ^^
[23:28:02] Yeah, fae was running tiffs, they looked okay.
[23:28:04] Maybe the conversion got wrong.
[23:28:12] Platonides: Yes, hence suggesting Vips
[23:28:29] I think it's Vips...
[23:28:41] oh, vips
[23:28:45] It seems someone wiped the entry point
[23:28:51] did anyone update mw-config
[23:28:53] I don't think it would have merged those config settings properly
[23:28:54] with the change
[23:29:12] paladox: No
[23:29:26] VipsTest isn't the entry point, it's for a special page
[23:29:29] oh
[23:29:35] legoktm:
[23:29:36] woops
[23:29:38] I know what it'll be
[23:29:45] It'll be the WMF config overrides
[23:29:47] i was looking in the wrong place
[23:29:51] thanks
[23:30:05] legoktm: It'll be doing some weird ass merge
[23:30:11] Not overriding the whole lot
[23:30:38] lol, yup
[23:30:40] so it'll keep the defaults from the VIPS code, which makes vips used for jpegs, which we don't normally do
[23:30:47] Operations: mw2086 was down - https://phabricator.wikimedia.org/T142661#2542327 (Dzahn)
[23:30:56] gilles: Yeah, exactly
[23:31:00] Eurgh.
[23:31:07] var_dump( $wgVipsOptions ) confirms it
[23:31:16] Yay for too much complexity in our setup.
[23:31:21] a lot more in the array than they should be
[23:31:29] yay for terrible untested extension defaults
[23:31:30] legoktm: How do we make it completely override something?
[23:31:37] gilles: That too.
[23:31:37] Actually, per gilles
[23:31:40] let's nuke them
[23:31:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[23:31:41] Reedy we are using "_merge_strategy": "array_plus_2d" in the wrong place
[23:31:44] oh good
[23:31:48] paladox: We don't want it to merge
[23:31:52] (PS1) Dzahn: remove mw2086 from dsh group mw-installation [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661)
[23:31:56] Oh
[23:32:35] https://gerrit.wikimedia.org/r/304140
[23:32:42] Just remove the lot of them
[23:33:11] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[23:33:25] https://logstash.wikimedia.org/goto/941dfd55b1364b3d77b0734e2c9a0d4f
[23:33:51] Cherry picked here https://gerrit.wikimedia.org/r/#/c/304141/
[23:35:15] * Reedy waits for jenkins
[23:35:24] Thats going to be a while LOL
[23:35:34] are we still swatting?
[23:35:41] just force it
[23:35:44] * mafk has a patch for commonsettings.php
[23:35:45] mafk: Right now we're unbreaking prod.
[23:35:49] mafk: So, no SWAT.
[23:35:50] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[23:35:55] Nodepool is down
[23:35:57] (PS2) Dzahn: remove mw2086 from conftool, dsh group [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661)
[23:36:04] James_F: ops, no swat indeed
[23:36:12] We need a test image on testwiki before I deploy the fix
[23:36:13] mafk: Tomorrow OK?
[23:36:27] Reedy: Test wiki or mw1099?
[23:36:40] James_F: when it's good for you folks
[23:36:48] well, makes sense to test on testwiki, rather than swap all wikis there and back again?
[23:36:53] it's just a line
[23:36:59] mafk: Kk. Thanks for being understanding. :-)
[23:37:14] I guess, we can do it on 1099 on commons if wanted
[23:37:36] Reedy: https://test.wikipedia.org/wiki/File:Test_image_of_testing.png
[23:37:36] Reedy it is going to be a long while, nodepool has gone down.
[23:37:44] paladox: No it's not
[23:37:53] I V+2 and submitted it
[23:38:01] (PS2) Ladsgroup: changeprop: use one request for multiple models [puppet] - https://gerrit.wikimedia.org/r/304125 (https://phabricator.wikimedia.org/T142360)
[23:38:02] oh
[23:38:07] thanks
[23:38:12] James_F: I see no lines...
[23:38:21] oh, damn, it's a PNG.
[23:38:22] Ignore me.
[23:38:25] https://test.wikipedia.org/wiki/File:Interior_Iglesia_Claret.jpg
[23:38:31] Platonides: Thank you.
[23:38:32] lmfao
[23:38:42] Lots of lines.
[23:38:45] yes
[23:38:46] Operations: mw2086 was down - https://phabricator.wikimedia.org/T142661#2542350 (Dzahn)
[23:38:50] it's a good testing image
[23:38:51] heh
[23:38:53] Indeed.
[23:38:56] All the lines.
[23:39:03] I had been trying to upload that to deployment-prep on labs
[23:39:20] !log restarted rabbitmq on labcontrol1001
[23:39:22] but was erroring the upload
[23:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:39:29] config is right
[23:39:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:39:30] I had the Special:Upload prepared :)
[23:39:39] if you are operator on an IRC channel and then i copy/paste IRC logs into phabricator tickets, the @ sign is perfect :p
[23:40:03] Reedy: So the idea is to push the new config to test and see if it works?
[23:40:33] Yup, it's incoming
[23:40:47] config looks correct on eval.php
[23:40:50] !log reedy@tin Synchronized php-1.28.0-wmf.14/extensions/VipsScaler: Remove old broken config causing T142638 (duration: 00m 50s)
[23:40:52] T142638: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638
[23:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:59] Who wants to purge? :)
[23:41:00] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[23:41:06] * Platonides purges
[23:41:25] :)
[23:41:36] it's good now
[23:41:38] Tada, it's all prettyful
[23:41:51] Confirmed.
[23:41:54] OK, that works.
[23:42:11] (PS3) Dzahn: remove mw2086 from conftool, dsh group [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661)
[23:42:21] (PS1) Reedy: Re-instate group1 wikis to 1.28.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/304143
[23:42:24] Do we want to put that into wmf.14, and push the train back to group1? greg-g's call?
[23:42:35] Yeah, true
[23:42:45] (CR) Dzahn: [C: 2 V: 2] remove mw2086 from conftool, dsh group [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661) (owner: Dzahn)
[23:42:49] Fix is live on .14 wikis now
[23:42:52] sorry, I haven't been following, dealing with CI stuff
[23:43:06] (PS4) Dzahn: remove mw2086 from conftool, dsh group [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661)
[23:43:07] so, confirmed fixed in .14?
[23:43:10] Yup
[23:43:19] (CR) Dzahn: [V: 2] remove mw2086 from conftool, dsh group [puppet] - https://gerrit.wikimedia.org/r/304139 (https://phabricator.wikimedia.org/T142661) (owner: Dzahn)
[23:43:22] Crap default config in VipsScaler, which meant it was being used for jpegs
[23:43:23] rollforward then
[23:43:35] status of the image regens?
[23:43:37] Pre extension registration, it was overridden, it's fine
[23:43:44] someone should test large tiffs & pngs, to make sure that VipsScaler, when it actually kicks in under the expected conditions, isn't broken
[23:43:45] Post...
[23:44:36] Reedy: James_F can you test what gilles just said?
[23:44:41] before rollingforward?
[23:44:41] !log mw2086 - removing node from cluster failed - backend error, request requires authentication
[23:44:43] Operations, WMF-Legal, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1275510 (Tbayer) >>! In T98722#2537158, @AlexMonk-WMF wrote: > @wikimedia.org isn't just managed by OIT, non-staff run addresses there, for example anything going through OTRS. Tru...
[23:44:45] The config *is* correct now
[23:44:48] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542376 (Jdforrester-WMF) Update from IRC: * We rolled back the train (wmf.14) to wmf.13, which...
[23:44:48] And how it was previously
[23:44:55] ok, I believe you/the config :)
[23:45:08] doit
[23:45:09] Reedy: (just to note the obvious, the CN extension registration hasn't gone out...)
[23:45:31] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[23:45:49] (CR) Reedy: [C: 2 V: 2] "C+2 V+2 due to jenkins" [mediawiki-config] - https://gerrit.wikimedia.org/r/304143 (owner: Reedy)
[23:46:10] (PS1) Alex Monk: [WIP] dnsrecursor: Rewrite code setting up lua hooks [puppet] - https://gerrit.wikimedia.org/r/304146 (https://phabricator.wikimedia.org/T139438)
[23:46:31] large enough means a 20MP+ PNG or a 50MP+ TIFF
[23:46:38] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: Reinstate .14 as T142638 is fixed
[23:46:38] T142638: Thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638
[23:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:47:19] Why don't we have Special:LargeFiles? ;)
[23:47:49] Operations, Commons, MediaWiki-File-management, MediaWiki-extensions-VipsScaler, and 3 others: JPEG thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542401 (Jdforrester-WMF) p:Unbreak!>High a:Reedy
[23:47:53] Reedy: If you write it, I'll write Special:SmallFiles
[23:47:59] And then for completion, Special:MediumFiles
[23:48:00] :)
[23:48:03] :D
[23:48:09] I guess, define large?
[23:48:13] File size? Pixels?
[23:48:41] Operations, Commons, MediaWiki-File-management, MediaWiki-extensions-VipsScaler, and 3 others: JPEG thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638#2542433 (Josve05a)
[23:48:42] Operations: mw2086 was down - https://phabricator.wikimedia.org/T142661#2542434 (Dzahn) so i confirmed this on palladium [palladium:~] $ sudo confctl select name=mw2086.codfw.wmnet get {"mw2086.codfw.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=codfw,cluster=imagescaler,service=apache2"} then remove...
[23:48:45] Reedy: https://commons.wikimedia.org/wiki/File:Tolbachik_volcano_1975_cone_pano_Kamchatka_on_2015-07-28.png
[23:48:51] (130MB)
[23:49:19] Ooh
[23:49:24] Operations, WMF-Legal, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2542439 (AlexMonk-WMF) >>! In T98722#2542374, @Tbayer wrote: >>>! In T98722#2537158, @AlexMonk-WMF wrote: >> @wikimedia.org isn't just managed by OIT, non-staff run addresses there,...
[23:49:32] purged...
[23:50:10] (I searched Phab for complaints about large PNGs not being rendered.)
[23:50:26] heh
[23:50:28] curl -I
[23:50:29] Last-Modified: Wed, 10 Aug 2016 23:49:35 GMT
[23:50:31] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:50:36] So, that'd suggest the image was regenerated
[23:50:44] Hmm. Is that bad?
[23:50:52] Is what bad?
[23:51:11] Regenerated thumbs for all unaffected images would be a lot of load on the imageservers.
[23:51:31] They don't get done ad hoc, do they?
[23:51:50] and $wgThumbnailEpoch = '20130601000000';
[23:52:16] The amount of broken images onwiki shouldn't be too much to cause that much load for long
[23:52:20] we could purge individually images uploaded in the last hours
[23:52:20] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[23:52:27] and any that users come across
[23:52:35] that nodepoold alert can be ack'd
[23:52:47] Platonides: That's what I'd do tbh
[23:53:02] Most older users would know to try purging in the first instance
[23:53:37] Operations, ops-codfw: mw2086 was down - https://phabricator.wikimedia.org/T142661#2542444 (Dzahn)
[23:54:08] * James_F nods.
[23:54:30] I'll comment on commons
[23:54:38] I don't have perms to ack it
[23:54:45] the nodepoold alert
[23:54:48] "some thumbnails may still be broken but should be repaired soon as a maintenance script is running now"
[23:54:52] uh, what maintenance script?
[23:55:33] Reedy: The one AaronSchulz is fixing right now, see https://gerrit.wikimedia.org/r/304147
[23:55:51] heh
[23:55:55] Reedy: Also if we need to document it don't we keep the task open? I've lost track of what current protocol is.
[23:55:59] Is that related
[23:56:03] Oh, do we?
[23:56:13] I don't know.
[23:56:16] Operations, ops-codfw: mw2086 was down - https://phabricator.wikimedia.org/T142661#2542327 (Dzahn) [palladium:~] $ sudo confctl select name=mw2086.codfw.wmnet set/pooled=no ERROR:conftool:Error when trying to set/pooled=no on name=mw2086.codfw.wmnet ERROR:conftool:Failure writing to the kvstore: Backend...
[23:56:17] Hence my question.
[23:56:28] "I don't work here"
[23:56:30] * Reedy grins
[23:57:05] James_F: which protocol?
[23:57:27] Do we need to write the post mortem before closing the task
[23:57:28] I think
[23:57:36] greg-g: Your one, I guess. Tasks in "Wikimedia-Incident".
[23:57:37] oh, whatever
[23:58:06] OK, I'll leave it. :-)
[23:58:07] if the only thing blocking resolving the task is the report, just close it and write it, we'll file follow-up tasks after
[23:58:17] or don't, I don't really care, honestly :)
[23:59:03] !log Running purgeChangedFiles.php on all wikis on a terbium screen (T142638)
[23:59:04] T142638: JPEG thumbnails being generated in a corrupted state with horizontal lines across them - https://phabricator.wikimedia.org/T142638
[23:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master