[02:11:40] (CR) Krinkle: [C: -1] profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[02:24:56] away Away - Detached
[02:59:59] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 13m 18s)
[03:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:38] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.5) (duration: 14m 29s)
[03:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 29 04:12:10 UTC 2018 (duration 14m 32s)
[04:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 223.61 seconds
[05:08:04] (PS1) Marostegui: check_private_data: Adapt it for the new sanitariums [puppet] - https://gerrit.wikimedia.org/r/435959 (https://phabricator.wikimedia.org/T190704)
[05:08:41] (PS2) Marostegui: check_private_data: Adapt it for the new sanitariums [puppet] - https://gerrit.wikimedia.org/r/435959 (https://phabricator.wikimedia.org/T190704)
[05:09:21] (CR) Marostegui: [C: 2] check_private_data: Adapt it for the new sanitariums [puppet] - https://gerrit.wikimedia.org/r/435959 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:11:53] (PS1) Marostegui: s*.hosts: Add db2094 to s1,s3,s5,s8 [software] - https://gerrit.wikimedia.org/r/435960 (https://phabricator.wikimedia.org/T190704)
[05:13:12] (CR) Marostegui: [C: 2] s*.hosts: Add db2094 to s1,s3,s5,s8 [software] - https://gerrit.wikimedia.org/r/435960 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:13:59] !log Stop MySQL on db2094 and db2095 for testing - T190704
[05:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:03] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[05:14:11] (Merged) jenkins-bot: s*.hosts: Add db2094 to s1,s3,s5,s8 [software] - https://gerrit.wikimedia.org/r/435960 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:14:51] (PS7) Marostegui: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[05:20:50] !log Restart MySQL on db2045 (s8 codfw master) - T195598
[05:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:54] T195598: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598
[05:51:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0
[05:51:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[05:55:21] ^ I guess that comes from the email from CenturyLink
[06:30:24] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:31:13] PROBLEM - puppet last run on labvirt1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:37:08] (PS1) Elukey: Swap zookeeper from conf1003 to conf1006 [puppet] - https://gerrit.wikimedia.org/r/435963 (https://phabricator.wikimedia.org/T182924)
[06:44:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0
[06:44:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[06:50:22] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11299/" [puppet] - https://gerrit.wikimedia.org/r/435963 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[06:52:20] !log roll restart hadoop master daemons to pick up the new zookeeper settings
[06:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:52] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on labvirt1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:59] (PS1) Elukey: role::analytics_cluster::hadoop::worker: deploy Ores pkgs only on Hadoop [puppet] - https://gerrit.wikimedia.org/r/435966
[07:17:08] (PS1) Alexandros Kosiaris: url-downloader: Point to actinium [dns] - https://gerrit.wikimedia.org/r/435967 (https://phabricator.wikimedia.org/T187962)
[07:18:55] (PS1) Alexandros Kosiaris: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/435968 (https://phabricator.wikimedia.org/T187962)
[07:19:13] (CR) Alexandros Kosiaris: [C: 2] url-downloader: Point to actinium [dns] - https://gerrit.wikimedia.org/r/435967 (https://phabricator.wikimedia.org/T187962) (owner: Alexandros Kosiaris)
[07:21:36] (CR) Alexandros Kosiaris: [C: -1] "Having it ready just in case, but let's wait for an actual problem before deploying per https://phabricator.wikimedia.org/T187962#4206574" [mediawiki-config] - https://gerrit.wikimedia.org/r/435968 (https://phabricator.wikimedia.org/T187962) (owner: Alexandros Kosiaris)
[07:22:58] (PS2) Elukey: role::analytics_cluster::hadoop::worker: deploy Ores pkgs only on Hadoop [puppet] - https://gerrit.wikimedia.org/r/435966
[07:23:45] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4237839 (akosiaris) >>! In T187962#4224809, @jcrespo wrote: > m1 master is on C, so the following services may go down: > > * etherpadlite Per @jcrespo's c...
[07:24:19] (PS3) Elukey: profile::hadoop::common: selectively deploy Ores packages [puppet] - https://gerrit.wikimedia.org/r/435966
[07:25:11] (CR) Elukey: [C: 2] profile::hadoop::common: selectively deploy Ores packages [puppet] - https://gerrit.wikimedia.org/r/435966 (owner: Elukey)
[07:26:03] !log upgrading remaining app servers in eqiad to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[07:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:40] Amir1: o/
[07:26:42] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237841 (Gehel) >>! In T195741#4237117, @MoritzMuehlenhoff wrote: >>>! In T195741#4237089, @Gehel wrote: >> It looks like we have a few missing dependencies in Str...
[07:27:01] Amir1: the ores pkg issue should be solved, do you want to send the code review again?
[07:28:16] (CR) Vgutierrez: Use MemoryReactorClock for testing the UDP monitor (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/435763 (owner: Mark Bergsma)
[07:29:45] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237844 (Gehel) >>! In T195741#4237539, @Pnorman wrote: > We may need to deploy different binaries for kartotherian and tilerator on Jessie and Stretch. The packag...
[07:32:37] (CR) Vgutierrez: Use arping to detect duplicated IPs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[07:34:35] (PS1) Gilles: Enable performance survey on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/435969 (https://phabricator.wikimedia.org/T187299)
[07:35:57] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237849 (MoritzMuehlenhoff) >>! In T195741#4237841, @Gehel wrote: > gehel@maps-test2004:~$ apt-cache show openjdk-8-jre-headless=8u171-b11-1~deb9u1 | grep Provides...
[07:37:11] (CR) Gilles: [C: 2] Enable performance survey on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/435969 (https://phabricator.wikimedia.org/T187299) (owner: Gilles)
[07:38:30] (Merged) jenkins-bot: Enable performance survey on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/435969 (https://phabricator.wikimedia.org/T187299) (owner: Gilles)
[07:39:31] (CR) jenkins-bot: Enable performance survey on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/435969 (https://phabricator.wikimedia.org/T187299) (owner: Gilles)
[07:47:45] !log gilles@tin Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on ruwiki (duration: 01m 50s)
[07:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:50] T187299: User-perceived page load performance study - https://phabricator.wikimedia.org/T187299
[07:48:31] (CR) Jcrespo: [C: -1] "Past events made phabricator go down- better to link to a page on meta with a copy of the information." [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[07:49:40] !log reimage druid1002 to debian stretch
[07:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:04] (PS1) Elukey: Set zookeeper verion for druid1002 [puppet] - https://gerrit.wikimedia.org/r/435970 (https://phabricator.wikimedia.org/T192636)
[07:52:06] (PS5) Ayounsi: Use arping to detect duplicated IPs [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522)
[07:52:08] (CR) Jcrespo: [C: 1] db1061: Upgrade socket location [puppet] - https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[07:52:28] (CR) Jcrespo: [C: 1] "It will require a restart of the heartbeat daemon." [puppet] - https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[07:52:55] (CR) Elukey: [C: 2] Set zookeeper verion for druid1002 [puppet] - https://gerrit.wikimedia.org/r/435970 (https://phabricator.wikimedia.org/T192636) (owner: Elukey)
[07:53:03] (CR) Jcrespo: [C: 1] "And prometheus." [puppet] - https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[07:54:19] (CR) Jcrespo: [C: 1] "Note the load 10 is not really needed (but it may be nice to add load 1 to another host)" [mediawiki-config] - https://gerrit.wikimedia.org/r/435760 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[07:55:47] (CR) Ayounsi: "Fixing comments, make the script NOT a template." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[07:56:46] !log upgrading mw1276-mw1290 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[07:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:05] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/11301/" [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[08:05:26] (CR) Vgutierrez: Use arping to detect duplicated IPs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[08:05:57] (PS8) Jcrespo: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962)
[08:07:07] (PS1) Alexandros Kosiaris: icinga: Populate additional hostgroups based on LLDP [puppet] - https://gerrit.wikimedia.org/r/435972 (https://phabricator.wikimedia.org/T187962)
[08:07:54] (CR) Marostegui: [C: -2] "> Past events made phabricator go down- better to link to a page on" [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[08:09:03] (CR) Jcrespo: [C: -1] "We can just copy it to a random one. Phabricator is not ready to get wikipedia traffic load." [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[08:09:50] (PS6) Ayounsi: Use arping to detect duplicated IPs [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522)
[08:09:59] (CR) Ayounsi: Use arping to detect duplicated IPs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: Ayounsi)
[08:10:27] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4237927 (Gehel) jvm-tools tested and copied from jessie-wikimedia to stretch-wikimedia
[08:17:27] (PS2) Alexandros Kosiaris: icinga: Populate additional hostgroups based on LLDP [puppet] - https://gerrit.wikimedia.org/r/435972 (https://phabricator.wikimedia.org/T187962)
[08:18:04] (PS2) Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939)
[08:24:02] (CR) Vgutierrez: [C: 1] Implement common base class for "looping check" monitors [debs/pybal] - https://gerrit.wikimedia.org/r/435764 (owner: Mark Bergsma)
[08:24:48] !log upgrading remaining API servers in eqiad to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[08:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:49] (PS3) Alexandros Kosiaris: icinga: Populate additional hostgroups based on LLDP [puppet] - https://gerrit.wikimedia.org/r/435972 (https://phabricator.wikimedia.org/T187962)
[08:27:08] (CR) Vgutierrez: [C: 1] "inline question, LGTM otherwise" (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/434695 (owner: Mark Bergsma)
[08:31:09] (Abandoned) Vgutierrez: toollabs: add mono_external class [puppet] - https://gerrit.wikimedia.org/r/433142 (https://phabricator.wikimedia.org/T194665) (owner: Arturo Borrero Gonzalez)
[08:33:21] (CR) Alexandros Kosiaris: [C: 2] icinga: Populate additional hostgroups based on LLDP [puppet] - https://gerrit.wikimedia.org/r/435972 (https://phabricator.wikimedia.org/T187962) (owner: Alexandros Kosiaris)
[08:40:46] !log performing topology changes on s6 ahead of a possible failover
[08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:14] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1339.eqiad.wmnet])
[08:48:33] :?
[08:48:43] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1339.eqiad.wmnet are marked down but pooled
[08:48:52] moritzm: related with your reimages?
[08:49:01] sorry, upgrades
[08:49:03] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1339.eqiad.wmnet are marked down but pooled
[08:49:30] yup.. something regarding mw1339
[08:50:44] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[08:50:53] sorry, yeah. that was my bad, forgot to depool 1339
[08:50:58] np :)
[08:51:04] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[08:53:14] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal
[09:01:55] (CR) Alexandros Kosiaris: [C: 1] Remove at [puppet] - https://gerrit.wikimedia.org/r/435171 (owner: Muehlenhoff)
[09:03:49] (CR) Marostegui: [C: 2] mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[09:05:02] !log upgrading labweb servers to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:15] (Merged) jenkins-bot: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[09:06:53] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238096 (Marostegui)
[09:07:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool all databases in row C - T187962 (duration: 01m 35s)
[09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:34] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962
[09:08:54] (PS3) Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939)
[09:09:35] (CR) jenkins-bot: mariadb: Depool all row C databases (except s6 master) [mediawiki-config] - https://gerrit.wikimedia.org/r/433014 (https://phabricator.wikimedia.org/T187962) (owner: Jcrespo)
[09:13:00] !log Downtime s6 replicas for 4 hours - T195595
[09:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:22] T195595: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595
[09:16:38] !log disable ping1001 redirect - T187962
[09:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:47] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962
[09:20:44] !log redirect ns0 to baham - T187962
[09:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:44] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238139 (ayounsi)
[09:25:12] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4238145 (Gehel) osmborder rebuilt and uploaded to reprepro. Only Cassandra left.
[09:38:46] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4238201 (Gehel) I can't find a repo in gerrit for cassandra packaging. @elukey / @Eevans any idea where it could be?
[09:40:21] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.4.0-1) - https://phabricator.wikimedia.org/T195609#4238212 (akosiaris) Open>Resolved a:akosiaris `blubber_0.4.0-1_amd64` has been uploaded to both `stretch-wikimedia` and `jes...
[09:42:42] gehel: I don't think there is one, I believe that Eric handles the packages by himself.. I'd wait for him later on today :)
[09:44:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.103 second response time
[09:45:11] gehel: Eric is an upstream maintainer of cassandra and IIRC he maintains the debs upstream (and we simply import those)
[09:45:44] ah! Didn't know it :)
[09:46:17] gehel: I think it's those: http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/
[09:46:26] we have a special cassandra 2.2 version though (that runs on aqs and now maps2004-test)
[09:46:41] it contains a patch that allows the use of the jmx exporter as java agent
[09:47:51] * akosiaris shuts ears and goes lalalalalalalala
[09:48:22] hmmh, no. those debs have someone from apache.org in debian/changelog, while the debs on restbase* have Eric in the Debian changelog
[09:49:39] but the cassandra 2 packages currently used on maps seem to be from the upstream debian repo (compared to cassandra 3)
[09:49:51] best to wait for him I'd say :-)
[09:58:54] (PS2) Mobrovac: Switch all job apart from exceptions for everything. [mediawiki-config] - https://gerrit.wikimedia.org/r/434891 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[10:01:25] (CR) Mobrovac: [C: 2] Switch all job apart from exceptions for everything. [mediawiki-config] - https://gerrit.wikimedia.org/r/434891 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[10:02:30] elukey, moritzm : thanks! I'll check with him
[10:02:42] (Merged) jenkins-bot: Switch all job apart from exceptions for everything. [mediawiki-config] - https://gerrit.wikimedia.org/r/434891 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[10:02:47] (CR) jenkins-bot: Switch all job apart from exceptions for everything. [mediawiki-config] - https://gerrit.wikimedia.org/r/434891 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[10:03:08] (Abandoned) Aklapper: Allow discourse-mediawiki.wmflabs.org RSS feed on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: Aklapper)
[10:03:42] (CR) Aklapper: [C: 1] Add WMDS support question feed to mediawikiwiki RSS whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/434901 (https://phabricator.wikimedia.org/T185087) (owner: Gergő Tisza)
[10:04:48] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c6dc83d]: Enable all jobs apart from exceptions for everything. T190327
[10:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:54] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[10:05:46] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c6dc83d]: Enable all jobs apart from exceptions for everything. T190327 (duration: 00m 58s)
[10:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:26] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch all jobs to EventBus file 1/2 - T190327 T195500 (duration: 01m 39s)
[10:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:31] T195500: Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500
[10:09:33] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch all jobs to EventBus file 2/2 - T190327 T195500 (duration: 01m 47s)
[10:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:21] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4238262 (Pchelolo)
[10:12:13] (PS4) Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939)
[10:12:58] (CR) Jcrespo: [C: 1] db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[10:15:54] (CR) Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[10:16:06] (PS5) Marostegui: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939)
[10:25:04] (PS1) Elukey: role::druid::analytics|public::worker: remove hadoop-cdh client dep [puppet] - https://gerrit.wikimedia.org/r/435983 (https://phabricator.wikimedia.org/T192636)
[10:26:30] (CR) Elukey: [C: 2] role::druid::analytics|public::worker: remove hadoop-cdh client dep [puppet] - https://gerrit.wikimedia.org/r/435983 (https://phabricator.wikimedia.org/T192636) (owner: Elukey)
[10:26:46] (PS4) Volans: Initial working version [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299)
[10:29:31] (PS1) Aklapper: phabricator: List new and recent assignees [puppet] - https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780)
[10:33:21] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238328 (ayounsi)
[10:35:59] !log upgrading mw1308-mw1311 to hhvm-wikidiff 1.7.0 (HHVM bytecode cache needs to be pruned during rollout)
[10:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:31] (CR) Thiemo Kreuz (WMDE): [C: 1] "We do have a +1 from two of these people. To be honest I would remove Aude for now as she is – from what I know – currently not able to re" [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: ArielGlenn)
[10:52:22] (PS5) Volans: Initial working version [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299)
[10:53:01] Eqiad row C server move is starting. bast1002 first then s6 master
[10:53:19] !log Eqiad row C server move starting
[10:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:37] XioNoX: [nitpick] if you add the T number of the task to the log messages it will add the log on Phab too, might be easier to keep track for the people involved ;)
[10:54:59] yeah, I usually do it :)
[10:55:08] !log Eqiad row C server move starting - T187962
[10:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:12] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962
[10:56:03] (PS2) Marostegui: db1061: Upgrade socket location [puppet] - https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962)
[10:56:10] _joe_: bast1002 is moved and working
[10:56:18] <_joe_> XioNoX: <3
[10:56:30] (CR) Marostegui: [C: 2] db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[10:57:59] (Merged) jenkins-bot: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[10:58:31] (PS1) Marostegui: Revert "db-eqiad.php: Enable read-only for s6" [mediawiki-config] - https://gerrit.wikimedia.org/r/435985
[10:59:33] (CR) jenkins-bot: db-eqiad.php: Enable read-only for s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/435755 (https://phabricator.wikimedia.org/T194939) (owner: Marostegui)
[10:59:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Enable read only on s6 T194939 T187962 (duration: 01m 35s)
[11:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:06] T194939: Announce read-only time on-wiki to users of frwiki, ruwiki and jawiki (May 29, 2018) - https://phabricator.wikimedia.org/T194939
[11:02:35] (CR) Marostegui: [C: 2] db1061: Upgrade socket location [puppet] - https://gerrit.wikimedia.org/r/435757 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[11:03:46] shout if anything is bad
[11:04:12] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Enable read-only for s6" [mediawiki-config] - https://gerrit.wikimedia.org/r/435985 (owner: Marostegui)
[11:05:21] (CR) Jcrespo: [C: 1] Revert "db-eqiad.php: Enable read-only for s6" [mediawiki-config] - https://gerrit.wikimedia.org/r/435985 (owner: Marostegui)
[11:05:48] (Merged) jenkins-bot: Revert "db-eqiad.php: Enable read-only for s6" [mediawiki-config] - https://gerrit.wikimedia.org/r/435985 (owner: Marostegui)
[11:06:03] (CR) jenkins-bot: Revert "db-eqiad.php: Enable read-only for s6" [mediawiki-config] - https://gerrit.wikimedia.org/r/435985 (owner: Marostegui)
[11:08:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Disable read only on s6 T194939 T187962 (duration: 01m 37s)
[11:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:45] T194939: Announce read-only time on-wiki to users of frwiki, ruwiki and jawiki (May 29, 2018) - https://phabricator.wikimedia.org/T194939
[11:08:45] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962
[11:10:59] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.084 second response time
[11:13:43] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238455 (Marostegui) s6 primary master maintenance completed: Read only lasted from: 10:59:59 to 11:08:39 (times are in UTC)
[11:14:02] !log upgrading snapshot hosts to hhvm-wikidiff 1.7.0 (HHVM is unused, just for completeness)
[11:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:15] Going to start rack C2 which is DB, analytics and es
[11:14:44] Operations, DBA, Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238457 (Marostegui) This restart has been done, current values: ``` +------------+ | @@hostname | +------------+ | db1061 | +------------+...
[11:14:47] marostegui: io is a bit high on s6 master, which is expected, but FYI
[11:15:09] yeah, I was seeing that
[11:15:53] Operations, DBA, Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238458 (Marostegui) We can now revert: https://gerrit.wikimedia.org/r/#/c/435182/
[11:16:11] I wonder how new queues behave with a read only
[11:16:21] if they do not start and stop slowly
[11:17:33] (PS1) Marostegui: Revert "sanitarium_multi: Hardcode db1125 server_id" [puppet] - https://gerrit.wikimedia.org/r/435986
[11:17:39] (PS2) Marostegui: Revert "sanitarium_multi: Hardcode db1125 server_id" [puppet] - https://gerrit.wikimedia.org/r/435986
[11:17:51] (Abandoned) Marostegui: db-eqiad.php: Promote db1093 to master [mediawiki-config] - https://gerrit.wikimedia.org/r/435760 (https://phabricator.wikimedia.org/T187962) (owner: Marostegui)
[11:18:02] (CR) Jcrespo: [C: 1] Revert "sanitarium_multi: Hardcode db1125 server_id" [puppet] - https://gerrit.wikimedia.org/r/435986 (owner: Marostegui)
[11:18:17] (CR) Jcrespo: [C: 1] "It requires a restart, though." [puppet] - https://gerrit.wikimedia.org/r/435986 (owner: Marostegui)
[11:18:42] (CR) Marostegui: "> It requires a restart, though." [puppet] - https://gerrit.wikimedia.org/r/435986 (owner: Marostegui)
[11:18:44] (CR) Marostegui: [C: 2] Revert "sanitarium_multi: Hardcode db1125 server_id" [puppet] - https://gerrit.wikimedia.org/r/435986 (owner: Marostegui)
[11:19:21] should I do the topology changes back or should I wait a bit just in case?
[11:19:27] let's wait a bit
[11:19:33] I am not going to abandon the puppet change yet
[11:19:35] just in case
[11:19:48] yeah, it seems sane
[11:20:28] so aparently, the jobqueue doesn't care if mediawiki is in read only
[11:20:43] we got 5 user errors
[11:20:48] So we _have_ to set read only on a mysql level then?
[11:20:51] and 2519 jobqueue errors [11:20:58] no, I mean they error out [11:21:02] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4238479 (10Nuria) 05Open>03Resolved [11:21:04] not that they are successful [11:21:15] !log Restart db1125 mysql - T195595 [11:21:18] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [11:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:19] T195595: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595 [11:21:44] That dbproxy failover is probably part of the maintenance [11:23:04] Yeah, db1108 [11:23:12] 1004 is not active [11:23:18] ah! [11:23:24] so we can reload it right away [11:23:26] db1108 is in row C [11:23:43] yes, but remember the latest arch change [11:23:44] db1108 is still down to me [11:23:54] when it gets up [11:23:56] (03CR) 10Mark Bergsma: Adapt ProxyFetch tests to use tcpClients and sslClients (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 (owner: 10Mark Bergsma) [11:25:02] 10Operations, 10DBA, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238487 (10Marostegui) 05Open>03Resolved a:03Marostegui This is all done, including restarting db1125:s6 to pick up the new server_id.
[11:28:59] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [11:29:44] (03CR) 10Mark Bergsma: Use MemoryReactorClock for testing the UDP monitor (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/435763 (owner: 10Mark Bergsma) [11:29:55] (03PS2) 10Mark Bergsma: Use MemoryReactorClock for testing the UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/435763 [11:29:56] (03PS5) 10Mark Bergsma: Adapt ProxyFetch tests to use tcpClients and sslClients [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 [11:29:59] (03PS2) 10Mark Bergsma: Implement common base class for "looping check" monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/435764 [11:30:01] (03PS1) 10Marostegui: Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 [11:30:12] (03CR) 10Marostegui: [C: 04-2] "Wait for the network maintenance to be finished" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [11:31:02] (03CR) 10Mark Bergsma: [C: 032] Use MemoryReactorClock for testing the UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/435763 (owner: 10Mark Bergsma) [11:31:33] (03Merged) 10jenkins-bot: Use MemoryReactorClock for testing the UDP monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/435763 (owner: 10Mark Bergsma) [11:33:43] Moving on to C3, mostly db, ores, analytics [11:34:12] (03CR) 10Mark Bergsma: [C: 032] Adapt ProxyFetch tests to use tcpClients and sslClients [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 (owner: 10Mark Bergsma) [11:34:50] (03Merged) 10jenkins-bot: Adapt ProxyFetch tests to use tcpClients and sslClients [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 (owner: 10Mark Bergsma) [11:38:20] jynus, marostegui: I noticed you have a lot going on. 
Any suggestions when would be a good time to run a maintenance script (refresh-translatable-pages) on a bunch of wikis to not disrupt anything you work on? [11:38:36] not today [11:38:38] Nikerabbit: can you wait a bit until the network maintenance is finished? [11:38:40] send a ticket [11:38:46] We are monitoring still [11:39:11] or do you mean you want to run it? [11:39:41] jynus: yeah I would run it myself, but I can wait [11:39:47] I would wait a few hours [11:40:05] sure, no problem [11:41:25] Nikerabbit: while technically nothing is problematic new, maintenance is ongoing right now [11:41:28] *now [11:41:33] (03PS1) 10Muehlenhoff: Update SSH key for ciro [puppet] - 10https://gerrit.wikimedia.org/r/435988 [11:41:37] which could arise problems at any moment [11:41:44] plus s6 master is a bit cold [11:48:59] (03CR) 10Muehlenhoff: [C: 032] Update SSH key for ciro [puppet] - 10https://gerrit.wikimedia.org/r/435988 (owner: 10Muehlenhoff) [11:59:09] Moving to C4, wide variety of hosts, including 2 ganeti servers [12:03:44] PROBLEM - Host ganeti1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:17] (03PS3) 10Mark Bergsma: Implement common base class for "looping check" monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/435764 [12:04:55] ...once upon a time i knew by heart which servers were in which racks... [12:05:14] RECOVERY - Host ganeti1001 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [12:12:13] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:14:03] 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#4238645 (10MoritzMuehlenhoff) [12:17:14] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:18] good [12:17:32] I've run puppet on ganeti1001, it was a temporary failure of puppetdb [12:18:28] thanks [12:18:45] Moving to C5, wide variety of hosts as well [12:34:48] Moving to C6, LVS and MW [12:50:03] jouncebot: next [12:50:03] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1300) [12:52:57] 10Operations: Broken pinning on some WMCS servers - https://phabricator.wikimedia.org/T195835#4238754 (10MoritzMuehlenhoff) [12:54:30] !log installing xdg-utils security updates [12:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:24] Moving to C7, various hosts [12:59:18] 10Operations: Broken pinning on some WMCS servers - https://phabricator.wikimedia.org/T195835#4238785 (10chasemp) a:03aborrero [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:23] I can SWAT today [13:03:33] o/ I have a patch to swat if not to late to add the deployment page (and jenkins gives me a V+2) [13:04:01] dcausse: it's never to late :) [13:04:06] great :) [13:04:08] Urbanecm: around for SWAT? 
[13:05:05] dcausse: go ahead, looks like Urbanecm is not yet around [13:05:21] zeljkof: I'm waiting for jenkins on https://gerrit.wikimedia.org/r/#/c/435121/ [13:06:44] PROBLEM - Host argon is DOWN: PING CRITICAL - Packet loss = 100% [13:06:54] PROBLEM - Host darmstadtium is DOWN: PING CRITICAL - Packet loss = 100% [13:07:04] PROBLEM - Host etcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:24] PROBLEM - Host proton1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:24] PROBLEM - Host krypton is DOWN: PING CRITICAL - Packet loss = 100% [13:07:24] PROBLEM - Host puppetboard1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:34] PROBLEM - Host puppetdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:35] XioNoX: ^ [13:07:44] PROBLEM - Host mendelevium is DOWN: PING CRITICAL - Packet loss = 100% [13:07:44] PROBLEM - Host etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:44] PROBLEM - Host ganeti1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:44] PROBLEM - Host ping1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:44] PROBLEM - Host dbmonitor1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:08] looking [13:08:13] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:08:42] <_joe_> uh? [13:08:48] <_joe_> why ganeti1004? [13:09:13] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:13] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:23] <_joe_> uhm that doesn't look good [13:09:32] <_joe_> puppetdb1001 is off [13:09:37] <_joe_> puppet will fail everywhere [13:09:43] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:10:03] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:12] _joe_: puppetdb is physical? [13:10:20] <_joe_> no virtual on ganeti1004 I guess [13:10:35] <_joe_> are we moving ganeti1004 right now? [13:10:44] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:44] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:44] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:44] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:45] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:45] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:53] zeljkof, I'm here [13:10:54] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:54] <_joe_> someone kill ircecho [13:10:55] A little bit late [13:10:57] Sorry [13:11:14] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:14] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:14] PROBLEM - puppet last run on poolcounter1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:11:14] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:14] PROBLEM - puppet last run on db1123 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:14] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:23] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:23] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:24] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:24] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:34] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:35] zeljkof: still -1, I'll postpone the deploy, if the train rolls forward the will be deployed on group1 wednesday I bet it's good enough [13:11:37] <_joe_> this shower of alerts is expected [13:11:54] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:54] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:54] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:11:59] Urbanecm: no problem, let me just ask about all the problems reported [13:12:13] dcausse: ok, good luck :) [13:12:13] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:13] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:26] Sure, I have no problem with additional waiting. [13:12:31] !log stopped ircecho temporarily [13:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:38] XioNoX, _joe_: can we continue with SWAT, or should we wait? [13:13:03] 10Operations, 10Analytics, 10Performance-Team, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (10Gilles) [13:13:07] zeljkof: best to wait a bit, should be resolved soon [13:13:16] moritzm: ok, thanks! [13:13:32] Urbanecm: I'll review the patches while waiting [13:13:37] ack [13:13:38] puppetdb1001 is back on [13:13:49] moritzm: just please let me know when we can start SWAT [13:14:00] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238825 (10Gehel) >>! In T187962#4206574, @Joe wrote: > The only thing we really need to check beforehand are the distributed datastores like cassandra and ela... 
[13:14:20] (03PS2) 10Zfilipin: Revert "Revert "Revert "Temp rate limit for arwiki due to mass vandalism""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435629 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:14:32] the network link has been switched back and VMs seem to work fine, let me quickly check the mwdebug servers [13:14:59] I can force puppet runs on failed hosts [13:15:06] to clear alarms and re-start ircecho [13:15:42] that would be useful [13:15:47] _joe_, XioNoX ok whit that? ^^^ [13:15:52] *with [13:15:54] as there were other things affected that I need to see [13:16:09] volans: yes [13:16:26] * volans running [13:16:32] zeljkof: let's wait a bit until alarms have cleared and ircecho is back (so that there's proper alerting just in case SWAT causes problems) [13:16:37] !log running puppet on failed only hosts [13:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] moritzm: sure [13:17:06] (03CR) 10Zfilipin: [C: 031] Revert "Revert "Revert "Temp rate limit for arwiki due to mass vandalism""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435629 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:17:34] zeljkof, I think you can do CR+2, merge doesn't matter, deploying can cause issues :) [13:17:41] It'll save us from jenkins waiting [13:17:46] But up to you [13:18:24] Urbanecm: config changes should not take more than a minute to merge anyway, I'll wait until we can deploy [13:18:35] zeljkof, should not, but from time to time, they do [13:19:29] Urbanecm: I can merge the first one only anyway, I can't merge all of them, all 3 patches are for the same file, so they would all get deployed at the same time [13:19:49] !log rolling restart of relforge for plugin upgrade - T193734 [13:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:53] T193734: Move Serbian language wikis from extra-analysis to extra-analysis-serbian plugin - 
https://phabricator.wikimedia.org/T193734 [13:20:02] That doesn't matter as well, I have no problem with testing them at once [13:20:04] and CI looks completely free [13:20:10] gehel: any of them on rowC? [13:20:25] just to avoid mixing things ;) [13:20:40] But as I said, just thinking about possibilities. [13:20:50] Urbanecm: the problem is if something goes wrong, it might be hard to tell what caused the problem if there are multiple reasons, that's why we merge and deploy commits one by one [13:21:05] volans: actually, yes, relforge1002 is on row C... [13:21:21] Ok, ack [13:21:34] you might want to check with XioNoX if that was already moved or not, and decide accordingly ;) [13:21:45] jynus: still 50%, I'll log once done [13:21:52] there isn't user traffic on relforge, so no big deal, but yeah, make sense to not mix [13:21:58] volans: thanks for the reminder! [13:22:01] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4238867 (10Papaul) @MoritzMuehlenhoff yes we do have some decommissioned servers. This can be also a bad main board. What we can do first is to swap CPU position. Since the error is showing on CPU1 we can move CPU1 to CPU0 a... [13:22:02] np [13:22:14] (03CR) 10Zfilipin: [C: 031] Create 2 extra namespaces for bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:22:27] relforge1002 has been moved [13:22:40] XioNoX: thanks! so I'll continue with that restart [13:23:34] (03CR) 10Zfilipin: [C: 031] Add 2 namespace aliases to bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:24:23] Urbanecm: is there a reason 435693 and 435694 are in two commits, looks like it could be just one? 
(no preference on my side, just curious) [13:24:42] (I actually prefer smaller commits, so two commits is fine with me, just asking) [13:24:54] 10Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T195306#4238890 (10Papaul) @Gehel is it okay to resolve this task? [13:25:25] zeljkof, I did one request, uploaded patch, asked for review, and I was asked if I included the aliases. Was too lazy to download commit and amend it :D [13:25:30] There is no technical reason, just this [13:25:42] !log powered down mw2182 for hardware diagnosis [13:25:43] Urbanecm: no problem, that's what I thought :) [13:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:13] 10Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T195306#4238926 (10Gehel) 05Open>03Resolved @Papaul I thought I did resolve it already, but I think we had a duplicate. So yes, resolving! [13:26:15] Urbanecm: in that case, those two commits could be merged and deployed together [13:26:36] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4238932 (10MoritzMuehlenhoff) >>! In T194835#4238867, @Papaul wrote: > @MoritzMuehlenhoff yes we do have some decommissioned servers. This can be also a bad main board. What we can do first is to swap CPU position. Since the... 
[13:27:04] Up to you, I can test them in both situations [13:28:25] !log puppet run on failed hosts completed [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:41] I'm waiting icinga to catch up to re-enable ircecho, should be a couple of minutes more [13:29:30] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4238937 (10Papaul) a:03Papaul [13:30:06] (03PS1) 10Elukey: turnilo: disable instrospection autofill-all for pageviews-* [puppet] - 10https://gerrit.wikimedia.org/r/435997 (https://phabricator.wikimedia.org/T195819) [13:31:26] (03CR) 10Elukey: [C: 032] turnilo: disable instrospection autofill-all for pageviews-* [puppet] - 10https://gerrit.wikimedia.org/r/435997 (https://phabricator.wikimedia.org/T195819) (owner: 10Elukey) [13:31:59] (03CR) 10Joal: [C: 031] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/435997 (https://phabricator.wikimedia.org/T195819) (owner: 10Elukey) [13:32:07] !log restarted ircecho [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:19] jynus, _joe_, XioNoX ^^^ FYI [13:32:27] thx [13:32:56] zeljkof, what's status? [13:33:42] Urbanecm: still waiting, moritzm do you have ETA on when the problems would be resolved? (asking for a friend) [13:33:45] zeljkof: you can resume [13:33:46] :D [13:33:54] moritzm: thanks! [13:33:58] icinga has cleared and ircecho is back on [13:34:09] Urbanecm: please stand by, merging the first commit [13:34:32] ack [13:34:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435629 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:35:07] (03PS2) 10Marostegui: Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 [13:35:10] (03CR) 10Legoktm: [C: 04-1] "This won't work, production servers can't talk to *.wmflabs.org." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/434901 (https://phabricator.wikimedia.org/T185087) (owner: 10Gergő Tisza) [13:36:01] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Temp rate limit for arwiki due to mass vandalism""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435629 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:36:17] (03PS2) 10Zfilipin: Create 2 extra namespaces for bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:36:56] Urbanecm: anything to test with 435629? [13:36:57] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4238969 (10chasemp) ping :) I know there is much shuffling happening, it would be useful if this could happen sometime this week [13:37:03] (it's at mwdebug) [13:37:33] zeljkof, push it to production directly please [13:37:44] Urbanecm: deploying [13:38:01] will merge the next two commits in one go [13:38:05] ack both msgs [13:38:39] waiting for the first deploy, to make sure everything is ok [13:38:51] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4238981 (10ayounsi) [13:39:28] (03CR) 10jenkins-bot: Revert "Revert "Revert "Temp rate limit for arwiki due to mass vandalism""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435629 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:39:31] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:435629|Revert "Revert "Revert "Temp rate limit for arwiki due to mass vandalism""" (T192668)]] (duration: 01m 51s) [13:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:37] ack zeljkof [13:39:37] T192668: Mass vandalism in ar.wikipedia (throttle edits using wgRateLimits) - 
https://phabricator.wikimedia.org/T192668 [13:39:42] Urbanecm: 435629 deployed [13:39:45] ack [13:39:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:40:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] Initial working version (031 comment) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:41:00] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4239000 (10ayounsi) [13:41:08] (03Merged) 10jenkins-bot: Create 2 extra namespaces for bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:41:41] (03PS2) 10Zfilipin: Add 2 namespace aliases to bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:41:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:41:54] (03PS1) 10Ottomata: Remove os conditionals in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/436008 [13:42:00] !log upgrading remaining job runners in eqiad to hhvm-wikidiff 1.7.0 [13:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:51] zeljkof: give me a ping once your all wrapped up please :) [13:43:09] addshore: sure, in a few minutes [13:43:34] (03Merged) 10jenkins-bot: Add 2 namespace aliases to bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:44:35] (03PS2) 10Elukey: Swap zookeeper from conf1003 to conf1006 [puppet] - 10https://gerrit.wikimedia.org/r/435963 
(https://phabricator.wikimedia.org/T182924) [13:44:43] Urbanecm: 435693 and 435694 are at mwdebug [13:45:26] (03CR) 10jenkins-bot: Create 2 extra namespaces for bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:45:30] (03CR) 10jenkins-bot: Add 2 namespace aliases to bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:45:46] zeljkof, please push them to the whole universe [13:46:06] Urbanecm: deploying [13:46:10] ack [13:46:18] eqiad row C maintenance is completed for today [13:46:29] (03PS1) 10Cmjohnson: remove dns osm-cp10* osm-web10* [dns] - 10https://gerrit.wikimedia.org/r/436009 (https://phabricator.wikimedia.org/T182033) [13:47:38] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:435693|Create 2 extra namespaces for bdwikimedia (T195700)]] (duration: 01m 39s) [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:43] T195700: Create new namespace for bd.wikimedia.org - https://phabricator.wikimedia.org/T195700 [13:47:51] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11305/stat1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/436008 (owner: 10Ottomata) [13:47:54] (03PS2) 10Ottomata: Remove os conditionals in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/436008 [13:47:58] (03CR) 10Ottomata: [V: 032 C: 032] Remove os conditionals in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/436008 (owner: 10Ottomata) [13:47:59] Urbanecm: deployed, please test and thanks for deploying with #releng [13:48:07] addshore: all done [13:48:08] zeljkof, please post mwscript --wiki=bdwikimedia namespaceDupes.php to the task [13:48:14] cool! [13:48:28] Urbanecm: will do [13:48:45] Thanks. 
If there will be anything to fix, please run it with --fix as well [13:50:18] (03CR) 10Zfilipin: [C: 032] "T195700#4239039" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700) (owner: 10Urbanecm) [13:50:34] Urbanecm: nothing to fix https://phabricator.wikimedia.org/T195700#4239039 [13:51:11] (03CR) 10Rush: [C: 031] "I haven't found any active use of atd for Cloud anywhere, afaict this is good to drop :)" [puppet] - 10https://gerrit.wikimedia.org/r/435171 (owner: 10Muehlenhoff) [13:51:16] zeljkof, ack, thanks! [13:52:08] (03PS3) 10Elukey: Swap zookeeper from conf1003 to conf1006 [puppet] - 10https://gerrit.wikimedia.org/r/435963 (https://phabricator.wikimedia.org/T182924) [13:52:37] jouncebot next [13:52:37] In 0 hour(s) and 7 minute(s): Wikibase - Re enable wb_terms things (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1400) [13:52:43] * addshore is going to start now [13:52:55] got 2 backports before I get anywhere interesting [13:53:44] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4239047 (10ayounsi) First and main round of server move done. Went well overall, thanks to everybody who chipped in. Some notes: * Faulty SFP-T for ganeti1004... [13:55:13] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:34] bah, there is another commit on .4 of mediawiki that isn't deployed, but i bet the submodule has it << legoktm [13:55:46] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4239050 (10jcrespo) > 8min read only time, included db maintenance Yes, it was only that long because we had programmed a restart to reuse the read only windo... [13:56:22] AndyRussG: I think it is yours? 
[13:56:28] AndyRussG: https://gerrit.wikimedia.org/r/#/c/435817/ [13:56:44] !log rolling back ns0 and ping1001 redirects - T187962 [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:49] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [13:56:50] addshore: legoktm: hi! [13:57:16] tin:/srv/mediawiki-staging/php-1.32.0-wmf.4 I see https://gerrit.wikimedia.org/r/#/c/435817/ still, did you deploy this yesterday? [13:58:04] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [13:58:13] twentyafterfour was going to deploy that CN thing yesterday when he updated stuff yesterday, but I think he was blocked by some CI issue [13:58:16] https://www.irccloud.com/pastebin/g3rAFLPc/ [13:59:08] (03PS1) 10DCausse: Remove extra-analysis [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/436011 [13:59:16] AndyRussG: hmm [13:59:19] *looks at SAL* [13:59:29] addshore It would be great to get it deployed [13:59:46] 19:22 twentyafterfour@tin: Synchronized php-1.32.0-wmf.5/extensions/CentralNotice/: sync wmf.5 CentralNotice for AndyRussG (duration: 01m 25s) [13:59:49] looks like it was done on .5 but not on .4 [13:59:55] addshore: hmm ok [14:00:04] addshore: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikibase - Re enable wb_terms things deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1400). [14:00:30] AndyRussG: want it on .4 too? (currently group2 is still on wmf.4) [14:00:35] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847#4239084 (10Physikerwelt) [14:00:52] i guess for CN the same branch is used for all wmf branches.... [14:01:02] addshore: yes that'd be great... 
Though if group2 is about to go to .5 it's not a huge issue [14:01:20] There are times when it's best for all CN versions to coordinate, but it isn't essential for this patch [14:01:34] AndyRussG: I'll sync it then. Its not an issue for users perhaps, but it is annoying for deployers worrying about deploying patches they were not expecting! [14:01:44] (03CR) 10Gehel: [V: 032 C: 032] "LGTM, checksums verified" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/436011 (owner: 10DCausse) [14:01:51] addshore: yeah...I don't know what happened yesterday [14:02:05] so, dunno why things ended up in this state [14:02:13] (03PS3) 10Jcrespo: Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [14:02:55] addshore shall I add this to the SWAT on the Deployments page just to record it happened? [14:03:05] (I mean, the patch) [14:03:12] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11306/conf1006.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/435963 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:03:24] It will be in the SAL, that will probably be enough [14:03:28] okok [14:03:34] !log swap zookeeper from conf1003 to conf1006 [14:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] AndyRussG: i commented on the gerrit ticket too [14:04:09] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4239130 (10MoritzMuehlenhoff) @Lea_WMDE, @WMDE-Fisch : wikidiff 1.7.0 has been rolled out to all our application servers in production (activ... [14:04:18] addshore: thx!!!! 
[14:05:21] (03CR) 10Jcrespo: [C: 031] Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [14:06:01] AndyRussG: syncing now [14:07:23] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/CentralNotice: [[gerrit:435817|Convert numerical URL parameters to numbers]] for AndyRussG (was left on tin) (duration: 01m 25s) [14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:09] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4239178 (10ayounsi) [14:10:55] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436000|track all wb_terms table access via statsd]] (duration: 02m 21s) [14:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:43] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4239183 (10Lea_WMDE) @MoritzMuehlenhoff great, thank you! 
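Note on the CentralNotice mismatch above (synced on wmf.5 but not on wmf.4): this kind of gap can be spotted by grouping SAL sync entries by branch. A minimal sketch, assuming entries keep the `Synchronized php-<branch>/extensions/<name>` shape seen in this log; `synced_branches` is a hypothetical helper for illustration and is not part of scap or any Wikimedia tooling.

```python
import re
from collections import defaultdict

# Matches the path part of sync entries as they appear in this log, e.g.
# "Synchronized php-1.32.0-wmf.4/extensions/CentralNotice: ..."
SYNC_RE = re.compile(
    r"Synchronized php-(?P<branch>[\w.]+-wmf\.\d+)/extensions/(?P<ext>\w+)"
)


def synced_branches(entries):
    """Map each extension name to the set of branches it was synced on."""
    seen = defaultdict(set)
    for entry in entries:
        match = SYNC_RE.search(entry)
        if match:
            seen[match.group("ext")].add(match.group("branch"))
    return dict(seen)


# Two entries copied from this log.
entries = [
    "twentyafterfour@tin: Synchronized php-1.32.0-wmf.5/extensions/CentralNotice/: "
    "sync wmf.5 CentralNotice for AndyRussG (duration: 01m 25s)",
    "addshore@tin Synchronized php-1.32.0-wmf.4/extensions/CentralNotice: "
    "[[gerrit:435817|Convert numerical URL parameters to numbers]] for AndyRussG "
    "(was left on tin) (duration: 01m 25s)",
]
result = synced_branches(entries)
```

An extension whose set is missing a branch that is still serving traffic (here wmf.4 for group2, before the second sync) is a candidate for a follow-up sync.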
[14:12:26] (03PS1) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:12:48] addshore: K just waiting for ResourceLoader to turn over to see the new code live [14:12:52] (it isn't there yet) [14:13:14] (03CR) 10jerkins-bot: [V: 04-1] Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:13:14] !log comleted rolling restart of relforge for plugin upgrade - T193734 [14:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] T193734: Move Serbian language wikis from extra-analysis to extra-analysis-serbian plugin - https://phabricator.wikimedia.org/T193734 [14:13:31] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436001|track all wb_terms table access via statsd]] (duration: 02m 19s) [14:13:32] Dunno if it's relevant that https://en.wikipedia.org/wiki/Special:Version still shows the previous CN sha [14:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:59] (03PS2) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:14:41] (03CR) 10jerkins-bot: [V: 04-1] Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:15:28] (03PS3) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:16:47] addshore: Do you mind if I merge and deploy a db-eqiad change? we are in a bit of a hurry [14:16:52] marostegui: go for it [14:16:56] :) [14:16:56] thanks! 
[14:17:05] (03CR) 10Marostegui: [C: 032] Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [14:18:12] (03Merged) 10jenkins-bot: Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [14:19:37] (03CR) 10jenkins-bot: Revert "mariadb: Depool all row C databases (except s6 master)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435987 (owner: 10Marostegui) [14:19:41] (03PS4) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:19:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool all databases in row C - T187962 (duration: 01m 19s) [14:19:48] addshore: All done, thank you! [14:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:51] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [14:20:28] (03CR) 10jerkins-bot: [V: 04-1] Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:21:35] marostegui: no problem [14:22:24] addshore: the new code from that commit is indeed there and working fine on enwiki, so all good there... Is there some lag or issue on Special:Version? [14:22:40] hmm, issue with special:verison? [14:22:41] thxs so much again, btw! 
[14:23:00] (03PS1) 10Muehlenhoff: Create new admin group with root access on maps test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) [14:23:02] (03PS1) 10Muehlenhoff: Add Stas to maps-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) [14:23:30] (03CR) 10jerkins-bot: [V: 04-1] Create new admin group with root access on maps test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [14:23:39] (03CR) 10jerkins-bot: [V: 04-1] Add Stas to maps-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [14:23:45] (03PS1) 10WMDE-Fisch: Enable detection of changes in moved paragraphs on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436017 (https://phabricator.wikimedia.org/T195375) [14:24:15] (03PS5) 10Ottomata: Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 [14:24:31] !log roll restart kafka on kafka100[1-3] (job queues) to pick up the new zookeeper settings [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:45] addshore: yeah, the commit it points to for CentralNotice isn't correct [14:24:56] (03CR) 10jerkins-bot: [V: 04-1] Create profile::analytics::cluster::packages class [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:25:08] AndyRussG: on which site? 
[14:25:24] https://en.wikipedia.org/wiki/Special:Version [14:25:55] And that's also where I tested the update u just sync'ed, and the new code is indeed there [14:26:05] AndyRussG: not sure, i do know there is some caching there though [14:26:09] so it could be that [14:26:51] (03PS2) 10Muehlenhoff: Create new admin group with root access on maps test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) [14:27:12] addshore: hmmm okok... I'll check back again a bit later and see if it's updated, and if not, file a task or ping folks or something :) [14:27:51] (03PS2) 10Muehlenhoff: Add Stas to maps-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) [14:29:40] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436003|Re add TermSqlIndex::getMatchingTerms select, but dont call]] (duration: 02m 13s) [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:17] 10Operations, 10ops-eqiad: mw1280: CPU error - https://phabricator.wikimedia.org/T195734#4239254 (10Cmjohnson) pasting the racadm log before I clear it Record: 78 Date/Time: 05/27/2018 04:42:25 Source: system Severity: Critical Description: CPU 1 machine check error detected. ------------------... 
[14:31:48] (03CR) 10Volans: "Thanks Alex for the time to review this and the extensive chat we had" (031 comment) [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:32:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission osm-web100[1-4] - https://phabricator.wikimedia.org/T182033#4239258 (10Cmjohnson) [14:32:19] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436004|Re add TermSqlIndex::getMatchingTerms select, but dont call]] (duration: 02m 18s) [14:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:50] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-cp100[1-4] - https://phabricator.wikimedia.org/T182034#4239261 (10Cmjohnson) [14:33:37] marostegui: FYI i'll be making that method return stuff again in the next 10 min [14:33:38] *mins [14:33:57] addshore: I don't need any more deployemts :) [14:34:07] also, of vauge interest, here is the extra monitoring we added inside wikibase for all methods querying wb_terms https://grafana-admin.wikimedia.org/dashboard/db/wikibase-wb_terms [14:34:20] ah nice [14:34:38] the problem method isn't currently listed there as it isn't turned on :) but it will be [14:34:48] addshore: typo: fetchTemrs [14:34:49] ;) [14:35:03] volans: balls [14:35:13] oh well... it is in graphite now, that typo will remain for now ;) [14:35:36] if fixed we can delete the files with the wrong name [14:35:40] are you monitoring every single query? [14:35:48] volans: indeed, and merge them [14:35:57] using wb_terms, I mean [14:36:13] jynus: yes, essentially [14:36:43] wouldn't that have a bad impact on performance (a write generated by a read?) [14:37:02] all reads to mw generate data in graphite [14:37:03] I don't know what is the rate [14:37:16] /all/most/ [14:37:24] 250K per second? 
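Note on the "write generated by a read" concern: the replies that follow indicate the counts are accumulated in memory and sent batched with other metrics via statsd, so a high call rate need not mean one packet per query. A minimal sketch of that idea, assuming the standard statsd counter line format (`name:value|c`, with `|@rate` when sampled); `BufferedCounters` and the metric key are hypothetical and this is not MediaWiki's actual statsd client.

```python
import random
from collections import Counter


class BufferedCounters:
    """Accumulate counter increments in memory and emit one batch on flush.

    Hypothetical illustration only; not MediaWiki's statsd client.
    """

    def __init__(self, sample_rate=1.0, rng=random.random):
        self.sample_rate = sample_rate
        self._rng = rng
        self._counts = Counter()

    def increment(self, key):
        # With sampling enabled, only a fraction of calls are recorded at all.
        if self.sample_rate >= 1.0 or self._rng() < self.sample_rate:
            self._counts[key] += 1

    def flush(self):
        """Return statsd-style counter lines and reset the buffer."""
        suffix = f"|@{self.sample_rate}" if self.sample_rate < 1.0 else ""
        lines = [
            f"{key}:{count}|c{suffix}"
            for key, count in sorted(self._counts.items())
        ]
        self._counts.clear()
        return lines


# Many reads during a request collapse into a single line at flush time.
counters = BufferedCounters()
for _ in range(1000):
    counters.increment("wikibase.wb_terms.getMatchingTerms")  # hypothetical key
batch = counters.flush()
```

The `sample_rate` parameter corresponds to the suggestion to sample if the unsampled rate turns out to matter; the receiving side is expected to scale sampled counts back up using the `@rate` suffix.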
[14:37:45] jynus: yes, it's for the main purpose of that table [14:38:02] if we fix that part, we can practically drop the table [14:38:11] I am not complaining [14:38:40] just that you may want to sample or evaluate the impact of that [14:38:51] I'm just saying these are two different things: 1- using wb_terms as search backend 2- getting labels of Q-ids [14:39:18] the former should not happen at all, the second is expected and needs to be fixed in soonish(tm) future [14:39:27] RECOVERY - Host labvirt1019 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [14:39:56] 250k per minute in that count doesn't mean it's actually send 250k packets to graphite [14:39:59] *sent [14:40:04] ok [14:40:17] if it is accumulated on memory it is ok [14:40:31] yup, and packged up with other metrics being sent [14:40:36] ok [14:40:40] and via statsd too? [14:41:02] volans: yup [14:41:29] (03CR) 10Gehel: [C: 04-1] "See comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: 10Muehlenhoff) [14:41:36] yeah should be fine, I was looking at the graphite dashboard and don't see anything wrong so far [14:41:56] volans: im not sure there is actually any monitoring of metric throughput for statsd? [14:42:00] (03PS1) 10Cmjohnson: Found old ipv6 dns for cp1069/1070 [dns] - 10https://gerrit.wikimedia.org/r/436021 (https://phabricator.wikimedia.org/T130884) [14:42:52] addshore: not sure tbh [14:44:42] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4239349 (10Cmjohnson) labvirt1019 is now connected to the new switch and the second ethernet port is connected. @ayounsi can you help gett... [14:45:14] (03CR) 10Volans: [V: 032 C: 032] "Merging as agreed, CR with the generated artifacts will follow." 
[software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/432597 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:45:56] Amir1: going with https://gerrit.wikimedia.org/r/#/c/436006/ on .4 first, but dont think anything is even calling the method on .4 [14:46:26] maybe .5 is better [14:46:38] I may as well sync .4 first :) [14:46:53] syncing .4 [14:47:19] (03PS2) 10Dzahn: phabricator: List parent projects for archived projects with open tasks [puppet] - 10https://gerrit.wikimedia.org/r/435776 (owner: 10Aklapper) [14:48:35] (03CR) 10Dzahn: "tested in prod. result is:" [puppet] - 10https://gerrit.wikimedia.org/r/435776 (owner: 10Aklapper) [14:49:09] !log addshore@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436006|TermSqlIndex::getMatchingTerms actually execute select]] (duration: 02m 18s) [14:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] RECOVERY - configured eth on labvirt1020 is OK: OK - interfaces up [14:49:33] Amir1: .4 done [14:49:41] * addshore prepares the .5 patch [14:49:58] tendril is clean [14:50:02] (03PS1) 10Volans: Release v0.1.0 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436024 (https://phabricator.wikimedia.org/T191299) [14:50:04] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4239389 (10Cmjohnson) @ayounsi that goes for labvirt1020 as well, I connected the 2nd port to asw2-b ge-7/0/14 [14:50:19] Amir1: yeh, after looking at the adhoc log I think that last sync was essentially a noop [14:50:24] nothing calls it :) [14:51:03] property Id resolver calls it and it has 100 calls per hour [14:51:08] that worries me :/ [14:51:25] hmmmm [14:51:35] maybe noone uses property id resolver? 
:P [14:51:46] they shouldn't [14:51:52] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4239398 (10ayounsi) ge-4/0/33 labvirt1019:eth1 moved to the cloud-instance-ports interface-range. [14:51:53] but they do [14:51:56] (03PS2) 10Cmjohnson: remove dns osm-cp10* osm-web10* [dns] - 10https://gerrit.wikimedia.org/r/436009 (https://phabricator.wikimedia.org/T182033) [14:52:17] (03PS10) 10Volans: debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) [14:52:20] (03CR) 10Cmjohnson: [C: 032] remove dns osm-cp10* osm-web10* [dns] - 10https://gerrit.wikimedia.org/r/436009 (https://phabricator.wikimedia.org/T182033) (owner: 10Cmjohnson) [14:52:40] Amir1: I'll wait 5 mins before doing .5 [14:53:29] the adhoc log is super quiet now we turned all the callers off one by one :P [14:54:28] (03PS2) 10Cmjohnson: Found old ipv6 dns for cp1069/1070 [dns] - 10https://gerrit.wikimedia.org/r/436021 (https://phabricator.wikimedia.org/T130884) [14:54:56] tendril is still clean [14:54:59] Amir1: right, I'll go for the .5 sync [14:55:13] since the index is there, it should not make much problem [14:55:17] (03CR) 10Cmjohnson: [C: 032] Found old ipv6 dns for cp1069/1070 [dns] - 10https://gerrit.wikimedia.org/r/436021 (https://phabricator.wikimedia.org/T130884) (owner: 10Cmjohnson) [14:55:20] indeed [14:55:21] that's the biggest reason [14:55:34] syncing [14:56:00] Amir1: what tendril link are you watching? 
[14:56:16] https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=wikidatawiki&qmode=eq&query=wb_terms&hours=2 [14:57:15] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239421 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [14:57:23] Amir1: added a link to it from the grafana dashboard [14:57:29] !log Move s6 topology back to its normal status [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:44] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436007|TermSqlIndex::getMatchingTerms actually execute select]] (duration: 02m 19s) [14:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:48] there we go [14:59:19] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239430 (10Marostegui) Thanks! ``` physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Rebuilding) ``` [14:59:22] marostegui: all look good to you? [14:59:40] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:51] it wasn't us ^ [15:00:10] (I know you know it but just to be clear) [15:00:47] hahaah [15:00:50] thanks Amir1 :D [15:01:04] did you see my message about ores::base ? [15:01:09] you can now send again the patch [15:01:10] I'll merge it [15:01:53] thanks! 
[15:02:03] Amir1: I saw one of the queries show up in the slow log on tendril :P [15:02:42] :/ [15:02:50] (03CR) 10Dzahn: [C: 032] phabricator: List parent projects for archived projects with open tasks [puppet] - 10https://gerrit.wikimedia.org/r/435776 (owner: 10Aklapper) [15:03:10] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847#4239437 (10MoritzMuehlenhoff) p:05Triage>03Normal Should we split this into three tickets since the actionables (and people acting on those) are fairly disjunct? (So one task to remove it fr... [15:03:14] Given that it's 1% of what it used to happen we have some time [15:03:27] but we need to disable or fix that thing ASAP [15:03:31] which way to go? [15:03:33] and an increase in innodb I/O on one of the slaves at the time of sync, but I guess that is probably normal for what we did [15:03:36] https://usercontent.irccloud-cdn.com/file/o29zb2qb/image.png [15:04:01] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239454 (10Marostegui) a:05Marostegui>03Papaul Disk failed ``` physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Failed) ``` Can we get another one? [15:04:22] and that dropped off now, I guess that is simply it realizing it needs to use these indexes again [15:04:27] actually fixing it is almost impossible as all of the elastic things are part of repo and not client... [15:05:37] so, the 3 things that remain disabled are currently PropertySuggestor, the ArticlePlaceholder search links and ItemDisambiguation? 
[15:06:22] yup and the property id resolver is enabled [15:06:59] (03PS2) 10Dzahn: phabricator: List new and recent assignees [puppet] - 10https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [15:07:13] ACKNOWLEDGEMENT - HP RAID on labvirt1019 is CRITICAL: CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195862 [15:07:18] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T195862#4239492 (10ops-monitoring-bot) [15:07:38] Amir1: looking at the grafana graph getMatchingTerms is one of the things that hits that table the least, but it is just the worst :P [15:07:50] exactly [15:08:30] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847#4239495 (10Physikerwelt) @MoritzMuehlenhoff yes. I just started with this list as overview while brainstorming. [15:09:13] addshore: now, do you want to backport the change on property suggester and enable it? [15:09:26] Amir1: see #wikimedia-de-tech :) [15:09:46] (03PS3) 10Muehlenhoff: Create new admin group with root access on WDQS test cluster [puppet] - 10https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) [15:10:38] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239500 (10Papaul) a:05Papaul>03Marostegui Another disk in place [15:10:41] kk [15:11:17] jouncebot: refresh [15:11:18] I refreshed my knowledge about deployments. 
[15:11:19] jouncebot: now [15:11:19] For the next 0 hour(s) and 18 minute(s): Wikibase - Re enable wb_terms things (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1400) [15:11:28] !log Wikibase - Re enable wb_terms things window done [15:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:56] (03PS3) 10Muehlenhoff: Add Stas to wdqs-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) [15:12:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4239504 (10Cmjohnson) @chasemp please let me know network requirements. [15:15:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable detection of changes in moved paragraphs on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436017 (https://phabricator.wikimedia.org/T195375) (owner: 10WMDE-Fisch) [15:15:11] RECOVERY - Host analytics1030 is UP: PING WARNING - Packet loss = 50%, RTA = 66.55 ms [15:15:12] (03CR) 10Dzahn: "tried this query in prod but it would not finish with a couple minutes, so i killed it again and didn't merge this" [puppet] - 10https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [15:18:31] PROBLEM - Host mw2182.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:20:11] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:20:42] (03PS1) 10Ladsgroup: ores: Install hunspell-bs on ores nodes [puppet] - 10https://gerrit.wikimedia.org/r/436033 (https://phabricator.wikimedia.org/T194876) [15:21:08] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 4 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4239549 (10Addshore) The TermSqlIndex::get... [15:24:53] (03PS18) 10Paladox: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) [15:25:11] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:25:25] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation: Rack and setup snapshot1008 - https://phabricator.wikimedia.org/T195385#4239567 (10RobH) [15:25:48] (03PS19) 10Paladox: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) [15:26:08] jouncebot: refresh [15:26:08] I refreshed my knowledge about deployments. [15:26:09] jouncebot: now [15:26:10] For the next 0 hour(s) and 33 minute(s): Wikibase - Re enable wb_terms things (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1400) [15:26:19] (03CR) 10jerkins-bot: [V: 04-1] Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: 10Paladox) [15:26:52] Amir1: mind watching things with me again? 
:) [15:26:59] !log restart hadoop yarn/hdfs daemons to pick up the new zookeeper settings [15:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4239586 (10bd808) >>! In T194186#4191439, @RobH wrote: > Then we would want to know if they will be the labs-support-vlan is the one we'll b... [15:28:19] yup [15:28:50] elukey: https://gerrit.wikimedia.org/r/#/c/436033/ Thank you! [15:29:01] addshore: sure thing [15:29:09] RECOVERY - Host mw2182.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [15:29:41] (03CR) 10Elukey: [C: 032] ores: Install hunspell-bs on ores nodes [puppet] - 10https://gerrit.wikimedia.org/r/436033 (https://phabricator.wikimedia.org/T194876) (owner: 10Ladsgroup) [15:30:08] right, 2 more patches incoming then, *waits for jenkins* [15:30:25] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4239609 (10Papaul) a:05Papaul>03MoritzMuehlenhoff @MoritzMuehlenhoff - update IDRAC and BIOS - clean log - Swap CPU1 with CPU0 Lets see what happen. [15:30:47] (03PS2) 10Addshore: Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 [15:31:01] (03PS3) 10Addshore: Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 [15:31:10] (03CR) 10Alexandros Kosiaris: [C: 031] Release v0.1.0 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436024 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [15:33:27] (03CR) 10Aklapper: [C: 04-1] "Urgh. Thanks for trying! I should have added a comment that I don't know about the performance but then I got distracted. 
:-/" [puppet] - 10https://gerrit.wikimedia.org/r/435984 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [15:35:17] * addshore continues waiting for ci [15:35:55] (03PS20) 10Paladox: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) [15:36:26] (03CR) 10jerkins-bot: [V: 04-1] Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: 10Paladox) [15:38:00] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239681 (10Marostegui) Let's see how this one goes: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 25% complete) ``` [15:39:59] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847#4239721 (10Physikerwelt) [15:40:29] RECOVERY - Device not healthy -SMART- on db2059 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2059&var-datasource=codfw%2520prometheus%252Fops [15:43:48] Amir1: syncing [15:43:56] (just the property suggestor patch) [15:44:01] nice [15:44:01] not the mw-config one :) [15:44:08] It should not affect anything [15:44:18] indeed [15:44:19] as it's disabled (ta-duh) [15:45:05] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/PropertySuggester: [[gerrit:436038|Use CirrusSearch for PropertySuggester]] (duration: 01m 21s) [15:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:25] (03CR) 10Addshore: [C: 032] Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 (owner: 10Addshore) [15:45:32] last one! 
[15:46:43] (03Merged) 10jenkins-bot: Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 (owner: 10Addshore) [15:47:30] (03CR) 10Herron: [C: 032] "Ideally we would let puppetdb http bind both 127.0.0.1 and ::1. However, this is more difficult than it should be as afaict the jetty.ini " [puppet] - 10https://gerrit.wikimedia.org/r/435670 (owner: 10Alex Monk) [15:47:40] (03PS2) 10Herron: puppet DB nginx: Talk to upstream only over IPv4 localhost [puppet] - 10https://gerrit.wikimedia.org/r/435670 (owner: 10Alex Monk) [15:47:59] on mwdebug1002 Amir1 [15:48:44] !log roll restart kafka on kafka-jumbo* to pick up new zookeeper settings [15:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:59] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:49:04] and Amir1 looks like it is working to me [15:49:14] (03CR) 10Jkroll: [C: 031] Enable detection of changes in moved paragraphs on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436017 (https://phabricator.wikimedia.org/T195375) (owner: 10WMDE-Fisch) [15:49:23] addshore: works for me too [15:49:29] * addshore will sync [15:49:42] (03CR) 10jenkins-bot: Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 (owner: 10Addshore) [15:50:48] syncing Amir1 [15:50:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.28 seconds [15:52:02] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:435147|Revert - Dont load PropertySuggester]] T195520 (duration: 01m 19s) [15:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:07] Amir1: done [15:52:07] T195520: Multiple projects reporting Cannot access the database: No working replica DB server - 
https://phabricator.wikimedia.org/T195520 [15:52:14] I'm not being sync'd. the patch is [15:52:21] :P [15:52:47] tendril is clean [15:54:29] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:10] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4239822 (10chasemp) >>! In T194186#4239586, @bd808 wrote: >>>! In T194186#4191439, @RobH wrote: >> Then we would want to know if they will b... [15:56:56] looking at bast4002 [15:57:04] !log really done with wb_terms related syncs now [15:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:30] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:04] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:58] bast4002 hit puppetdb right when it was reloading. re-ran puppet agent without issues [16:04:42] (03PS21) 10Paladox: Planet: Redesgn UI for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) [16:12:39] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0 [16:13:18] (03PS6) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430045 [16:13:56] what [16:14:52] one of our many transit links is down, no big deal [16:16:15] (03CR) 10Brian Wolff: "> @Brian: I don't get the question and how it is related, sorry. Could you elaborate?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: 10Aklapper) [16:20:19] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on a bunch of additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435829 (https://phabricator.wikimedia.org/T195263) (owner: 10Subramanya Sastry) [16:22:02] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T195862#4240002 (10Volans) [16:22:09] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4240006 (10Volans) [16:28:47] (03CR) 10Volans: [V: 032 C: 032] Release v0.1.0 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436024 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [16:31:49] (03CR) 10Dzahn: "thank you! per IRC talk we just had, tabs removed, let's try to let puppet apply the patch to rawdog.py from Debian package in a separate " [puppet] - 10https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: 10Paladox) [16:32:19] !log upgrading blubber to 0.4.0 for integration machines [16:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:03] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4240084 (10Volans) @Cmjohnson is this controller really missing the battery or it's a software problem that is just not recognized? [16:33:26] (03CR) 10Dzahn: [C: 031] Remove at [puppet] - 10https://gerrit.wikimedia.org/r/435171 (owner: 10Muehlenhoff) [16:34:50] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4240108 (10Cmjohnson) @vlolans, it's possible the battery is wrong. I disconnected it during the card upgrade and I may have left the old battery and not replaced with the new one. 
[16:37:14] (PS2) Dzahn: phabricator: Make account names link to their Phab profiles [puppet] - https://gerrit.wikimedia.org/r/435713 (owner: Aklapper)
[16:37:29] (CR) Dzahn: [C: 2] "yep, tested." [puppet] - https://gerrit.wikimedia.org/r/435713 (owner: Aklapper)
[16:39:38] (CR) Dzahn: [C: 2] "@Aklapper: for the current result of this see https://phabricator.wikimedia.org/people/tasks/3266/" [puppet] - https://gerrit.wikimedia.org/r/435713 (owner: Aklapper)
[16:40:55] Operations, ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4231083 (Volans) Actually it seems that this already recovered: `OK: Active: 4, Working: 4, Failed: 0, Spare: 0` ``` ms-be1034 0 ~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6...
[16:44:10] !log roll restart of kafka mirror maker on kafka100[1-3] to pick up new zk settings
[16:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:50] Operations: Broken pinning on some WMCS servers - https://phabricator.wikimedia.org/T195835#4240179 (aborrero) Open>Resolved Probably stalled files. Cleaned up by "rm & puppet run".
[16:49:41] Operations: Broken pinning on some WMCS servers - https://phabricator.wikimedia.org/T195835#4240183 (aborrero) Well, this script: ``` #!/bin/bash SERVERS='labpuppetmaster1001.wikimedia.org labtestpuppetmaster2001.wikimedia.org labstore1004.eqiad.wmnet labstore1005.eqiad.wmnet' FILES='/etc/apt/preferences....
[16:51:57] Operations, ops-eqiad, DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4240202 (Cmjohnson) Support ticket created with HPE Case ID: 5329764075 Case title: Failed Hard Drive Severity 3-Normal Product serial number: MXQ62005Z0 Product number: 767032-B21 Submitted: 5/29/201...
[16:52:30] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:52:33] Operations, ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4240203 (Volans) I now see a SAL entry from @akosiaris: `11:18 akosiaris: powercycling ms-be1034, box is unresposive, tons of logs "sd 0:1:0:1: rejecting I/O to offline device"` So at reboot I guess it somehow...
[16:52:53] we are working on 1031
[16:54:19] Operations, Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4240212 (Aklapper) a:bbogaert>aripstra Assigning to @aripstra for them being as per T100860#4195541, as @bbogaert has left. (Feel free to re-/un-assign.)
[16:54:40] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:56:50] !log bounced analytics1031 switchport to fix weird issue of that host not being able to receive traffic from analytics1001
[16:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:11] Operations, ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4240226 (Cmjohnson) A ticket has been opened with HPE Case ID: 5329764199 Case title: Failed Disk Severity 3-Normal Product serial number: MXQ70601RQ Product number: 719061-B21 Submitted: 5/29/2018 12:56:35 P...
[16:59:57] !log roll restart of kafka mirror maker on kafka-jumbo100* to pick up the new zookeeper settings
[17:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1700).
[17:04:16] (PS11) Volans: debmonitor: add server side puppettization [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299)
[17:05:47] !log bsitzmann@tin Started deploy [mobileapps/deploy@ac4c6be]: Update mobileapps to b2fb793 (T192664)
[17:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:51] T192664: Announce browser extension support for Reading Lists in the apps - https://phabricator.wikimedia.org/T192664
[17:06:44] (CR) Volans: [C: 2] debmonitor: add server side puppettization [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[17:07:57] (PS1) Elukey: Add notes about conf100[1-3] -> conf100[4-6] migration status [puppet] - https://gerrit.wikimedia.org/r/436064 (https://phabricator.wikimedia.org/T182924)
[17:10:17] Operations, ops-codfw, DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4240271 (Marostegui) Open>Resolved All good - thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldriv...
[17:10:20] (CR) Elukey: [C: 2] Add notes about conf100[1-3] -> conf100[4-6] migration status [puppet] - https://gerrit.wikimedia.org/r/436064 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[17:11:02] Operations, ops-eqiad, DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4240277 (Marostegui) Awesome! Thank you!
[17:12:15] !log bsitzmann@tin Finished deploy [mobileapps/deploy@ac4c6be]: Update mobileapps to b2fb793 (T192664) (duration: 06m 28s)
[17:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:19] T192664: Announce browser extension support for Reading Lists in the apps - https://phabricator.wikimedia.org/T192664
[17:19:27] Operations, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4240309 (elukey) a:elukey>None
[17:20:36] Operations, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284850 (elukey) Zookeeper has been moved out as part of https://phabricator.wikimedia.org/T182924, so only etcd is remaining. Removing myself from the task since we'd need to figure out the next steps...
[17:21:33] (PS1) Muehlenhoff: Extend access for pnorman [puppet] - https://gerrit.wikimedia.org/r/436068
[17:23:12] (CR) Muehlenhoff: [C: 2] Extend access for pnorman [puppet] - https://gerrit.wikimedia.org/r/436068 (owner: Muehlenhoff)
[17:23:34] Operations, Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4240326 (Dzahn) I removed dkrysiak but the question remains if the alias should be deleted altogether.
[17:25:51] !log Removing 2FA - T187312
[17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:55] T187312: Need to reconnect my 2-factors authentication for my wiki account - https://phabricator.wikimedia.org/T187312
[17:26:47] !log arlolra@tin Started deploy [parsoid/deploy@e87f54d]: Updating Parsoid to bf3a2fd2
[17:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:25] Operations, ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835#4240349 (MoritzMuehlenhoff) Thanks, I've repooled the sever. I'm keeping an eye on it throughout the week whether it now holds fine.
[17:32:43] !log repooled mw2182 (was down for hardware maintenance)
[17:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:10] herron, thanks for https://gerrit.wikimedia.org/r/#/c/435670/ - was the error also occurring in production?
[17:35:26] Operations, Analytics, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4240357 (MoritzMuehlenhoff) p:Triage>Normal
[17:36:11] Operations, Analytics, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (MoritzMuehlenhoff) @gilles: Since this is a non-sudo change, it needs to only pass the three day waiting period.
[17:36:27] Operations, Discovery, SRE-Access-Requests, Wikidata, and 2 others: Stas needs root access on WDQS test cluster - https://phabricator.wikimedia.org/T195797#4240364 (MoritzMuehlenhoff) p:Triage>Normal
[17:36:36] !log arlolra@tin Finished deploy [parsoid/deploy@e87f54d]: Updating Parsoid to bf3a2fd2 (duration: 09m 48s)
[17:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:28] Operations, CirrusSearch, Discovery, Search-Platform-Programs, Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4240373 (Gehel) Last action based on this issue is tracked...
[17:37:33] Operations, CirrusSearch, Discovery, Search-Platform-Programs, Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4240376 (Gehel) Open>Resolved
[17:39:04] (PS1) Volans: debmonitor: add deployment server hiera [puppet] - https://gerrit.wikimedia.org/r/436070 (https://phabricator.wikimedia.org/T191299)
[17:41:10] (CR) Volans: [C: 2] debmonitor: add deployment server hiera [puppet] - https://gerrit.wikimedia.org/r/436070 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[17:41:39] Krenair: thanks for the patch! yes, that removed a lot of noise from the nginx error log
[17:41:47] cool
[17:52:50] (PS22) Dzahn: Planet: Redesign UI for rawdog, add theme files [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[17:59:07] (PS2) Ottomata: Enable Kafka SSL listener for main-codfw [puppet] - https://gerrit.wikimedia.org/r/434361 (https://phabricator.wikimedia.org/T193778)
[17:59:24] !log beginning rolling restarts of kafka main-codfw to enable SSL listener
[17:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:34] (CR) Ottomata: [V: 2 C: 2] Enable Kafka SSL listener for main-codfw [puppet] - https://gerrit.wikimedia.org/r/434361 (https://phabricator.wikimedia.org/T193778) (owner: Ottomata)
[18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1800)
[18:00:12] (CR) Dzahn: Planet: Redesign UI for rawdog, add theme files (3 comments) [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:00:41] (CR) Paladox: Planet: Redesign UI for rawdog, add theme files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/435327
(https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:01:01] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[18:01:30] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[debmonitor/deploy]
[18:01:59] !log arlolra@tin Started deploy [parsoid/deploy@e87f54d]: Reverting Parsoid to fd49ab4
[18:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:31] (CR) Paladox: "I got bootstrap from https://cgit.kde.org/websites/planet-kde-org.git/tree/website/js/bootstrap.min.js" [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:04:28] !log arlolra@tin Finished deploy [parsoid/deploy@e87f54d]: Reverting Parsoid to fd49ab4 (duration: 02m 29s)
[18:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:43] (PS1) Ottomata: Revert "Enable Kafka SSL listener for main-codfw" [puppet] - https://gerrit.wikimedia.org/r/436075
[18:04:48] (CR) Ottomata: [V: 2 C: 2] Revert "Enable Kafka SSL listener for main-codfw" [puppet] - https://gerrit.wikimedia.org/r/436075 (owner: Ottomata)
[18:05:01] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): Scap_source[debmonitor/deploy]
[18:06:06] this is me, fixing ^^^
[18:06:36] (PS23) Paladox: Planet: Redesign UI for rawdog, add theme files [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498)
[18:07:04] (PS24) Dzahn: Planet: Redesign UI for rawdog, add theme files [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:07:10] RECOVERY - Keyholder SSH agent on deploy2001 is OK: OK: Keyholder is armed with all configured keys.
[18:07:50] (PS1) Volans: debmonitor: add missing scap dependencies [puppet] - https://gerrit.wikimedia.org/r/436076 (https://phabricator.wikimedia.org/T191299)
[18:09:09] (CR) Volans: [C: 2] debmonitor: add missing scap dependencies [puppet] - https://gerrit.wikimedia.org/r/436076 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[18:09:11] (CR) Dzahn: [C: 2] Planet: Redesign UI for rawdog, add theme files [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:09:25] (PS25) Dzahn: Planet: Redesign UI for rawdog, add theme files [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:10:02] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:13:33] (CR) Dzahn: [C: 2] "stretch-only, so noop in prod for now, as expected.
for labs though and then unblocking the upgrade to stretch in prod later" [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:15:38] (CR) Dzahn: [C: 2] "http://planet-hotdog.wmflabs.org/" [puppet] - https://gerrit.wikimedia.org/r/435327 (https://phabricator.wikimedia.org/T180498) (owner: Paladox)
[18:16:05] it seems jquery 1.12 improved performance
[18:16:31] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:17:31] (PS1) Volans: debmonitor: fix path for source code [puppet] - https://gerrit.wikimedia.org/r/436081 (https://phabricator.wikimedia.org/T191299)
[18:18:24] (CR) Volans: [C: 2] debmonitor: fix path for source code [puppet] - https://gerrit.wikimedia.org/r/436081 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[18:18:51] clear
[18:18:54] (CR) Dzahn: "adding the Hiera call in a role class will make jenkins bot vote -1 but of course i dont want to ask for a complete refactoring of the rol" [puppet] - https://gerrit.wikimedia.org/r/435814 (owner: Alex Monk)
[18:21:20] (CR) Dzahn: "yes, capitalization is relevant and a common cause of issues, since auth_ldap ignores it but Icinga internally does care, so you can be lo" [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: ArielGlenn)
[18:21:59] !log volans@tin Started deploy [debmonitor/deploy@e2efb6b]: Initial sync
[18:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:34] !log volans@tin Finished deploy [debmonitor/deploy@e2efb6b]: Initial sync (duration: 00m 35s)
[18:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:52] (PS4) Dzahn: icinga: add addshore and ladsgroup to wikidata contact group [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289)
(owner: ArielGlenn)
[18:30:49] (CR) Dzahn: [C: 2] "- removed aude as suggested" [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: ArielGlenn)
[18:30:56] (PS5) Dzahn: icinga: add addshore and ladsgroup to wikidata contact group [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: ArielGlenn)
[18:33:41] Operations, Analytics, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (JAllemandou) @Gilles : Feel free to ping when you're in if you want some help on the data or the way to play with it.
[18:34:01] (PS3) Dzahn: deployment-prep: Replace etcd cert after puppetmaster change [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686) (owner: Alex Monk)
[18:34:53] (CR) Dzahn: [C: 2] deployment-prep: Replace etcd cert after puppetmaster change [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686) (owner: Alex Monk)
[18:35:39] (CR) Dzahn: [C: 2] "per: already cherry-picked. thanks for fixing deployment-prep" [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686) (owner: Alex Monk)
[18:38:27] thanks mutante
[18:46:17] Operations, Ops-Access-Reviews, SRE-Access-Requests, Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4240614 (bmansurov) @RobH I'm unable to send emails using the following command because sudo is asking for my passw...
[18:48:08] (CR) Gehel: [C: 1] "LGTM, pending approval in weekly SRE meeting" [puppet] - https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[18:48:19] (CR) Gehel: "LGTM, pending approval in weekly SRE meeting" [puppet] - https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[18:49:17] Operations, Ops-Access-Reviews, SRE-Access-Requests, Patch-For-Review: Requesting access to terbium/maintenance-log-readers for bmansurov - https://phabricator.wikimedia.org/T189285#4037542 (Krenair) You'd need to be in the restricted group to run that. maintenance-log-readers cannot.
[18:52:24] (CR) Gergő Tisza: "Seems to work fine:" [mediawiki-config] - https://gerrit.wikimedia.org/r/434901 (https://phabricator.wikimedia.org/T185087) (owner: Gergő Tisza)
[18:53:21] (PS1) Volans: Fix deploy makefile [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/436083 (https://phabricator.wikimedia.org/T191299)
[18:53:55] (CR) Volans: [V: 2 C: 2] "Tested on the host." [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/436083 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[18:55:12] !log volans@tin Started deploy [debmonitor/deploy@fd06bd3]: Initial sync (2)
[18:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:54] !log volans@tin Finished deploy [debmonitor/deploy@fd06bd3]: Initial sync (2) (duration: 01m 42s)
[18:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:22] (CR) Legoktm: "My bad, I didn't realize that T95714 had been resolved. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/434901 (https://phabricator.wikimedia.org/T185087) (owner: Gergő Tisza)
[19:00:05] thcipriani: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T1900).
[19:04:30] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[19:07:09] mutante: ^^^^
[19:07:42] looking as well
[19:10:14] Error: Could not find any contact matching 'ladsgroup' (config file '/etc/icinga/contactgroups.cfg', starting on line 52)
[19:10:28] I think it’s just case mismatch in https://gerrit.wikimedia.org/r/#/c/434479/5/modules/nagios_common/files/contactgroups.cfg
[19:16:00] Operations, Analytics, Performance-Team, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4240743 (Nuria) Approved, please by all means use hadoop.
[19:16:02] (PS1) Herron: icinga: update contactgroups to reference Ladsgroup with capital L [puppet] - https://gerrit.wikimedia.org/r/436089 (https://phabricator.wikimedia.org/T195289)
[19:16:40] !log cutting branch for wmf.6, will not deploy wmf.6 as wmf.5 is not currently on group2 as the train is blocked on T195514 which is blocked on T195868 which is blocked on T195906
[19:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:47] T195514: Can't copy and paste a list on office.wiki page in the visual editor - https://phabricator.wikimedia.org/T195514
[19:16:47] T195868: CI failures in php70: Undefined constant 'Wikibase\Client\Store\NS_MAIN' - https://phabricator.wikimedia.org/T195868
[19:16:47] T195906: Phan fails for Wikibase on wmf.x branches - https://phabricator.wikimedia.org/T195906
[19:17:03] (CR) Herron: [C: 2] icinga: update contactgroups to reference Ladsgroup with capital L [puppet] - https://gerrit.wikimedia.org/r/436089 (https://phabricator.wikimedia.org/T195289) (owner: Herron)
[19:20:16] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[19:20:30] correctness corrected
[19:34:43] (PS1) Gilles: Add gilles to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/436093 (https://phabricator.wikimedia.org/T195837)
[19:36:19] herron: thanks, you beat me to it. i knew about this but lost internet for a little
[19:36:40] np!
[19:37:15] it's always case-sensitive ..each time we run into it.. see comments on original change.. duh :)
[19:37:16] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[19:38:45] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.03 seconds
[19:40:36] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.85 seconds
[19:41:10] (CR) Dzahn: [C: 2] "see https://gerrit.wikimedia.org/r/#/c/436089/ :p" [puppet] - https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: ArielGlenn)
[19:47:12] (PS1) Volans: Fix path for the submodule [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/436096 (https://phabricator.wikimedia.org/T191299)
[19:47:34] (CR) Volans: [V: 2 C: 2] Fix path for the submodule [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/436096 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[19:47:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 34 probes of 322 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[19:48:19] !log volans@tin Started deploy [debmonitor/deploy@361c94a]: Initial sync (3)
[19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:06] PROBLEM - MariaDB Slave SQL: m2 on db1117 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1049, Errmsg: Error Unknown database debmonitor on query. Default database: debmonitor. [Query snipped]
[19:51:48] volans_: you saw the part about unknown database?
[19:52:36] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 322 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[19:54:18] mutante: yeah, checking
[19:54:37] alright, cool
[19:57:45] (PS1) Ottomata: Kafka - set super.users even if auth acls is not enabled [puppet] - https://gerrit.wikimedia.org/r/436097 (https://phabricator.wikimedia.org/T193778)
[19:59:26] PROBLEM - MariaDB Slave Lag: m2 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 660.54 seconds
[20:02:18] (CR) Ottomata: [C: 2] "No-op in prod" [puppet] - https://gerrit.wikimedia.org/r/436097 (https://phabricator.wikimedia.org/T193778) (owner: Ottomata)
[20:03:55] RECOVERY - MariaDB Slave Lag: m2 on db1117 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
[20:04:15] RECOVERY - MariaDB Slave SQL: m2 on db1117 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:08:11] (PS1) Ottomata: Kafka vary ssl_client_auth only if auth_acls_enabled is true [puppet] - https://gerrit.wikimedia.org/r/436103 (https://phabricator.wikimedia.org/T193778)
[20:10:44] (PS2) Ottomata: Kafka vary ssl_client_auth only if auth_acls_enabled is true [puppet] - https://gerrit.wikimedia.org/r/436103 (https://phabricator.wikimedia.org/T193778)
[20:13:30] (CR) Ottomata: [C: 2] "no op" [puppet] - https://gerrit.wikimedia.org/r/436103 (https://phabricator.wikimedia.org/T193778) (owner: Ottomata)
[20:16:45] (Draft1) Paladox: Planet: Apply a rawdog patch that allows rawdog to split index.html into multiple files [puppet] - https://gerrit.wikimedia.org/r/436099
[20:16:48] (PS2) Paladox: Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099
[20:17:14] (PS3) Paladox: Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099
[20:17:19] (CR) jerkins-bot: [V: -1] Planet: Some customisations for rawdog through patch file
[puppet] - https://gerrit.wikimedia.org/r/436099 (owner: Paladox)
[20:17:42] (CR) jerkins-bot: [V: -1] Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099 (owner: Paladox)
[20:17:58] (PS4) Paladox: Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099
[20:31:49] (PS2) Bstorm: toolforge: refactor some python in maintain-kubeusers [puppet] - https://gerrit.wikimedia.org/r/434752
[20:33:29] (CR) Bstorm: [C: 2] toolforge: refactor some python in maintain-kubeusers [puppet] - https://gerrit.wikimedia.org/r/434752 (owner: Bstorm)
[20:38:32] !log volans@tin Finished deploy [debmonitor/deploy@361c94a]: Initial sync (3) (duration: 50m 13s)
[20:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:19] (PS1) Ottomata: Kafka - Don't manage Cluster ACLs [puppet] - https://gerrit.wikimedia.org/r/436165 (https://phabricator.wikimedia.org/T193778)
[20:41:31] (PS1) Herron: puppetdb: set jetty.ini host = 127.0.0.1 [puppet] - https://gerrit.wikimedia.org/r/436166
[20:43:32] (CR) Herron: "this will cause prod puppetdbs to restart.
it should be deployed with agents temporarily disabled across the fleet to avoid an alert showe" [puppet] - https://gerrit.wikimedia.org/r/436166 (owner: Herron)
[20:44:27] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11314/" [puppet] - https://gerrit.wikimedia.org/r/436165 (https://phabricator.wikimedia.org/T193778) (owner: Ottomata)
[20:45:16] Operations, Analytics, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): Requesting access to analytics-privatedata-users for gilles - https://phabricator.wikimedia.org/T195837#4238806 (Imarlier)
[20:46:17] (PS1) Urbanecm: Set wgProofreadPagePageSeparator to '' for jawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/436168 (https://phabricator.wikimedia.org/T195873)
[20:49:07] (PS1) Urbanecm: Set wgProofreadPagePageSeparator='' on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/436169 (https://phabricator.wikimedia.org/T194875)
[20:52:03] (CR) Smalyshev: Add Stas to wdqs-test-roots group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[20:53:03] (CR) Smalyshev: [C: 1] Create new admin group with root access on WDQS test cluster [puppet] - https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[20:53:28] (PS1) Ottomata: Enable Kafka SSL listener for main-codfw [puppet] - https://gerrit.wikimedia.org/r/436171
[20:53:42] (CR) Smalyshev: "Does this one also need approval? It only creates empty group but does not add anybody to it yet."
[puppet] - https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[20:54:06] (CR) Smalyshev: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/436014 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[20:54:21] (CR) Smalyshev: [C: 1] "Does this one also need approval? It only creates empty group but does not add anybody to it yet." [puppet] - https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[20:57:28] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4241182 (Pnorman) `tail-kartotherian` shows a undefined symbol error. The best way to reproduce this is `cd /srv/deployment/kartotherian/deploy-cache/current && no...
[20:59:28] (CR) Muehlenhoff: "It doesn't need meeting approval per se , but in the (unlikely case) the access request is denied, we'd pile up unused groups, so we usual" [puppet] - https://gerrit.wikimedia.org/r/436013 (https://phabricator.wikimedia.org/T195797) (owner: Muehlenhoff)
[21:01:32] James_F: your change is live on mwdebug1002, check please
[21:01:43] Looking.
[21:02:38] thcipriani: Yup, LGTM.
[21:02:56] awesome, going live
[21:06:07] !log thcipriani@tin Synchronized php-1.32.0-wmf.5/extensions/VisualEditor/lib/ve: SWAT: [[gerrit:436049|Update VE core submodule to wmf/1.32.0-wmf.5 HEAD (9032a90ca)]] T195514 (duration: 01m 23s)
[21:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:12] T195514: Can't copy and paste a list on office.wiki page in the visual editor - https://phabricator.wikimedia.org/T195514
[21:06:14] James_F: ^ live now
[21:06:30] Thanks, thcipriani!
[21:06:38] thank you!
[21:23:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1976 bytes in 0.152 second response time
[21:24:07] Operations, Citoid, Code-Stewardship-Reviews, VisualEditor, Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4241327 (greg)
[21:24:10] (PS1) Thcipriani: All wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - https://gerrit.wikimedia.org/r/436179
[21:26:09] (CR) Thcipriani: [C: 2] All wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - https://gerrit.wikimedia.org/r/436179 (owner: Thcipriani)
[21:28:02] (Merged) jenkins-bot: All wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - https://gerrit.wikimedia.org/r/436179 (owner: Thcipriani)
[21:29:07] okay, dispatch lag is high
[21:29:42] (CR) jenkins-bot: All wikis to 1.32.0-wmf.5 refs T191051 [mediawiki-config] - https://gerrit.wikimedia.org/r/436179 (owner: Thcipriani)
[21:31:08] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.32.0-wmf.5
[21:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:15] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.36 seconds
[21:40:52] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4241369 (Dzahn) {F18602976} rawdog.py on stretch after applying " patch < rawdog-2.22-patch-multiple-pages-paladox1-unified.diff " from P177 h...
[21:49:32] (CR) Dzahn: "file { '/usr/lib/hadoop/lib/logstash-gelf.jar':" [puppet] - https://gerrit.wikimedia.org/r/436099 (owner: Paladox)
[21:50:12] (PS5) Paladox: Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099
[21:50:23] (CR) Dzahn: "let's build our own .deb instead of having puppet patch the file" [puppet] - https://gerrit.wikimedia.org/r/436099 (owner: Paladox)
[21:50:29] (Abandoned) Paladox: Planet: Some customisations for rawdog through patch file [puppet] - https://gerrit.wikimedia.org/r/436099 (owner: Paladox)
[21:52:45] Operations, ops-eqiad, Traffic: rack/setup/install cp1075-cp1091 - https://phabricator.wikimedia.org/T195923#4241397 (RobH) p:Triage>Normal
[21:53:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.060 second response time
[21:53:47] Operations, ops-eqiad, Traffic: rack/setup/install cp1075-cp1091 - https://phabricator.wikimedia.org/T195923#4241414 (RobH)
[21:58:08] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4241426 (Dzahn) PS2: amended to {F18603051}
[22:04:59] Operations, ops-eqiad, Traffic: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923#4241450 (RobH)
[22:06:17] Operations, ops-eqiad, Traffic: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923#4241397 (RobH)
[22:09:08] !log boron - apt-get build-dep rawdog (installed libtidy5 python-feedparser python-tidylib
[22:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:06] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.067 second response time
[22:17:43] Operations,
10Maps-Sprint, 10Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4241483 (10Pnorman) I'm working on the scap3 stuff, but leaving the issue assigned to @Gehel since he's still got the Cassandra/etc stuff. Just to note so it doesn'... [22:22:07] What is "# filtertags: labs-project-deployment-prep" in a puppet manifest? [22:22:16] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.063 second response time [22:22:35] (03PS2) 10Alex Monk: role::mail::mx: Permit changing certificate [puppet] - 10https://gerrit.wikimedia.org/r/435814 [22:23:11] (03CR) 10jerkins-bot: [V: 04-1] role::mail::mx: Permit changing certificate [puppet] - 10https://gerrit.wikimedia.org/r/435814 (owner: 10Alex Monk) [22:23:14] dammit jenkins [22:25:50] (03PS3) 10Alex Monk: role::mail::mx: Permit changing certificate [puppet] - 10https://gerrit.wikimedia.org/r/435814 [22:27:54] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4241512 (10Paladox) This is the PS3 amended version that works: {F18603722} [22:29:37] ^ Can someone run puppet compiler on that change for me please? [22:58:25] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180529T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:04:28] anyone? 
[23:04:53] paladox: https://people.wikimedia.org/~dzahn/rawdog/
[23:05:02] * paladox tests
[23:05:39] Krenair: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/11315/console
[23:06:07] Reedy: thank you
[23:06:13] mutante works!
[23:08:16] great :)
[23:08:32] Reedy, oh it's doing random nodes
[23:08:46] Do you need to tell it specifics? :P
[23:08:59] can I get mx1001.wikimedia.org and mx2001.wikimedia.org ?
[23:10:08] !log pnorman@tin Started deploy [kartotherian/deploy@2b75c93]: Deploy test of stretch build of Kartotherian to test2004
[23:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:31] !log pnorman@tin Finished deploy [kartotherian/deploy@2b75c93]: Deploy test of stretch build of Kartotherian to test2004 (duration: 00m 23s)
[23:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:36] https://puppet-compiler.wmflabs.org/compiler02/11316/
[23:11:42] cheers
[23:12:46] (CR) Alex Monk: "Reedy ran puppet compiler for me and I believe that this shows it's basically a no-op in prod: https://puppet-compiler.wmflabs.org/compile" [puppet] - https://gerrit.wikimedia.org/r/435814 (owner: Alex Monk)
[23:26:36] (PS1) Dmaza: Enable $wgCookieSetOnIpBlock on test wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/436193 (https://phabricator.wikimedia.org/T195930)
[23:38:17] (PS2) Subramanya Sastry: Enable RemexHtml on a bunch of additional wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/435829 (https://phabricator.wikimedia.org/T195263)
[23:44:37] (CR) Jforrester: [C: 1] "Well done srwiktionary." [mediawiki-config] - https://gerrit.wikimedia.org/r/435829 (https://phabricator.wikimedia.org/T195263) (owner: Subramanya Sastry)