[00:00:27] Operations, ops-ulsfo, Traffic, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3739332 (RobH) They advised we needed to open a support case, so I did so, 956261134. They're following up and will let us know.
[00:30:59] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Service[uwsgi],Exec[bump nf_conntrack hash table size],Service[ganglia-monitor],Exec[eth0_v6_token]
[00:49:09] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 32.96 seconds
[00:55:59] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:02:10] (PS1) Ayounsi: Add non-secret password for netbox DB replication [labs/private] - https://gerrit.wikimedia.org/r/389654
[01:03:15] (CR) Ayounsi: [V: 2 C: 2] Add non-secret password for netbox DB replication [labs/private] - https://gerrit.wikimedia.org/r/389654 (owner: Ayounsi)
[01:04:18] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1510016648 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3770073 keys, up 4 minutes 6 seconds - replication_delay is 1510016648
[01:04:19] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1510016650 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3773527 keys, up 4 minutes 7 seconds - replication_delay is 1510016650
[01:04:39] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1510016677 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3767791 keys, up 4 minutes 34 seconds - replication_delay is 1510016677
[01:05:39] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases
(db0) with 3765099 keys, up 5 minutes 34 seconds - replication_delay is 0
[01:06:18] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3767569 keys, up 6 minutes 8 seconds - replication_delay is 0
[01:06:19] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3769333 keys, up 6 minutes 8 seconds - replication_delay is 0
[01:07:24] (PS1) Chad: Swap git.wikimedia.org -> phabricator.wikimedia.org [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/389655 (https://phabricator.wikimedia.org/T139089)
[01:17:31] (PS13) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)
[01:18:05] (CR) jerkins-bot: [V: -1] [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: Ayounsi)
[01:20:41] (PS14) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)
[01:30:39] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.)
timed o
[01:30:40] e was received
[01:31:38] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[01:33:41] (PS1) Ayounsi: PostgresSQL, add support for stretch and PG9.6 [puppet] - https://gerrit.wikimedia.org/r/389657
[01:51:09] (PS15) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)
[02:23:15] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 07m 19s)
[02:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 7 02:29:54 UTC 2017 (duration 6m 39s)
[02:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:31] Operations, Traffic, Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#3739681 (Nuria) This ticket should be a talk, really.
[03:25:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 801.06 seconds
[03:56:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.74 seconds
[04:38:54] Operations, Ops-Access-Requests, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3739710 (Dzahn)
[04:39:50] (PS9) TerraCodes: Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[04:45:39] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2021460
[05:11:25] Operations, Ops-Access-Requests, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734285 (Dzahn) This assumes that all users have full root (sudo ALL ALL) on the servers the group is applied to, right?
[05:14:29] (PS1) Dzahn: admins: create perf-team group with gilles, krinkle [puppet] - https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728)
[05:16:09] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2000092
[05:17:49] Operations, Ops-Access-Requests, Performance-Team: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3739724 (Dzahn) @Krinkle Now that i saw the other ticket to create "perf-team" i understand better. Gotcha!
[05:42:13] Operations, Deployments, Beta-Cluster-reproducible, HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3739730 (Krinkle) >>! In T146285#3728364, @hashar wrote: > @Krinkle pointed at T145819, but I don't th...
[06:19:17] !log Deploy alter table on s7 codfw master (db2029) with replication, this will cause lag in codfw - T174569
[06:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:25] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:20:22] Operations, Gerrit, Readers-Web-Backlog, Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3739755 (phuedx) >>! In T178189#3698691, @MoritzMuehlenhoff wrote: >> But the proper fix would be to run this service on stretch <...
[06:24:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[06:24:49] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:28:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[06:28:58] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:39:10] as usual for my mornings, mediawiki.org loads very slowly for me again
[06:49:36] Nikerabbit: have a look at https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue then report your findings to the netops team
[06:49:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[06:50:18] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:50:40] p858snake: it's not connectivity
[06:51:02] https://phabricator.wikimedia.org/T179903
[06:51:25] (CR) Ema: [C: 2] cache: send varnish logs to logstash [puppet] - https://gerrit.wikimedia.org/r/389515 (https://phabricator.wikimedia.org/T63782) (owner: Ema)
[06:51:31] (PS2) Ema: cache: send varnish logs to logstash [puppet] - https://gerrit.wikimedia.org/r/389515 (https://phabricator.wikimedia.org/T63782)
[06:52:43] (CR) Ema: [V: 2 C: 2] cache: send varnish logs to logstash [puppet] - https://gerrit.wikimedia.org/r/389515 (https://phabricator.wikimedia.org/T63782) (owner: Ema)
[06:53:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[06:54:18] RECOVERY - Router interfaces on cr1-eqord is OK:
OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:00:50] Operations, Traffic, Wikimedia-Logstash, Patch-For-Review: Add varnish logs to logstash - https://phabricator.wikimedia.org/T63782#3739791 (ema) Open>Resolved a:ema [[https://logstash.wikimedia.org/app/kibana#/discover/188b07b0-c389-11e7-a44b-9b945870b167?_g=(refreshInterval%3A(displa...
[07:02:06] (CR) Krinkle: [C: 1] "LGTM. It is indeed the intent that this is still an admin group on the hosts it applies to." [puppet] - https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: Dzahn)
[07:11:08] !log reboot pinkunicorn for kernel (4.9.51) and openssl (1.0.2m) upgrades
[07:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:02] !log mobrovac@tin Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restarting - T174916
[07:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:11] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[07:24:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:24:48] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:32:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:33:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:48:38] Operations, Gerrit, Readers-Web-Backlog, Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3739864 (MoritzMuehlenhoff) Our scb* cluster currently runs jessie.
I don't know the time frame for the new setup, but running the...
[08:02:15] (CR) Dzahn: "thank you for this, just can we please replace the "if $realm" with a Hiera lookup? Set the value of the proxy variable in Hiera and skip " [puppet] - https://gerrit.wikimedia.org/r/389492 (owner: Paladox)
[08:07:24] (CR) Dzahn: "if ${planet::planet_http_proxy} is set at all, then use the whole proxy line, otherwise don't" [puppet] - https://gerrit.wikimedia.org/r/389492 (owner: Paladox)
[08:12:17] (CR) Muehlenhoff: [C: -1] admins: create perf-team group with gilles, krinkle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: Dzahn)
[08:22:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[08:23:28] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:24:28] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:24:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[08:32:35] (PS6) Filippo Giunchedi: smart: add ensure metaparameter [puppet] - https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552)
[08:33:30] (CR) Filippo Giunchedi: [C: 2] smart: add ensure metaparameter [puppet] - https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[08:36:02] (PS1) ArielGlenn: switch dumps monitor to read status from and write results to dumpsdata host [puppet] - https://gerrit.wikimedia.org/r/389667 (https://phabricator.wikimedia.org/T179857)
[08:37:47] Operations, Gerrit, Readers-Web-Backlog, Patch-For-Review, and 2 others:
[subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3683839 (mobrovac) While migrating SCB nodes to stretch will need to happen, I don't think we will have the bandwidth to do so soo...
[08:41:49] Operations, netops: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3739933 (MoritzMuehlenhoff)
[08:41:59] Operations, netops: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3671479 (MoritzMuehlenhoff) p:Triage>Normal
[08:45:20] zhuyifei1999_: all good, the issue was 'patched' in another way, your bot is completely fine to process, sorry for the interruption!
[08:45:57] np
[08:48:19] !log Optimize pagelinks and templatelinks on s7 master - db1062 - T174509
[08:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:26] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[08:49:13] !log Run redact_sanitarium for hifwiktionary on db1095 (sanitarium) - T173647
[08:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:19] T173647: Prepare and check storage layer for hif.wiktionary - https://phabricator.wikimedia.org/T173647
[08:51:57] (CR) Hashar: Install php5.5-iconv, php5.5-xml and php5.5-zip (2 comments) [puppet] - https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: Umherirrender)
[08:52:24] (PS2) Hashar: Install php5.5-iconv, php5.5-xml and php5.5-zip [puppet] - https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: Umherirrender)
[08:53:18] (PS3) Hashar: Install php5.5-xml and php5.5-zip [puppet] - https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: Umherirrender)
[08:53:27] <_joe_> php5.5?
[08:53:37] <_joe_> aren't we dismissing it in like 3 months
[08:54:14] that's for CI, it uses co-installable packages for regression tests
[08:54:37] (for the older branches which are still supported)
[08:57:31] (PS1) Marostegui: db-codfw,db-eqiad.php: Remove db1038 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/389670 (https://phabricator.wikimedia.org/T177911)
[08:57:37] (CR) Hashar: [C: 1] "I have cherry picked this change on the CI puppet master to try out on a dummy jessie instance." [puppet] - https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: Umherirrender)
[08:59:07] (PS1) Marostegui: s3.hosts: Remove db1038 [software] - https://gerrit.wikimedia.org/r/389671 (https://phabricator.wikimedia.org/T177911)
[08:59:27] (CR) Marostegui: [C: 2] db-codfw,db-eqiad.php: Remove db1038 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/389670 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:00:38] (CR) Marostegui: [C: 2] s3.hosts: Remove db1038 [software] - https://gerrit.wikimedia.org/r/389671 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:00:45] (Merged) jenkins-bot: db-codfw,db-eqiad.php: Remove db1038 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/389670 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:00:51] (CR) jenkins-bot: db-codfw,db-eqiad.php: Remove db1038 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/389670 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:01:17] (Merged) jenkins-bot: s3.hosts: Remove db1038 [software] - https://gerrit.wikimedia.org/r/389671 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:01:38] !log ppchelko@tin Started deploy [cpjobqueue/deploy@3db0cc4]: Do not set Host header for requests to jobrunner
[09:01:44] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1038 from config - T177911 (duration: 00m 47s)
[09:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:11] T177911: Decommission db1038 - https://phabricator.wikimedia.org/T177911
[09:02:28] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@3db0cc4]: Do not set Host header for requests to jobrunner (duration: 00m 49s)
[09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1038 from config - T177911 (duration: 00m 45s)
[09:03:01] (PS2) Dzahn: admins: create perf-team group with gilles, krinkle [puppet] - https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728)
[09:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:10] (CR) Dzahn: "ok, done" [puppet] - https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: Dzahn)
[09:05:26] (PS1) Marostegui: mariadb: Get ready to decommission db1038 [puppet] - https://gerrit.wikimedia.org/r/389672 (https://phabricator.wikimedia.org/T177911)
[09:09:59] (PS2) Marostegui: mariadb: Get ready to decommission db1038 [puppet] - https://gerrit.wikimedia.org/r/389672 (https://phabricator.wikimedia.org/T177911)
[09:11:34] !log installing java security updates/restarting cassandra on restbase2002
[09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:24] (CR) Marostegui: [C: 2] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/8657/" [puppet] - https://gerrit.wikimedia.org/r/389672 (https://phabricator.wikimedia.org/T177911) (owner: Marostegui)
[09:13:59] !log Stop MySQL on db1038 - host to be decommissioned - T177911
[09:14:04] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:04] T177911: Decommission db1038 - https://phabricator.wikimedia.org/T177911
[09:15:11] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1038 - https://phabricator.wikimedia.org/T177911#3740010 (Marostegui) This host is fully ready to be decommissioned by @Cmjohnson
[09:15:24] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1038 - https://phabricator.wikimedia.org/T177911#3740015 (Marostegui) a:Marostegui>Cmjohnson
[09:16:17] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3740016 (Marostegui)
[09:16:30] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3740017 (Marostegui)
[09:16:40] Operations, Gerrit, Readers-Web-Backlog, Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3740018 (phuedx) >>! In T178189#3739931, @mobrovac wrote: > (Also, this is highly off-topic for this ticket, we should probably cr...
[09:16:45] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1035 - https://phabricator.wikimedia.org/T176931#3740019 (Marostegui)
[09:17:04] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3740020 (Marostegui)
[09:17:18] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3740021 (Marostegui)
[09:17:32] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3740022 (Marostegui)
[09:17:33] (CR) Gehel: "This seems to be a reasonable approach.
My understanding of scap is limited, but this looks correct to me." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/389550 (owner: EBernhardson)
[09:17:48] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3740023 (Marostegui)
[09:18:03] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3740024 (Marostegui)
[09:21:47] ACKNOWLEDGEMENT - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] daniel_zahn https://phabricator.wikimedia.org/T179156
[09:21:47] ACKNOWLEDGEMENT - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] daniel_zahn https://phabricator.wikimedia.org/T179156
[09:21:47] ACKNOWLEDGEMENT - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T179156
[09:21:47] ACKNOWLEDGEMENT - ores on scb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time daniel_zahn https://phabricator.wikimedia.org/T179156
[09:22:35] andrewbogott: PROCS CRITICAL: 2 processes with regex args '^/usr/bin/pytho[n] /usr/bin/nova-compute'
[09:22:53] looks like the common "sometimes there can be 2 instead of 1" proc check issue
[09:24:32] chasemp: any issues with https://gerrit.wikimedia.org/r/#/q/topic:icinga-paging+(status:open) and getting that merged?
[09:30:03] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2089 in s6 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389674 (https://phabricator.wikimedia.org/T178359)
[09:31:18] anyone know what the default object size in a gerrit repo should be
[09:31:33] the object size of mediawiki/services/chromium-render/deploy seems to be 100 bytes :[
[09:31:58] (yes, i really did mean bytes)
[09:33:18] Operations, Gerrit, Readers-Web-Backlog, Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3683839 (akosiaris) >>! In T178189#3739864, @MoritzMuehlenhoff wrote: > Our scb* cluster currently runs jessie. I don't know the t...
[09:36:15] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/389674 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:36:48] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db2089 in s6 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389674 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:38:04] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2089 in s6 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389674 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:38:13] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2089 in s6 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389674 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:39:17] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2089 as recentchanges multi-instance host on s6 and s5(s8) T178359 (duration: 00m 45s)
[09:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:23] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:40:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2089 as recentchanges multi-instance
host on s6 and s5(s8) T178359 (duration: 00m 46s)
[09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:33] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 22 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[hpssacli],Package[openssl],Package[ca-certificates],Package[initramfs-tools]
[09:46:11] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3740050 (faidon) I wouldn't recommend reviving MgOpen for basically the reasons I described in [[ https://bugs.debian.org/819026 | #...
[09:47:18] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2086 in s5 and s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/389676 (https://phabricator.wikimedia.org/T178359)
[09:48:20] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2086 in s5 and s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/389676 (https://phabricator.wikimedia.org/T178359)
[09:51:33] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:55:44] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/389676 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:56:42] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db2086 in s5 and s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/389676 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:58:07] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2086 in s5 and s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/389676 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:58:16] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2086 in s5 and s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/389676
(https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:58:18] !log updating certspotter to 0.5 in apt and tegmen/einsteinium
[09:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2086 as recentchanges multi-instance host on s5 and s7 T178359 (duration: 00m 45s)
[09:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:22] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:59:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 42 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:00:52] (CR) Alexandros Kosiaris: [C: 2] PostgresSQL, add support for stretch and PG9.6 [puppet] - https://gerrit.wikimedia.org/r/389657 (owner: Ayounsi)
[10:00:55] (PS2) Alexandros Kosiaris: PostgresSQL, add support for stretch and PG9.6 [puppet] - https://gerrit.wikimedia.org/r/389657 (owner: Ayounsi)
[10:00:57] (CR) Alexandros Kosiaris: [V: 2 C: 2] PostgresSQL, add support for stretch and PG9.6 [puppet] - https://gerrit.wikimedia.org/r/389657 (owner: Ayounsi)
[10:04:54] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:09:36] (CR) Alexandros Kosiaris: [C: -1] "May I also suggest avoiding the pain that will inevitably come with a path that is named MjoLniR and go for mjolnir instead ?
It's not tha" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/389550 (owner: EBernhardson)
[10:09:48] !log reboot of maps codfw cluster for upgrades
[10:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:10] (PS3) Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - https://gerrit.wikimedia.org/r/389492
[10:13:25] (PS4) Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - https://gerrit.wikimedia.org/r/389492
[10:13:53] (CR) jerkins-bot: [V: -1] planet: Doin't set http_proxy or https_proxy if on labs [puppet] - https://gerrit.wikimedia.org/r/389492 (owner: Paladox)
[10:14:33] (PS5) Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - https://gerrit.wikimedia.org/r/389492
[10:14:42] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2091 in s2 and s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359)
[10:16:44] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2091 in s2 and s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359)
[10:19:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2086 as recentchanges multi-instance host on s5 and s7 T178359 (duration: 00m 45s)
[10:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:06] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[10:24:28] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:24:48] !log create staging database on db1108 (researchers scratch pad) - T177405
[10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:56] T177405: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405
[10:25:04]
(CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db2091 in s2 and s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:26:13] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2091 in s2 and s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:27:07] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2091 in s2 and s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/389679 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:27:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2091 as recentchanges multi-instance host on s2 and s4 T178359 (duration: 00m 46s)
[10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:34] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[10:28:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2091 as recentchanges multi-instance host on s2 and s4 T178359 (duration: 00m 45s)
[10:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:38] (PS2) Ema: 5.1.3-1wm2: backport 'record-prefix support for varnishncsa' [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/389516
[10:33:46] (CR) Addshore: Add ::statistics::wmde::wdcm (3 comments) [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[10:34:00] (PS13) Addshore: Add ::statistics::wmde::wdcm [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258)
[10:34:20] (CR) Addshore: "> Yeah, until WMF/WMDE has a CRAN mirror we can't install packages via Puppet in production, just labs VMs.
We've been able to side-step i" [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[10:34:25] (Draft1) Paladox: planet: Increase numthreads to 3 [puppet] - https://gerrit.wikimedia.org/r/389681
[10:34:27] (PS2) Paladox: planet: Increase numthreads to 3 [puppet] - https://gerrit.wikimedia.org/r/389681
[10:34:30] (CR) jerkins-bot: [V: -1] Add ::statistics::wmde::wdcm [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[10:34:54] Operations, Discovery, Discovery-Analysis, WMDE-Analytics-Engineering, and 2 others: Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995#3740493 (Addshore)
[10:41:46] !log restarting ntpd on dns recursors to pick up openssl update
[10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:06] (PS3) Addshore: Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - https://gerrit.wikimedia.org/r/381193
[10:43:21] (CR) Dzahn: [C: 2] "will replace this with a "global settings" file so that we don't have to change these things for each language, but for now we can just go" [puppet] - https://gerrit.wikimedia.org/r/389681 (owner: Paladox)
[10:45:46] (CR) GoranSMilovanovic: [C: 1] Add ::statistics::wmde::wdcm [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore)
[10:49:10] (CR) Addshore: [C: -1] Add loading of wikibase extensions from build (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[10:49:14] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[10:49:15] (CR) GoranSMilovanovic: [C: 1] "@Addshore I guess you were trying to include the r_lang class again..."
[puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [10:50:14] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.000146 secs [10:50:32] (03PS6) 10Addshore: Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) [10:51:00] (03PS3) 10Ema: 5.1.3-1wm2: backport 'record-prefix support for varnishncsa' [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 [10:52:11] (03CR) 10Addshore: [C: 031] "Not used anywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 (owner: 10Addshore) [10:52:55] (03CR) 10Addshore: [C: 031] Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [10:53:25] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db2092 in s1 and s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389683 (https://phabricator.wikimedia.org/T178359) [10:55:31] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389683 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:55:57] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Pool db2092 in s1 and s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389683 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:56:29] (03PS6) 10Addshore: Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) [10:56:31] (03PS5) 10Addshore: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) [10:56:56] (03PS6) 10Addshore: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) [10:57:08] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2092 in s1 and s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389683 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:57:17] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2092 in s1 and s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389683 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:57:36] (03PS1) 10Addshore: wmgWikibaseUseConfigFromWikidataBuild flase for all of BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389684 (https://phabricator.wikimedia.org/T176948) [10:57:53] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [10:58:08] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2092 as recentchanges multi-instance host on s1 and s3 T178359 (duration: 00m 45s) [10:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:14] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:59:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2092 as recentchanges multi-instance host on s1 and s3 T178359 (duration: 00m 45s) [10:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:20] !log reboot cp3007: upgrading kernel to 4.9.51, libssl to 1.0.2m and 1.1.0g [10:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -0.000223 secs [11:01:53] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:03:48] (03PS1) 10Addshore: Load wikibase build from mediawiki-config for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389685 (https://phabricator.wikimedia.org/T176948) [11:03:50] (03PS1) 10Addshore: Load 
wikibase build from mediawiki-config for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389686 (https://phabricator.wikimedia.org/T176948) [11:03:52] (03PS1) 10Addshore: Load wikibase build from mediawiki-config for group0 & group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389687 (https://phabricator.wikimedia.org/T176948) [11:03:53] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -1.6e-05 secs [11:03:55] (03PS1) 10Addshore: Load wikibase build from mediawiki-config for wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389688 (https://phabricator.wikimedia.org/T176948) [11:03:57] (03PS1) 10Addshore: wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389689 (https://phabricator.wikimedia.org/T176948) [11:03:59] (03PS1) 10Addshore: Remove wmgWikibaseUseConfigFromWikidataBuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389690 (https://phabricator.wikimedia.org/T176948) [11:07:23] !log cache_misc rolling reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m and 1.1.0g [11:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:13] (03PS6) 10Addshore: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [11:08:15] (03PS1) 10Addshore: Remove Shared Cache settings from Wikibase-buildentry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389691 (https://phabricator.wikimedia.org/T176948) [11:09:34] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:11:04] 10Operations, 10Ops-Access-Requests, 10Performance-Team (Radar): Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3740595 (10MoritzMuehlenhoff) >>! 
In T179317#3738741, @Gilles wrote: > If what @hoo needs is only a subset of what we have access to, why not create a new group for that? Yep, t... [11:11:27] 10Operations, 10Goal, 10Technical-Debt, 10User-fgiunchedi: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195#3740596 (10MoritzMuehlenhoff) p:05Triage>03High [11:11:34] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.000356 secs [11:13:38] 10Operations, 10Data-Services, 10cloud-services-team (Kanban): Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914#3740602 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:16:18] (03PS1) 10Dzahn: planet: rawdog, use one global config for all langs [puppet] - 10https://gerrit.wikimedia.org/r/389692 [11:17:16] (03CR) 10Paladox: planet: rawdog, use one global config for all langs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389692 (owner: 10Dzahn) [11:20:13] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:21:13] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -5.5e-05 secs [11:24:12] (03PS1) 10Alexandros Kosiaris: kube-proxy: Reload on /etc/default-kube-proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/389693 [11:24:14] (03PS1) 10Alexandros Kosiaris: kubelet: Reload on /etc/default/kubelet changes [puppet] - 10https://gerrit.wikimedia.org/r/389694 [11:24:16] (03PS1) 10Alexandros Kosiaris: Parameterize kubernetes node username [puppet] - 10https://gerrit.wikimedia.org/r/389695 [11:24:54] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [11:25:54] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.000446 secs [11:27:41] (03PS1) 10Alexandros Kosiaris: Change k8s_infrastructure_users structure [labs/private] - 10https://gerrit.wikimedia.org/r/389696 [11:28:46] (03PS2) 10Dzahn: planet: rawdog, use one global config for all langs [puppet] - 
10https://gerrit.wikimedia.org/r/389692 [11:29:11] !log installing java security updates/restarting cassandra on restbase2001 (cassandra3 node) [11:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Change k8s_infrastructure_users structure [labs/private] - 10https://gerrit.wikimedia.org/r/389696 (owner: 10Alexandros Kosiaris) [11:31:50] (03CR) 10Dzahn: "not needed anymore since you were able to just set the variable to '' in Hiera and that made it work?" [puppet] - 10https://gerrit.wikimedia.org/r/389492 (owner: 10Paladox) [11:31:59] (03Abandoned) 10Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - 10https://gerrit.wikimedia.org/r/389492 (owner: 10Paladox) [11:33:29] (03PS4) 10Dzahn: Install php5.5-xml and php5.5-zip [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: 10Umherirrender) [11:34:07] (03PS5) 10Dzahn: contint: Install php5.5-xml and php5.5-zip [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: 10Umherirrender) [11:37:22] (03PS3) 10Dzahn: planet: rawdog, use one global config for all langs [puppet] - 10https://gerrit.wikimedia.org/r/389692 [11:37:58] (03CR) 10Dzahn: [C: 032] contint: Install php5.5-xml and php5.5-zip [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: 10Umherirrender) [11:38:25] (03PS4) 10Dzahn: planet: rawdog, use one global config for all langs [puppet] - 10https://gerrit.wikimedia.org/r/389692 [11:39:19] (03CR) 10Paladox: [C: 031] planet: rawdog, use one global config for all langs [puppet] - 10https://gerrit.wikimedia.org/r/389692 (owner: 10Dzahn) [11:39:55] (03CR) 10Dzahn: [C: 032] "tested in labs by paladox, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/389692 (owner: 10Dzahn) [11:42:46] !log restbase truncating cassandra 2 non-WP tables for T179420 [11:42:52] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:52] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [11:43:26] !log restbase truncating cassandra 2 non-WP tables for T179417 [11:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:31] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [11:51:00] (03PS2) 10ArielGlenn: switch dumps monitor to read status from and write results to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/389667 (https://phabricator.wikimedia.org/T179857) [11:51:47] (03PS1) 10Alexandros Kosiaris: Fix k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/389700 [11:52:02] !log mobrovac@tin Started deploy [restbase/deploy@eab2948]: Use the new storage for wikidata.org - T179417 [11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:09] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [11:55:14] (03PS2) 10Alexandros Kosiaris: Fix k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/389700 [11:56:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/389700 (owner: 10Alexandros Kosiaris) [11:57:54] (03CR) 10Paladox: [C: 031] phabricator: limit http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389459 (owner: 10Dzahn) [11:58:24] (03CR) 10Paladox: "Do we want to make it https only for the backend?" 
[puppet] - 10https://gerrit.wikimedia.org/r/389457 (owner: 10Dzahn) [12:00:16] !log mobrovac@tin Finished deploy [restbase/deploy@eab2948]: Use the new storage for wikidata.org - T179417 (duration: 08m 14s) [12:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:23] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [12:01:02] (03CR) 10Dzahn: "i don't think that's possible without changing the varnish director setup. it's a question for the traffic team but would probably require" [puppet] - 10https://gerrit.wikimedia.org/r/389457 (owner: 10Dzahn) [12:03:12] (03CR) 10Paladox: [C: 031] phabricator: drop ferm rule to open port 443 [puppet] - 10https://gerrit.wikimedia.org/r/389457 (owner: 10Dzahn) [12:13:39] (03PS15) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [12:18:46] (03PS2) 10Alexandros Kosiaris: kube-proxy: Reload on /etc/default-kube-proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/389693 [12:18:48] (03PS2) 10Alexandros Kosiaris: kubelet: Reload on /etc/default/kubelet changes [puppet] - 10https://gerrit.wikimedia.org/r/389694 [12:18:50] (03PS2) 10Alexandros Kosiaris: Parameterize kubernetes node username [puppet] - 10https://gerrit.wikimedia.org/r/389695 [12:21:29] (03PS3) 10ArielGlenn: switch dumps monitor to read status from and write results to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/389667 (https://phabricator.wikimedia.org/T179857) [12:22:10] (03CR) 10ArielGlenn: [C: 032] switch dumps monitor to read status from and write results to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/389667 (https://phabricator.wikimedia.org/T179857) (owner: 10ArielGlenn) [12:42:12] (03CR) 10Muehlenhoff: [C: 031] admins: create perf-team group with gilles, krinkle [puppet] - 10https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: 10Dzahn) 
[12:44:21] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3740805 (10phuedx) >>! In T178189#3740029, @akosiaris wrote: > Do we have a timeline for when we can (and want to) have this working... [12:44:44] (03PS3) 10Dzahn: admins: create perf-team group with gilles, krinkle [puppet] - 10https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) [12:46:13] (03CR) 10Dzahn: [C: 032] admins: create perf-team group with gilles, krinkle [puppet] - 10https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: 10Dzahn) [12:47:22] (03CR) 10Dzahn: "doesn't change any existing access for people" [puppet] - 10https://gerrit.wikimedia.org/r/389663 (https://phabricator.wikimedia.org/T179728) (owner: 10Dzahn) [12:50:35] (03Draft2) 10Jayprakash12345: add _ in appendix talk namespace at mywikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389702 [12:50:45] !log hafnium, tungsten: groupdel perf-roots to go with gerrit:389663 (T179728) [12:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:51] T179728: Create perf-team shell group - https://phabricator.wikimedia.org/T179728 [12:51:11] (03PS3) 10Jayprakash12345: add _ in appendix talk namespace at mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389702 (https://phabricator.wikimedia.org/T179907) [12:53:59] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3740831 (10Dzahn) I deleted the perf-roots group from tungsten and hafnium. Now: ``` [tungsten:~] $ id krinkle uid=2008(krinkle) gid=500(wikidev) gr... 
[12:54:43] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3740832 (10Dzahn) [13:01:26] (03PS5) 10Jayprakash12345: Enable ShortUrl on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) [13:02:44] (03CR) 10Jayprakash12345: "@SWAT member, Please Create shorturl table first before merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [13:10:44] !log rebooting wasat for kernel update [13:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] !log reboot maps eqiad cluster for upgrades [13:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:10] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/cumin] - 10https://gerrit.wikimedia.org/r/389705 [13:19:14] !log rebooting naos for kernel update [13:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/cumin] - 10https://gerrit.wikimedia.org/r/389705 (owner: 10Hashar) [13:31:16] !log Deploy schema change on s5 codfw master (db2023) with replication, this will generate lag on codfw - T174569 [13:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:22] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:32:33] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:37:38] (03PS1) 10Marostegui: mariadb: Reimage db1105 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/389707 [13:38:29] (03CR) 10Marostegui: [C: 032] mariadb: Reimage db1105 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/389707 (owner: 10Marostegui) [13:39:51] (03PS1) 10Marostegui: db-eqiad.php: Add comment about db1105 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389708 [13:43:43] !log starting rolling mx reboots for kernel update [13:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comment about db1105 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389708 (owner: 10Marostegui) [13:45:01] (03Merged) 10jenkins-bot: db-eqiad.php: Add comment about db1105 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389708 (owner: 10Marostegui) [13:46:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comment about db1105 current status (duration: 00m 47s) [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03PS1) 10Marostegui: mariadb: db1105 to become multiinstance for s1,s2 [puppet] - 10https://gerrit.wikimedia.org/r/389709 (https://phabricator.wikimedia.org/T178359) [13:49:31] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/8669/" [puppet] - 10https://gerrit.wikimedia.org/r/389709 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:50:09] (03CR) 10Marostegui: [C: 032] mariadb: db1105 to become multiinstance for s1,s2 [puppet] - 10https://gerrit.wikimedia.org/r/389709 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:50:46] !log rebooted fermium (lists) for kernel update [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:05] (03CR) 10jenkins-bot: db-eqiad.php: Add comment about db1105 status 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/389708 (owner: 10Marostegui) [13:57:34] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171107T1400). [14:00:04] Lucas_WMDE and Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:01:03] !log rebooting mw canaries to 4.9.51 kernel (also picking up openssl/openssl1.1 updates) [14:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:51] @SWAT member merge https://gerrit.wikimedia.org/r/#/c/389702/ first. 
It is not a time-consuming patch [14:01:56] Lucas_WMDE: hello, I can't deploy your patch to change apache rules ( https://gerrit.wikimedia.org/r/#/c/388134/ ) [14:02:26] Lucas_WMDE: that would need a review by someone familiar with those rewrite rules, and then you can sync with operations to find a good time to merge it [14:02:56] !log actually holding mw canary reboots until SWAT is over [14:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:13] (03CR) 10Hashar: [C: 032] add _ in appendix talk namespace at mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389702 (https://phabricator.wikimedia.org/T179907) (owner: 10Jayprakash12345) [14:03:16] Jayprakash12345: doing it :) [14:03:23] moritzm: ah yeah :] [14:03:31] moritzm: it is not going to take long though [14:04:27] (03Merged) 10jenkins-bot: add _ in appendix talk namespace at mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389702 (https://phabricator.wikimedia.org/T179907) (owner: 10Jayprakash12345) [14:04:38] hashar: take your time, there's plenty of other hosts to reboot in the meantime :-) [14:05:33] (03PS3) 10Alexandros Kosiaris: kube-proxy: Reload on /etc/default-kube-proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/389693 [14:05:35] (03PS3) 10Alexandros Kosiaris: kubelet: Reload on /etc/default/kubelet changes [puppet] - 10https://gerrit.wikimedia.org/r/389694 [14:05:37] (03PS3) 10Alexandros Kosiaris: Parameterize kubernetes node username [puppet] - 10https://gerrit.wikimedia.org/r/389695 [14:05:52] (03PS6) 10Hashar: Enable ShortUrl on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [14:06:18] ahh [14:06:21] schema change [14:07:07] (03CR) 10jenkins-bot: add _ in appendix talk namespace at mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389702 (https://phabricator.wikimedia.org/T179907) (owner: 10Jayprakash12345) 
[14:07:31] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: add _ in appendix talk namespace at mywiktionary - T179907 (duration: 00m 45s) [14:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:38] T179907: appendix talk namespace didn't recognize as talk page on Burmese Wiktionary (mywikt) - https://phabricator.wikimedia.org/T179907 [14:07:57] hashar: createExtensionTable has a shorturl option [14:08:00] !log rebooting tureis for kernel update [14:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:32] !log running update.php [14:08:33] ;D [14:09:39] hashar: thanks for giving first priority to add _ in appendix talk namespace at mywiktionary [14:09:52] (03PS4) 10Ema: 5.1.3-1wm2: record-prefix for varnishncsa, run vtc tests [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 [14:09:59] dereckson: I dont even know that script :D [14:11:03] !log mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=pawiki ShortUrl - T178919 [14:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:09] T178919: ShortUrl and WikiLove on Punjabi Wikipedia - https://phabricator.wikimedia.org/T178919 [14:11:23] (03CR) 10Hashar: [C: 032] "Table created" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [14:11:31] dereckson: thank you for the hint! 
[14:11:37] you're welcome [14:12:08] (03CR) 10Alexandros Kosiaris: [C: 032] Parameterize kubernetes node username [puppet] - 10https://gerrit.wikimedia.org/r/389695 (owner: 10Alexandros Kosiaris) [14:12:17] (03CR) 10Alexandros Kosiaris: [C: 032] kube-proxy: Reload on /etc/default-kube-proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/389693 (owner: 10Alexandros Kosiaris) [14:12:23] (03CR) 10Alexandros Kosiaris: [C: 032] kubelet: Reload on /etc/default/kubelet changes [puppet] - 10https://gerrit.wikimedia.org/r/389694 (owner: 10Alexandros Kosiaris) [14:13:03] (03Merged) 10jenkins-bot: Enable ShortUrl on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [14:13:16] (03CR) 10jenkins-bot: Enable ShortUrl on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [14:13:53] Jayprakash12345: it is being deployed [14:14:27] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable ShortUrl on pa.wiki - T178919 (duration: 00m 46s) [14:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:15] hashar: okay, I’ll find a reviewer, thanks [14:15:27] (sorry for not replying earlier, my computer froze up) [14:15:48] hashar: Everything is fine in wikimedia debug. Thanks for the deployment [14:15:55] Jayprakash12345: \o/ [14:16:02] moritzm: done :) [14:16:17] Lucas_WMDE: usually I look at the git log history for the containing folder [14:18:43] PROBLEM - Keyholder SSH agent on naos is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [14:19:24] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:19:31] hashar: well SMalyshev is the one who created the file, and he reviewed it [14:19:59] !log rearming keyholder on naos [14:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:17] !log rebooting mw canaries to 4.9.51 kernel (also picking up openssl/openssl1.1 updates) [14:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] RECOVERY - Keyholder SSH agent on naos is OK: OK: Keyholder is armed with all configured keys. [14:23:33] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [14:25:37] (03PS6) 10BBlack: Add local patch for transaction_timeout [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156) [14:26:21] (03CR) 10Giuseppe Lavagetto: [C: 031] "Thanks to the check that is performed in the puppetmaster-passenger package, and the fact we use a different ssldir for the [master] secti" [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [14:26:26] !log rolling restart of kafka on kafka-jumbo* for jvm security updates [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:00] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3741136 (10awight) Turns out we can't use the other concurrency pools for Celery. The other options, gevent and eventlet, are both s... 
[14:28:21] (03Abandoned) 10Jayprakash12345: Enable local Upload in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385359 (https://phabricator.wikimedia.org/T178660) (owner: 10Jayprakash12345) [14:28:58] (03CR) 10jerkins-bot: [V: 04-1] Add local patch for transaction_timeout [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156) (owner: 10BBlack) [14:30:39] (03PS3) 10Filippo Giunchedi: mx: export metrics from exim4 mainlog [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) [14:30:41] (03PS2) 10Filippo Giunchedi: mtail: add test scaffolding [puppet] - 10https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565) [14:31:49] (03PS1) 10Alexandros Kosiaris: Remove hardcoded client-infrastructure user [puppet] - 10https://gerrit.wikimedia.org/r/389711 [14:32:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove hardcoded client-infrastructure user [puppet] - 10https://gerrit.wikimedia.org/r/389711 (owner: 10Alexandros Kosiaris) [14:33:19] !log reboot lithium for kernel upgrade [14:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] !log reboot wezen for kernel upgrade [14:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:33] (03PS1) 10Ottomata: Make navtiming support nested parsed UA objects, as well as json strings [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) [14:38:19] (03PS2) 10Ottomata: Make navtiming support nested parsed UA objects, as well as json strings [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) [14:38:47] !log restbase creating wiktionary definition schemas for T179420 [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] T179420: Migrate definitions storage from the legacy to new strategy - 
https://phabricator.wikimedia.org/T179420 [14:39:43] (03PS1) 10Muehlenhoff: Create repository components component/elastic55 and thirdparty/elastic55 [puppet] - 10https://gerrit.wikimedia.org/r/389714 [14:39:45] (03PS1) 10Muehlenhoff: Synchronise elastic 5.5 stack to thirdparty/elastic55 [puppet] - 10https://gerrit.wikimedia.org/r/389715 [14:41:59] (03PS1) 10Alexandros Kosiaris: Notify kube-apiserver on user db file changes [puppet] - 10https://gerrit.wikimedia.org/r/389716 [14:44:29] (03CR) 10Alexandros Kosiaris: [C: 032] Notify kube-apiserver on user db file changes [puppet] - 10https://gerrit.wikimedia.org/r/389716 (owner: 10Alexandros Kosiaris) [14:44:47] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:03] (03CR) 10Gehel: [C: 04-1] Synchronise elastic 5.5 stack to thirdparty/elastic55 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389715 (owner: 10Muehlenhoff) [14:48:06] (03CR) 10Filippo Giunchedi: "Note this is the scaffolding only, i.e. 
not yet hooked to tox" [puppet] - 10https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [14:49:47] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:50:46] (03PS2) 10Muehlenhoff: Synchronise elastic 5.5 stack to thirdparty/elastic55 [puppet] - 10https://gerrit.wikimedia.org/r/389715 [14:51:08] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:23] !log roll-restart thumbor for kernel upgrades [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:35] (03PS1) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:02:14] (03CR) 10jerkins-bot: [V: 04-1] move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:02:39] !log mobrovac@tin Started deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage - T179420 [15:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:45] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [15:05:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Actually hostprivkey does absolutely nothing in 2.6+. It's a fully useless setting. hostcert is actually used in a few cases, none of whic" [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [15:07:45] (03PS2) 10Elukey: cassandra: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [15:07:52] gehel: shall we? 
--^ [15:08:06] I had a chat with the Services team and they are ok to merge [15:08:25] I need to reboot aqs nodes for kernel updates this week so I'll couple the two updates [15:08:40] lemme know if you are ok (you can also merge whenever you prefer) [15:10:31] !log mobrovac@tin Finished deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage - T179420 (duration: 07m 52s) [15:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:38] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [15:11:39] (03PS2) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:12:19] <_joe_> !log added a runner for htmlCacheUpdate on cewiki too [15:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:33] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/8671/" [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:12:34] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3741453 (10debt) Perfect, thanks so much for spending the time on this minor tech debt cleanup, @gehel! 👍 [15:12:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment inline. 
Also I am guessing we can add services and pods later on in another patch, so LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:13:28] (03CR) 10jerkins-bot: [V: 04-1] move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:14:11] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8672/" [puppet] - 10https://gerrit.wikimedia.org/r/389484 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:14:13] mutante: hey, https://phabricator.wikimedia.org/T99531#3698826 seems stalled, is there anything I can do to unstall it? [15:15:49] (03CR) 10Mforns: [C: 031] "LGTM! We should merge this soon..." [puppet] - 10https://gerrit.wikimedia.org/r/384093 (https://phabricator.wikimedia.org/T178174) (owner: 10Nuria) [15:16:52] (03PS3) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:17:29] (03CR) 10jerkins-bot: [V: 04-1] move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:19:38] (03PS4) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:19:40] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [15:19:46] (03PS1) 10Herron: puppet: add puppet 4 auth.conf template [puppet] - 10https://gerrit.wikimedia.org/r/389720 (https://phabricator.wikimedia.org/T179722) [15:19:57] someday I'll actually commit all the changes before push [15:20:05] but that day is not today and it 
wasn't yesterday either [15:20:09] (03CR) 10Alexandros Kosiaris: [C: 031] "That a very valid point. That stanza needs to be removed. Wanna bundle it in this patch ? In any case, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) (owner: 10Herron) [15:21:05] (03PS2) 10Filippo Giunchedi: prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) [15:21:23] (03CR) 10Filippo Giunchedi: prometheus: add jobs to scrape metrics from k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:21:35] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:21:58] * godog shakes fist [15:22:44] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review: wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3741493 (10akosiaris) >>! In T179354#3733948, @hashar wrote: > `apt-get build-dep h... 
[15:22:54] !log mobrovac@tin Started deploy [restbase/deploy@eab2948]: revert definition switch, wrong schema - T179420 [15:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:00] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [15:28:21] (03PS1) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [15:28:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:29:34] (03PS5) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:29:40] !log mobrovac@tin Finished deploy [restbase/deploy@eab2948]: revert definition switch, wrong schema - T179420 (duration: 06m 46s) [15:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:48] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [15:30:19] Amir1: o/ [15:30:30] elukey: hey there [15:30:32] sup [15:30:36] can I ask you something about ores? Or better to Aaron? [15:30:52] elukey: he is traveling atm [15:30:55] elukey: hi! [15:30:59] Happy to give it a shot... [15:31:03] hi! [15:31:04] so ask me and if I can answer [15:31:07] thanks a lot :) [15:31:41] o/ [15:31:45] so we have a lot of 500s in cache::misc for Ores, iirc due to a bug in how malformed reqs are handled [15:31:47] Just got online at my hotel in Rio :) [15:31:53] halfak: o/ [15:31:58] in Rio?? [15:32:05] wow [15:32:07] Yup. 
de Janeiro :) [15:32:12] elukey: that's a known issue, we are working on it [15:32:36] elukey, I implemented some fixes, but the deploy of those fixes got hung on some (probably) scap issues. [15:32:44] all right, just wanted to know if it was supposed to go out with yesterday's deployment or not [15:32:50] (03PS2) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [15:32:58] * halfak tries to remember what day it is [15:33:10] elukey: T179712 T179711 fyi [15:33:10] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [15:33:10] T179712: ORES 500s when model_info lookup fails due to a key error - https://phabricator.wikimedia.org/T179712 [15:33:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [15:33:19] awight: ack thanks! [15:33:25] any timeline for getting the fix deployed? [15:34:04] elukey: Pretty sure we can get that out by Thursday. Is it causing any trouble other than the monitor spam? [15:34:17] All we’re going to do FYI is change those to 404 responses. [15:35:19] awight: super, nothing major but it shadows our ability to spot other cache misc issues etc.. [15:35:31] so I was just checking in :) [15:35:34] will wait for the final fix [15:35:39] thursday seems fine [15:35:41] *400 [15:35:45] Not 404 :) [15:35:59] But also it will fix the original problem so the errors will stop too [15:36:24] Since the loop seems to be caused by the new thresholds lookup failing and the old one returning a 500 error.
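The fix discussed above (answering a failed lookup with a 400 instead of letting it surface as a 500) can be sketched roughly as follows. This is only an illustration of the idea, not the ORES code: `THRESHOLDS`, `lookup_threshold` and `handle_request` are hypothetical names, and the example data is made up.

```python
# Hypothetical sketch: report a failed lookup as HTTP 400, not HTTP 500.
# None of these names come from ORES; they only illustrate the idea.

THRESHOLDS = {("enwiki", "damaging"): 0.5}  # made-up example data


def lookup_threshold(wiki, model):
    """Return the stored threshold; raises KeyError for an unknown pair."""
    return THRESHOLDS[(wiki, model)]


def handle_request(wiki, model):
    """Return a (status_code, body) tuple for a threshold lookup."""
    try:
        value = lookup_threshold(wiki, model)
    except KeyError:
        # The client asked for something that does not exist. That is a
        # client error (4xx), so it should not be reported as a 500.
        return 400, {"error": "unknown wiki/model: %s/%s" % (wiki, model)}
    return 200, {"threshold": value}
```

Catching the lookup failure at the request boundary keeps bad requests out of the 5xx counts, which is what was making other cache::misc issues hard to spot.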
[15:37:16] !log T179420: recreating wiktionary definition schemas [15:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:22] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [15:39:21] (03PS3) 10Filippo Giunchedi: prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) [15:40:33] elukey: sorry for the delay, I got distracted by a screaming kid... [15:40:48] he gets the precedence of course :) [15:41:22] elukey: yeah, he is usually really nice, but this time the mother needed some moral support to get through... [15:42:11] if you're good for that cassandra restart, I can merge the change, I'll take care of cassandra maps and I'll let you take care of restbase? [15:42:34] I'll take care of AQS, the services team is already up to speed and they'll take care of the restarts [15:42:58] ok, so maps for me, aqs for you and restbase for the service team? [15:43:09] should we ping someone in particular in the service team? [15:43:13] !log roll-restart thumbor in eqiad for kernel upgrade [15:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:25] elukey: you rebased that change 38 minutes ago and it is still current? What's happening? [15:44:40] elukey: ok if I merge now? 
[15:45:04] gehel: I was waiting your ack before proceeding, please merge :) [15:45:10] (03CR) 10Gehel: [C: 032] cassandra: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [15:45:55] (03PS4) 10Herron: puppet: fix apache puppet-master.conf conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) [15:46:15] (03CR) 10Hashar: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 (owner: 10Ema) [15:46:26] elukey: done. I'm going to restart the maps-test cluster to make sure everything is alright [15:46:33] (03PS6) 10ArielGlenn: move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) [15:46:42] (03CR) 10Hashar: "The CI job should now have:" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 (owner: 10Ema) [15:46:47] !log rolling restart of cassandra maps-test for logging change [15:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:36] (03CR) 10ArielGlenn: [C: 032] move explicit references to users and groups out of snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/389719 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:47:43] (03CR) 10Herron: [C: 032] "Sure, next patch removes that section." 
[puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) (owner: 10Herron) [15:47:48] (03PS5) 10Herron: puppet: fix apache puppet-master.conf conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) [15:50:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3741582 (10herron) [15:50:37] 10Operations, 10Puppet, 10Patch-For-Review: Puppet4: Create empty/placeholder /etc/apache2/sites-enabled/puppet-master.conf - https://phabricator.wikimedia.org/T179102#3741580 (10herron) 05Open>03Resolved a:03herron [15:50:53] elukey: looking good, cassandra up and running, logs still flowing to logstash [15:51:47] goood! [15:52:19] elukey: Oh, I think we don't even need a restart, it looks like we do have scan=true in the logback config [15:52:22] (03PS1) 10Cmjohnson: Removing site.pp and dhcpd entries for decom db's db1028,33,[35-38] [puppet] - 10https://gerrit.wikimedia.org/r/389727 [15:52:27] (03PS2) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [15:53:53] elukey: let me confirm that on the other maps clusters [15:56:11] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:22] elukey: yep, no restart required, auto reload seems to work just fine [15:57:30] woa really? [15:57:52] yeah, I should have checked that before... but for once we have a config that make sense! 
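The auto reload mentioned above relies on logback's `scan` attribute on the `<configuration>` element, which makes logback re-read its config file periodically. A minimal sketch of checking for it before planning a restart could look like this. The sample XML is made up for illustration and is not the actual Cassandra logback config; `scan_enabled` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real Cassandra logback.xml is not shown in this log.
SAMPLE_LOGBACK_XML = """
<configuration scan="true" scanPeriod="60 seconds">
  <root level="INFO"/>
</configuration>
"""


def scan_enabled(logback_xml):
    """Return True if the logback config asks for periodic auto-reload."""
    root = ET.fromstring(logback_xml.strip())
    # logback reads the 'scan' attribute on the top-level <configuration>.
    return root.tag == "configuration" and root.get("scan", "false").lower() == "true"
```

If this returns True for the deployed config, a logging change should be picked up without a rolling restart, as observed on the maps clusters.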
[15:59:18] so you should see the new lvs endpoint getting more traffic asap :) [15:59:35] (03PS2) 10Herron: Puppet: Change hostcert and remove hostprivkey master settings [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) [16:00:07] (03PS1) 10ArielGlenn: dumpsgen user should own the dumps repo for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/389728 [16:00:19] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3734260 (10RobH) Please note this is out of warranty. Disk failed: Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 WWN: 5000C5005E6... [16:00:20] (03CR) 10Ema: [C: 031] smart: enable SMART health collection in esams [puppet] - 10https://gerrit.wikimedia.org/r/389484 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:00:25] (03CR) 10Ema: [C: 031] smart: enable SMART health collection in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:00:34] elukey: mediawiki is already using the LVS endpoint, I doubt that cassandra is going to be noticeable :) [16:00:48] well, not the same port, so yes, I should see something... [16:01:13] I'll check the previous endpoint to see that it does not have any traffic left [16:01:39] (03PS2) 10Cmjohnson: Removing site.pp and dhcpd entries for decom db's db1028,33,[35-38,41 [puppet] - 10https://gerrit.wikimedia.org/r/389727 [16:01:48] (03CR) 10Herron: "Sounds reasonable to me. 
Removed hostprivkey setting and updated commit msg accordingly" [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [16:07:22] (03PS2) 10ArielGlenn: dumpsgen user should own the dumps repo for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/389728 [16:08:46] (03PS2) 10Filippo Giunchedi: smart: enable SMART health collection in esams [puppet] - 10https://gerrit.wikimedia.org/r/389484 (https://phabricator.wikimedia.org/T86552) [16:11:02] (03CR) 10Filippo Giunchedi: [C: 032] smart: enable SMART health collection in esams [puppet] - 10https://gerrit.wikimedia.org/r/389484 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:11:38] (03CR) 10ArielGlenn: [C: 032] dumpsgen user should own the dumps repo for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/389728 (owner: 10ArielGlenn) [16:11:46] (03PS3) 10ArielGlenn: dumpsgen user should own the dumps repo for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/389728 [16:12:07] !log start cache_text/upload rolling reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m and 1.1.0g [16:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:13] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3741664 (10RobH) [16:27:26] (03PS1) 10ArielGlenn: don't keep and serve old slowparse logs forever [puppet] - 10https://gerrit.wikimedia.org/r/389732 (https://phabricator.wikimedia.org/T174421) [16:28:00] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:31:32] 10Operations, 10ops-esams, 10Traffic: cp4043 disk failure - https://phabricator.wikimedia.org/T179953#3741749 (10RobH) [16:31:53] 10Operations, 10ops-esams, 10Traffic: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741763 (10BBlack) [16:33:50] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:39:46] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3741790 (10akosiaris) >>! In T178189#3740805, @phuedx wrote: >>>! In T178189#3740029, @akosiaris wrote: >> Do we have a timeline for... [16:41:59] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:52] Yay, VectorBeta was uninstalled but not removed from the branch list [16:43:08] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [16:44:30] Removing extensions is almost as hard as adding them [16:45:01] (03PS1) 10Chad: Remove VectorBeta, isn't installed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389734 [16:45:11] (03CR) 10Chad: [C: 032] Remove VectorBeta, isn't installed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389734 (owner: 10Chad) [16:45:38] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0 [16:45:48] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [16:45:58] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1383 days) [16:46:20] (03CR) 10Alexandros Kosiaris: [C: 031] prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 
(https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [16:46:25] (03Merged) 10jenkins-bot: Remove VectorBeta, isn't installed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389734 (owner: 10Chad) [16:46:25] that was me ^ [16:47:08] (03CR) 10jenkins-bot: Remove VectorBeta, isn't installed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389734 (owner: 10Chad) [16:47:49] !log demon@tin Synchronized multiversion/submodules.json: no op (duration: 00m 47s) [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:20] (03CR) 10Umherirrender: contint: Install php5.5-xml and php5.5-zip (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: 10Umherirrender) [16:57:13] (03PS4) 10Filippo Giunchedi: prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) [16:57:22] (03CR) 10Alexandros Kosiaris: [C: 031] Puppet: Change hostcert and remove hostprivkey master settings [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [16:57:40] (03CR) 10Herron: [C: 032] Puppet: Change hostcert and remove hostprivkey master settings [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [16:57:43] (03PS3) 10Herron: Puppet: Change hostcert and remove hostprivkey master settings [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) [16:57:58] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:58:49] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [17:00:04] godog, moritzm, and _joe_: I 
seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171107T1700). [17:00:04] greg-g and Lucas_WMDE: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:01:07] Amir1: ping ^ :) [17:01:47] jouncebot has a sass module? [17:02:13] Jouncebot also didn't announce puppetswat just now [17:02:18] !log Restarting Cassandra, restbase2001-{a,b,c} to apply OpenJDK upgrade [17:02:21] * no_justification mutters some profanity [17:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:48] it did afaics! [17:02:58] anyways I'm taking a look at the patches [17:03:08] (03PS2) 10Filippo Giunchedi: betacluster: add -videoscaler to scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/389642 (https://phabricator.wikimedia.org/T179688) (owner: 10Greg Grossmeier) [17:04:01] (03CR) 10Filippo Giunchedi: [C: 032] betacluster: add -videoscaler to scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/389642 (https://phabricator.wikimedia.org/T179688) (owner: 10Greg Grossmeier) [17:04:17] Lucas_WMDE: thanks [17:04:58] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:05:09] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [17:05:33] godog: let me know if you need me, I'm going into a meeting now :) I'll be responsive [17:05:56] greg-g: np, yeah I JFDI for your patch as it is beta only / trivial [17:05:59] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-08-17 16:11:39 +0000 (expires in 282 days) [17:06:09] RECOVERY - cassandra-a CQL 
10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [17:06:13] (03PS4) 10Herron: Puppet: Change hostcert and remove hostprivkey master settings [puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) [17:06:18] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [17:06:24] Lucas_WMDE no_justification are there tasks for your patches btw? [17:06:48] no task for my patch [17:06:49] godog: Nope, I never had a task for my docroot cleanups [17:06:52] Been jfdi :) [17:07:10] (but there's a number of other ones in the git log) [17:08:00] yeah I remember going over some of those in the past [17:08:59] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [17:09:08] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:10:08] RECOVERY - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-b valid until 2018-08-17 16:11:40 +0000 (expires in 282 days) [17:10:30] godog: word [17:10:59] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042 [17:13:34] (03PS2) 10Filippo Giunchedi: Swap wikimediafoundation.org over to using standard-docroot [puppet] - 10https://gerrit.wikimedia.org/r/385112 (owner: 10Chad) [17:13:38] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.164 and port 9042: Connection refused [17:14:08] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:14:39] (03CR) 10Filippo Giunchedi: [C: 032] Swap wikimediafoundation.org over to using standard-docroot [puppet] - 10https://gerrit.wikimedia.org/r/385112 
(owner: 10Chad) [17:14:58] no_justification: merged [17:15:08] godog: Thanks so much! :) [17:15:08] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2018-08-17 16:11:42 +0000 (expires in 282 days) [17:15:38] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [17:16:00] (03PS1) 10BBlack: ulsfo subnet comment fixup [dns] - 10https://gerrit.wikimedia.org/r/389738 [17:16:02] (03PS1) 10BBlack: eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) [17:16:18] (03CR) 10jerkins-bot: [V: 04-1] eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [17:16:43] (03PS1) 10BBlack: ulsfo definition fixups [puppet] - 10https://gerrit.wikimedia.org/r/389740 [17:16:45] (03PS1) 10BBlack: eqsin: basics [puppet] - 10https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) [17:17:26] (03CR) 10jerkins-bot: [V: 04-1] eqsin: basics [puppet] - 10https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [17:17:31] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [17:17:39] (03CR) 10jerkins-bot: [V: 04-1] eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [17:17:56] (03PS1) 10Eevans: Log every retry warning [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389742 [17:17:58] (03PS1) 10Eevans: Use a more realistic defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/389743 [17:18:26] Lucas_WMDE: can you give me some context on the patch? 
I asked for a task hoping to get context from there [17:19:03] godog: Wikidata uses URIs like in its RDF export [17:19:17] and we redirect them to /wiki/Special:EntityData/$1 (which then does content negotiation) [17:19:40] the added URIs are used as RDF predicates for normalized quantities and identifiers [17:19:45] hashar: ? 2x CI failures here that seem like basic git-fetch failures of different kinds: https://gerrit.wikimedia.org/r/#/c/389739 [17:19:57] (03PS2) 10BBlack: eqsin: basics [puppet] - 10https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) [17:20:28] e. g. you have the statement wd:Q42 wdtn:P1015 in our RDF data on Douglas Adams (https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl) [17:20:58] bblack that error seems to be the same one reported a few days by another user. [17:21:14] which, if you expand the wd: and wdtn: prefixes, stands for [17:21:15] [17:21:32] 10Operations, 10ops-esams, 10Traffic: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741889 (10RobH) I have requested parts dispatch SR956320029. Once they notify me of shipment, I'll open an inbound shipment request with EvoSwitch, as well as a smart hands ticket for them to swap the SSD... 
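The prefix expansion described above can be sketched as follows. The two prefix URIs are the standard Wikidata RDF prefixes for `wd:` and `wdtn:`, supplied here as background knowledge rather than recovered from this log; `PREFIXES` and `expand` are hypothetical names used only for illustration.

```python
# Standard Wikidata RDF prefixes (assumption: not taken from the log itself).
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdtn": "http://www.wikidata.org/prop/direct-normalized/",
}


def expand(curie):
    """Expand a prefixed name such as 'wd:Q42' into a full URI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


if __name__ == "__main__":
    # The two terms of the example statement about Douglas Adams.
    for term in ("wd:Q42", "wdtn:P1015"):
        print(term, "->", expand(term))
```

Each expanded URI is something a client may try to dereference, which is why both need a matching rewrite rule on the web server side.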
[17:21:38] 10Operations, 10ops-esams, 10Traffic: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741890 (10RobH) a:05mark>03RobH [17:22:05] and currently, the first of those links (wd:Q42) resolves as expected, but the second one (/prop/direct-normalized/P1015) doesn’t, because the rewrite rule for /prop/direct-normalized/ is missing, so it uses the generic /prop/ rule [17:23:37] Lucas_WMDE: ack, thanks [17:24:30] (the content negotiation is also described on https://www.wikidata.org/wiki/Wikidata:Data_access#Linked_Data_interface, though that focuses mostly on the entity URIs, not the predicate URIs) [17:25:17] (03PS2) 10Filippo Giunchedi: Add rewrite rules for normalized Wikidata predicates [puppet] - 10https://gerrit.wikimedia.org/r/388134 (owner: 10Lucas Werkmeister (WMDE)) [17:25:22] (03PS1) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [17:25:55] (03CR) 10Filippo Giunchedi: [C: 032] Add rewrite rules for normalized Wikidata predicates [puppet] - 10https://gerrit.wikimedia.org/r/388134 (owner: 10Lucas Werkmeister (WMDE)) [17:26:00] (03CR) 10jerkins-bot: [V: 04-1] move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [17:26:04] (03PS5) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [17:26:38] Lucas_WMDE: merged [17:26:44] godog: thank you!
:) [17:27:08] Lucas_WMDE: np, should be live in the next 30 min or so [17:27:30] (03PS2) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [17:29:50] (03PS1) 10Filippo Giunchedi: prometheus: reference k8s scrape_configs_extra [puppet] - 10https://gerrit.wikimedia.org/r/389746 [17:31:03] (03PS2) 10BBlack: eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) [17:31:09] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: reference k8s scrape_configs_extra [puppet] - 10https://gerrit.wikimedia.org/r/389746 (owner: 10Filippo Giunchedi) [17:32:47] !log rolling reboot of scb in codfw to pick up new kernel (and openssl updates) [17:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:57] !log Clearing Cassandra snapshots (T179422) [17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:03] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:38:55] !log Restart Cassandra, restbase1010-{a,b,c}.eqiad.wmnet (T178177) [17:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:01] T178177: Investigate aberrant Cassandra columnfamily read latency of restbase1010 - https://phabricator.wikimedia.org/T178177 [17:40:19] (03CR) 10Nuria: "Andrew, would you be so kind as to merge this one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/384093 (https://phabricator.wikimedia.org/T178174) (owner: 10Nuria) [17:41:29] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused [17:43:28] RECOVERY - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.114 port 9042 [17:45:59] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:46:08] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused [17:46:59] RECOVERY - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-b valid until 2018-08-17 16:11:06 +0000 (expires in 282 days) [17:47:08] RECOVERY - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.115 port 9042 [17:47:18] PROBLEM - puppet last run on mc2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:18] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:59] (03PS3) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [17:49:36] (03CR) 10jerkins-bot: [V: 04-1] move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [17:49:48] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:58] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:49:59] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:59] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:08] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused [17:50:09] PROBLEM - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:50:19] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:19] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:28] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:28] PROBLEM - puppet last run on ms-fe2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:29] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:38] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:39] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:46] puppet shower? [17:50:48] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:49] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:50:51] yikes [17:50:58] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:00] Error 400 on SERVER: Could not retrieve facts for analytics1064.eqiad.wmnet: Failed to find facts from PuppetDB at nitrogen.eqiad.wmnet:443: undefined method `content' for nil:NilClass [17:51:08] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:10] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042 [17:51:10] RECOVERY - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-c valid until 2018-08-17 16:11:07 +0000 (expires in 282 days) [17:51:10] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:18] PROBLEM - puppet last run on acrab is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:19] Error 400 on SERVER: Failed to submit 'replace facts' command for mw2215.codfw.wmnet to PuppetDB at nitrogen.eqiad.wmnet:443: undefined method `content' for nil:NilClass [17:51:29] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:29] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:38] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:38] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:51:38] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:39] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:39] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:39] PROBLEM - puppet last run on thumbor2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:43] (03PS4) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [17:51:48] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:48] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:56] stopping ircecho [17:52:01] thanks :) [17:52:13] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:13] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:13] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:13] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:13] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:14] PROBLEM - puppet last run on restbase2010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:52:14] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:19] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:29] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on ores2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on mc2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:39] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:42] puppetdb random carnage? [17:52:58] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:53:03] Warning: Error 400 on SERVER: Could not retrieve facts for cp1068.eqiad.wmnet: Failed to find facts from PuppetDB at nitrogen.eqiad.wmnet:443: undefined method `content' for nil:NilClass [17:53:09] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:53:09] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:53:18] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:53:28] seems so [17:54:09] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:09] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:13] although no oom-killer on nitrogen [17:54:19] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:38] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:39] PROBLEM - puppet last run on db1100 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:48] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:49] PROBLEM - puppet last run on mc2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:49] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:49] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:49] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:58] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:59] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:54:59] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:55:25] !log stop ircecho on einstenium (puppet shower from nitrogen) [17:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:04] during the time of the problems database garbage collection happened on nitrogen [17:56:11] but that might be a red herring [17:56:13] yes but lasted 2 seconds [17:56:25] !log Decommissioning restbase2001-a.codfw.wmnet (T179422) [17:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:31] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:56:31] last change merged was https://gerrit.wikimedia.org/r/389746 but it doesn't seem the trigger [17:56:35] yeah, but the rate of failures should be about the fallout of two seconds, right? [17:56:37] I don't see a clearreason [17:56:37] Active: active (running) since Mon 2017-11-06 11:47:56 UTC; 1 day 6h ago [17:58:18] elukey: are we getting new failures or the failure shower is over? [17:58:25] is this postgres log message of any concern? GMT WARNING: there is already a transaction in progress [17:58:31] on nitrogen [17:58:47] volans: I stopped ircecho [17:59:01] elukey: that's why I'm asking you ;) [17:59:32] I have no idea, we should check icinga :D [17:59:55] or simply run puppet on a failed host [17:59:57] and see [17:59:57] herron: doesn't look super good to have those, but seems also pretty normal in the log [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171107T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:21] no parsoid deploy today [18:00:29] elukey: yeah, I thought you had a tail on the file [18:00:32] anyway it seems to still fail [18:00:34] nothing for ORES [18:00:36] should we restart it? [18:00:44] elukey how did you kill ircecho? I tried sudo kill pid but it seemed to keep running [18:01:00] herron: it gets restarted by puppet after a while [18:01:07] service ircecho stop :) [18:01:20] volans: I ran puppet on mw1284 and it went fine [18:01:28] PROBLEM - puppet last run on mw1284 etc.. [18:01:29] ah of course [18:01:46] I run it on neodymium and failed [18:01:49] now is going through [18:01:53] mmmh [18:02:06] and now failed again [18:02:37] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, 10Elasticsearch: Created dedicated elastic component in our APT repository - https://phabricator.wikimedia.org/T179964#3742139 (10Gehel) [18:03:03] ah yes same for me on mw1284 [18:03:25] 10Operations, 10netops: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742155 (10fgiunchedi) While investigating with @ayounsi it emerged that regular syslog is also blocked from analytics to prod. Please allow udp/514 as well, thanks! [18:03:31] herron: we're still with only eqiad puppetmasters right? [18:03:41] volans yes [18:04:13] 10Operations, 10netops: Allow syslog-tls and syslog in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742159 (10fgiunchedi) [18:04:51] on nitrogen's nginx error log I can see "connect() failed (111: Connection refused) while connecting to upstream," [18:05:35] shall we try bouncing puppetdb service? 
[18:05:56] that would be my next action [18:06:02] it doesn't log anything when it fails [18:06:05] just tested [18:06:55] !log restarting puppetdb service on nitrogen [18:06:56] (03CR) 10Zoranzoki21: [C: 031] Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (https://phabricator.wikimedia.org/T147938) (owner: 10Chad) [18:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] (03CR) 10Zoranzoki21: [C: 031] Swap git.wikimedia.org -> phabricator.wikimedia.org [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/389655 (https://phabricator.wikimedia.org/T139089) (owner: 10Chad) [18:07:57] I can see 200s now in the nginx access logs [18:08:06] still getting some failures from nginx but should be different and known [18:08:10] (03PS5) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [18:08:29] not sure if related [18:08:37] elukey: I'm still getting failures [18:08:53] me too on mw1284 :( [18:09:58] there are a lot of connect() failed (111: Connection refused) while connecting to upstream, [18:10:30] could it be related to https://gerrit.wikimedia.org/r/#/c/386666/ ? [18:10:39] 10Operations, 10netops: Allow syslog-tls and syslog in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742177 (10ayounsi) 05Open>03Resolved a:03ayounsi Done. 
[18:10:41] that went in about an hour ago [18:11:19] I don't see how, that should be black/white, either works or not :D [18:11:41] yeah, I'd think so too [18:12:09] I am wondering if nginx is ok or not, it is difficult to tell if puppetdb is healthy from the logs [18:12:29] it fails many times to contact upstream [18:12:39] but jetty is not logging anything useful (surprise surprise) [18:13:03] (03PS2) 10Ottomata: Kafka: Enable topic deletion for Kafka by default [puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: 10Ppchelko) [18:13:44] the other weird thing is that I don't see 50x in the nginx logs [18:14:51] I mean, something like [18:14:58] connect() failed (111: Connection refused) while connecting to upstream, client: 10.64.48.45, server: , request: "GET /v3/nodes/mw1245.eqiad.wmnet/facts HTTP/1.1" [18:15:03] should end up in a 503 no? [18:15:05] !log awight@tin Started deploy [ores/deploy@29905e5]: test deployment to ores* (non-production) [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:09] !log awight@tin Finished deploy [ores/deploy@29905e5]: test deployment to ores* (non-production) (duration: 01m 04s) [18:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:04] (03PS1) 10Chad: group0 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389761 [18:24:43] !log demon@tin Started scap: wmf.7 bootstrap [18:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] (03PS3) 10Arturo Borrero Gonzalez: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [18:28:37] bd808: ^^^ will try this https://wikitech.wikimedia.org/wiki/Puppet_coding/Testing (i.e, get the puppet compiler in jenkins to show me the changes) [18:30:09] arturo: cool. 
I didn't even know we had that class :) [18:30:35] which class? [18:31:04] please don't merge puppet changes for a bit, we're in the middle of a puppet outage ;) [18:36:07] <_joe_> !log restarting apache2 on rhodium [18:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:22] !log awight@tin Started deploy [ores/deploy@29905e5]: test deployment to ores* (non-production) [18:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:42] !log awight@tin Finished deploy [ores/deploy@29905e5]: test deployment to ores* (non-production) (duration: 00m 20s) [18:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] !log slowly running puppet on failed hosts with cumin (concurrency=5) [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:37] (03CR) 10Hashar: "That solved it. Thank you Daniel." [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) (owner: 10Umherirrender) [18:49:36] !log awight@tin Started deploy [ores/deploy@29905e5]: test deployment to repair ores1002 (non-production) [18:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:07] !log awight@tin Finished deploy [ores/deploy@29905e5]: test deployment to repair ores1002 (non-production) (duration: 00m 33s) [18:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:15] 10Operations, 10ops-eqiad, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3742297 (10RobH) [18:51:24] 10Operations, 10ops-eqiad, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3742313 (10RobH) [18:56:14] !log awight@tin Started deploy [ores/deploy@29905e5]: test deployment to repair ores1008 (non-production) [18:56:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:43] (03PS1) 10Chad: Fix reverse-proxy symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389779 [18:56:46] !log awight@tin Finished deploy [ores/deploy@29905e5]: test deployment to repair ores1008 (non-production) (duration: 00m 32s) [18:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:41] !log awight@tin Started deploy [ores/deploy@29905e5]: test deployment to repair ores1009 (non-production) [18:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:12] !log awight@tin Finished deploy [ores/deploy@29905e5]: test deployment to repair ores1009 (non-production) (duration: 00m 31s) [18:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:11] (03PS1) 10Chad: Remove PrivateSettings.php symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 [19:03:46] !log begin stress test on ores* [19:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:41] (03CR) 10Ottomata: [C: 032] Removing appInstallId from whitelist [puppet] - 10https://gerrit.wikimedia.org/r/384093 (https://phabricator.wikimedia.org/T178174) (owner: 10Nuria) [19:11:17] (03PS4) 10Ottomata: Removing appInstallId from whitelist [puppet] - 10https://gerrit.wikimedia.org/r/384093 (https://phabricator.wikimedia.org/T178174) (owner: 10Nuria) [19:11:37] (03CR) 10Ottomata: [V: 032 C: 032] Removing appInstallId from whitelist [puppet] - 10https://gerrit.wikimedia.org/r/384093 (https://phabricator.wikimedia.org/T178174) (owner: 10Nuria) [19:12:43] (03CR) 10Ottomata: "Luca, are you doing JVM restarts? Have you done main Kafka yet? If not, could you merge this first and then do them?" 
[puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: 10Ppchelko) [19:12:58] !log demon@tin Finished scap: wmf.7 bootstrap (duration: 48m 15s) [19:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:46] (03CR) 10Zoranzoki21: [C: 031] Remove PrivateSettings.php symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 (owner: 10Chad) [19:18:21] (03CR) 10Zoranzoki21: [C: 031] Fix reverse-proxy symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389779 (owner: 10Chad) [19:21:45] (03PS1) 10BBlack: Add phab1001-aphlict alias [dns] - 10https://gerrit.wikimedia.org/r/389782 (https://phabricator.wikimedia.org/T112765) [19:22:26] (03CR) 10Chad: [C: 04-2] "Not yet, I'm paranoid" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 (owner: 10Chad) [19:22:39] (03CR) 10BBlack: [C: 032] Add phab1001-aphlict alias [dns] - 10https://gerrit.wikimedia.org/r/389782 (https://phabricator.wikimedia.org/T112765) (owner: 10BBlack) [19:23:46] (03PS1) 1020after4: phabricator: enable notification service (aphlict) [puppet] - 10https://gerrit.wikimedia.org/r/389783 [19:25:34] (03CR) 10Paladox: [C: 031] phabricator: enable notification service (aphlict) [puppet] - 10https://gerrit.wikimedia.org/r/389783 (owner: 1020after4) [19:31:05] !log restart apache2 service on rhodium [19:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:16] (03PS1) 10BBlack: cache_misc: routing for aphlict websockets [puppet] - 10https://gerrit.wikimedia.org/r/389785 [19:35:40] (03CR) 1020after4: [C: 031] cache_misc: routing for aphlict websockets [puppet] - 10https://gerrit.wikimedia.org/r/389785 (owner: 10BBlack) [19:37:42] (03CR) 10BBlack: [C: 032] phabricator: enable notification service (aphlict) [puppet] - 10https://gerrit.wikimedia.org/r/389783 (owner: 1020after4) [19:37:59] (03PS2) 10BBlack: cache_misc: routing for aphlict websockets [puppet] - 
10https://gerrit.wikimedia.org/r/389785 [19:38:06] (03CR) 10BBlack: [C: 032] cache_misc: routing for aphlict websockets [puppet] - 10https://gerrit.wikimedia.org/r/389785 (owner: 10BBlack) [19:44:57] !log restarted apache2 on rhodium (puppet master failing) [19:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] no_justification: #bothumor I � Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171107T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:24] o/ [20:01:26] my eyes are here no_justification :) [20:01:46] Well it's already live on testwiki :) [20:01:54] Just gotta do rest of group0 [20:03:08] awesome! [20:03:31] (03CR) 10Chad: [C: 032] group0 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389761 (owner: 10Chad) [20:04:48] (03Merged) 10jenkins-bot: group0 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389761 (owner: 10Chad) [20:04:53] I'm getting a 403 error on phab: https://gist.github.com/pnorman/ee87dab0b40168b5a413164a0c31fee2 [20:05:29] hm, can be the IP [20:05:39] phab works for me [20:06:44] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.7 [20:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] (03CR) 10jenkins-bot: group0 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389761 (owner: 10Chad) [20:07:26] looks lovely [20:10:24] Yep, everything seems to be working just fine on my end [20:10:37] Swapping to Phab for submodules worked flawlessly, btw [20:10:48] lovely! [20:11:13] no_justification: are you okay for me to go ahead with some config changes then? [20:11:29] Yeah go for it, I'll be around to keep an eye on things too [20:11:32] okay! 
[20:12:17] oooh, and apparently noone has touched mediawiki-config since i made them all, so no rebase needed [20:12:27] (03CR) 10Addshore: [C: 032] Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 (owner: 10Addshore) [20:14:16] Sagan: Yes, it works fine from my German server. But locally I'm getting it on not just curl, but browsers too. [20:15:03] (03PS4) 10Addshore: Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 [20:15:09] pnorman: some ips recive this error. this is to prevent wp0 abuse on phab. not sure, if your ip is on that, but it could be [20:15:09] (03CR) 10Addshore: [C: 032] Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 (owner: 10Addshore) [20:15:14] (03PS7) 10Addshore: Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) [20:15:19] (03PS7) 10Addshore: Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) [20:15:23] (03PS7) 10Addshore: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) [20:15:30] (03PS2) 10Addshore: wmgWikibaseUseConfigFromWikidataBuild flase for all of BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389684 (https://phabricator.wikimedia.org/T176948) [20:15:35] (03PS2) 10Addshore: Load wikibase build from mediawiki-config for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389685 (https://phabricator.wikimedia.org/T176948) [20:15:41] (03PS2) 10Addshore: Load wikibase build from mediawiki-config for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389686 
(https://phabricator.wikimedia.org/T176948) [20:15:46] (03PS2) 10Addshore: Load wikibase build from mediawiki-config for group0 & group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389687 (https://phabricator.wikimedia.org/T176948) [20:15:51] (03PS2) 10Addshore: Load wikibase build from mediawiki-config for wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389688 (https://phabricator.wikimedia.org/T176948) [20:15:56] (03PS2) 10Addshore: wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389689 (https://phabricator.wikimedia.org/T176948) [20:16:06] or maybe they did need a rebase! [20:16:06] pnorman: you might want to try asking in #wikimedia-releng as well, not really sure who is taking care of phab [20:16:19] (03Merged) 10jenkins-bot: Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 (owner: 10Addshore) [20:17:16] (03CR) 10jenkins-bot: Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 (owner: 10Addshore) [20:17:55] (03PS1) 10Herron: Revert "Puppet: Change hostcert and remove hostprivkey master settings" [puppet] - 10https://gerrit.wikimedia.org/r/389791 [20:19:28] !log addshore@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:381193|Remove unused wmgUseWikibasePropertySuggester]] PT 1/2 (duration: 00m 50s) [20:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:51] (03CR) 10Herron: [C: 032] Revert "Puppet: Change hostcert and remove hostprivkey master settings" [puppet] - 10https://gerrit.wikimedia.org/r/389791 (owner: 10Herron) [20:19:58] (03PS2) 10Herron: Revert "Puppet: Change hostcert and remove hostprivkey master settings" [puppet] - 10https://gerrit.wikimedia.org/r/389791 [20:20:48] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:381193|Remove unused wmgUseWikibasePropertySuggester]] 
PT 2/2 (duration: 00m 50s) [20:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:07] (03CR) 10Addshore: [C: 032] Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:21:34] (03PS3) 10Krinkle: webperf: Make navtiming support nested parsed UA objects, as well as json strings [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:21:36] (03CR) 10Krinkle: [C: 031] webperf: Make navtiming support nested parsed UA objects, as well as json strings [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:22:08] (03CR) 10jerkins-bot: [V: 04-1] webperf: Make navtiming support nested parsed UA objects, as well as json strings [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:23:14] (03PS2) 10Addshore: Remove wmgWikibaseUseConfigFromWikidataBuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389690 (https://phabricator.wikimedia.org/T176948) [20:23:21] (03PS7) 10Addshore: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [20:23:24] (03PS2) 10Addshore: Remove Shared Cache settings from Wikibase-buildentry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389691 (https://phabricator.wikimedia.org/T176948) [20:23:43] (03Merged) 10jenkins-bot: Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:25:49] !log addshore@tin Synchronized wmf-config/Wikibase-buildentry.php: [[gerrit:381194|Add loading of wikibase extensions from build]] PT 1/3 (duration: 00m 49s) [20:25:51] 
(03PS6) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [20:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:59] (03PS1) 10BBlack: cache_misc: fix cookies+websockets [puppet] - 10https://gerrit.wikimedia.org/r/389794 (https://phabricator.wikimedia.org/T112765) [20:27:06] (03CR) 10jenkins-bot: Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:27:13] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:381194|Add loading of wikibase extensions from build]] PT 2/3 (duration: 00m 50s) [20:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:46] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:381194|Add loading of wikibase extensions from build]] PT 3/3 (duration: 00m 50s) [20:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:04] for some reason I saw a small spike related to that last patch (51 notices), shouldn't have happened given i synced each file individually.... [20:31:05] 51 Notice: Undefined variable: wmgWikibaseUseConfigFromWikidataBuild in /srv/mediawiki/wmf-config/Wikibase.php on line 4 [20:32:18] (03CR) 10BBlack: [C: 032] cache_misc: fix cookies+websockets [puppet] - 10https://gerrit.wikimedia.org/r/389794 (https://phabricator.wikimedia.org/T112765) (owner: 10BBlack) [20:33:15] no_justification: ^^ any idea why something like that would happen? 
[20:34:14] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:35:03] Eh, transient warnings like that...happen [20:35:08] :) [20:35:23] I've never worried about them, which is why I've always just sync'd the whole directory :p [20:35:37] what's a few warnings between friends [20:35:42] For some reason I always go file by file :) [20:37:47] (03PS7) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [20:38:25] 10Operations, 10Phabricator, 10Traffic, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3742647 (10mmodell) 05Open>03Resolved a:03BBlack YAY! it only took 2.16 years! [20:38:31] heh, I thought gate-submit for mediawiki-config was in the test-prio queue, apparently not... 
[20:38:41] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:40:00] (03PS1) 10BBlack: Aphlict: max_connections raise from 100 to 1K [puppet] - 10https://gerrit.wikimedia.org/r/389796 [20:40:03] (03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:40:17] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:40:29] (03CR) 10BBlack: [C: 032] Aphlict: max_connections raise from 100 to 1K [puppet] - 10https://gerrit.wikimedia.org/r/389796 (owner: 10BBlack) [20:40:41] (03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:41:42] (03CR) 10Addshore: [C: 032] wmgWikibaseUseConfigFromWikidataBuild flase for all of BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389684 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:43:40] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:43:56] (03PS4) 10Krinkle: webperf: Make navtiming support nested parsed UA objects as well [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:43:58] (03Merged) 10jenkins-bot: wmgWikibaseUseConfigFromWikidataBuild flase for all of BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389684 
(https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:44:09] (03CR) 10jenkins-bot: wmgWikibaseUseConfigFromWikidataBuild flase for all of BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389684 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:44:18] 10Operations, 10Phabricator, 10Traffic, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3742669 (10mmodell) @bblack spent a bunch of time debugging issues with websockets + varnish, so thanks a lot for your time and expertise, Bran... [20:45:49] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: T176948 [[gerrit:381195|#1]] [[gerrit:381199|#2]] [[gerrit:389684|#3]] Load wikibase build from mediawiki-config for BETA ONLY (duration: 00m 50s) [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:56] T176948: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948 [20:46:08] 20:45:39 Check 'Logstash Error rate for mw1277.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.02, After: 2.00, Threshold: 1.00) [20:46:24] not sure if there is something up with mw1277, but that has happened on 2 of the checks now [20:46:36] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389685 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:48:26] (03CR) 10Herron: "reverted this because pointing the hostcert to /var/lib/puppet/ssl/certs seems to have caused rhodium to break with error "nitrogen.eqiad." 
[puppet] - 10https://gerrit.wikimedia.org/r/386666 (https://phabricator.wikimedia.org/T179099) (owner: 10Herron) [20:48:44] (03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389685 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:48:57] !log puppet issue cleared after reverting 386666. restarting ircecho on einsteinium [20:48:57] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389685 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:48:59] (03PS5) 10Ottomata: webperf: Make navtiming support nested parsed UA objects as well [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) [20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:03] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Make navtiming support nested parsed UA objects as well [puppet] - 10https://gerrit.wikimedia.org/r/389713 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:51:13] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389685|Load wikibase build from mediawiki-config for wikidatawiki]] (duration: 00m 50s) [20:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:22] T176948: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948 [20:51:28] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389686 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:52:36] la la la la [20:52:47] (03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389686 (https://phabricator.wikimedia.org/T176948) (owner: 
10Addshore) [20:52:56] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389686 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:53:55] (03PS3) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [20:54:05] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for group0 & group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389687 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:54:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:54:34] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389686|Load wikibase build from mediawiki-config for hewiki]] (duration: 00m 49s) [20:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:53] (03PS1) 10Krinkle: grafana: Fix network graphs to not end in a weird drop to 0 [puppet] - 10https://gerrit.wikimedia.org/r/389798 [20:55:07] (03CR) 10Krinkle: "As always, preview at https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&orgId=1" [puppet] - 10https://gerrit.wikimedia.org/r/389798 (owner: 10Krinkle) [20:55:34] (03PS2) 10Krinkle: grafana: Fix network graphs to not end in a weird drop to 0 [puppet] - 10https://gerrit.wikimedia.org/r/389798 [20:55:57] (03PS4) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [20:56:22] (03CR) 10jerkins-bot: [V: 04-1] [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [20:57:50] 
(03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for group0 & group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389687 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:58:38] (03CR) 10Addshore: [C: 032] Load wikibase build from mediawiki-config for wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389688 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:58:45] (03CR) 10Addshore: [C: 032] wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389689 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [20:59:15] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389687|Load wikibase build from mediawiki-config for group0 & group1]] (duration: 00m 50s) [20:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:21] T176948: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948 [20:59:53] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [21:00:13] PROBLEM - Check systemd state on notebook1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:00:31] (03Merged) 10jenkins-bot: Load wikibase build from mediawiki-config for wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389688 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:00:32] (03PS1) 10BBlack: aphlict: add the CNAME in codfw, too [dns] - 10https://gerrit.wikimedia.org/r/389799 (https://phabricator.wikimedia.org/T112765) [21:00:34] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for group0 & group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389687 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:00:46] (03CR) 10BBlack: [C: 032] aphlict: add the CNAME in codfw, too [dns] - 10https://gerrit.wikimedia.org/r/389799 (https://phabricator.wikimedia.org/T112765) (owner: 10BBlack) [21:01:11] (03Merged) 10jenkins-bot: wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389689 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:01:41] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389688|Load wikibase build from mediawiki-config for wikidataclient]] (duration: 00m 50s) [21:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:53] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:02:55] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389689|wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD]] (duration: 00m 49s) [21:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:11] jouncebot: now [21:03:11] For the next 0 hour(s) and 56 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171107T2000) [21:03:17] (03CR) 10Addshore: [C: 032] Remove wmgWikibaseUseConfigFromWikidataBuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389690 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:04:08] (03CR) 10jenkins-bot: Load wikibase build from mediawiki-config for wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389688 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:04:10] (03CR) 10jenkins-bot: wmgWikibaseUseConfigFromWikidataBuild flase for all of PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389689 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:04:39] (03Merged) 10jenkins-bot: Remove wmgWikibaseUseConfigFromWikidataBuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389690 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:04:48] (03CR) 10jenkins-bot: Remove wmgWikibaseUseConfigFromWikidataBuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389690 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:06:27] !log addshore@tin Synchronized wmf-config/Wikibase.php: T176948 [[gerrit:389690|Remove wmgWikibaseUseConfigFromWikidataBuild]] PT 1/3 (duration: 00m 50s) [21:06:29] (03CR) 1020after4: [C: 031] "maybe some day in the distant future we want https internally, however, for now it's definitely not in any short-term plans" [puppet] - 10https://gerrit.wikimedia.org/r/389457 (owner: 10Dzahn) [21:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:33] 
T176948: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948 [21:07:12] (03CR) 10Ottomata: [C: 032] grafana: Fix network graphs to not end in a weird drop to 0 [puppet] - 10https://gerrit.wikimedia.org/r/389798 (owner: 10Krinkle) [21:07:34] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T176948 [[gerrit:389690|Remove wmgWikibaseUseConfigFromWikidataBuild]] PT 2/3 (duration: 00m 52s) [21:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:08] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: T176948 [[gerrit:389690|Remove wmgWikibaseUseConfigFromWikidataBuild]] PT 3/3 (LABS) (duration: 00m 50s) [21:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:14] (03CR) 10Addshore: [C: 032] Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [21:09:28] 2 to go [21:09:53] no_justification ^^ [21:10:28] Whee [21:10:29] aude: I guess once this is all done we can remove the config stuff from the build-resources! 
[21:13:07] (03Merged) 10jenkins-bot: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [21:13:33] (03CR) 10jenkins-bot: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) (owner: 10Aude) [21:14:54] (03PS1) 10BBlack: remove phabricator-new hostname [dns] - 10https://gerrit.wikimedia.org/r/389803 (https://phabricator.wikimedia.org/T137928) [21:15:07] (03PS1) 10BBlack: phab: dc failover behind the primary public name [puppet] - 10https://gerrit.wikimedia.org/r/389804 (https://phabricator.wikimedia.org/T137928) [21:16:07] !log addshore@tin Synchronized wmf-config/Wikibase.php: T176948 [[gerrit:381371|Stop using wgWikibaseSharedCacheKeyPrefix from Wikidata build]] (duration: 00m 49s) [21:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:14] T176948: Move config & loading logic out of Wikidata build and into mediawiki-config - https://phabricator.wikimedia.org/T176948 [21:16:30] (03CR) 10Addshore: [C: 032] Remove Shared Cache settings from Wikibase-buildentry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389691 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:17:35] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3742811 (10RobH) I've been in discussion with Renny@Dell support. We lowered the CPU count from all to just 2 per CPU. The error still happened during the OS install and boot just no... 
[21:17:42] (03PS5) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [21:18:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [21:18:55] (03Merged) 10jenkins-bot: Remove Shared Cache settings from Wikibase-buildentry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389691 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:19:04] (03CR) 10jenkins-bot: Remove Shared Cache settings from Wikibase-buildentry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389691 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [21:19:15] (03PS2) 10BBlack: phab: dc failover behind the primary public name [puppet] - 10https://gerrit.wikimedia.org/r/389804 (https://phabricator.wikimedia.org/T137928) [21:20:05] (03CR) 1020after4: [C: 031] phab: dc failover behind the primary public name [puppet] - 10https://gerrit.wikimedia.org/r/389804 (https://phabricator.wikimedia.org/T137928) (owner: 10BBlack) [21:20:19] (03CR) 1020after4: [C: 031] remove phabricator-new hostname [dns] - 10https://gerrit.wikimedia.org/r/389803 (https://phabricator.wikimedia.org/T137928) (owner: 10BBlack) [21:20:28] !log addshore@tin Synchronized wmf-config/Wikibase-buildentry.php: T176948 [[gerrit:389691|Remove Shared Cache settings from Wikibase-buildentry]] (duration: 00m 50s) [21:20:31] (03CR) 10BBlack: [C: 032] phab: dc failover behind the primary public name [puppet] - 10https://gerrit.wikimedia.org/r/389804 (https://phabricator.wikimedia.org/T137928) (owner: 10BBlack) [21:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:38] no_justification: thats them all! 
[21:21:03] (03CR) 1020after4: [C: 031] phabricator: limit http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389459 (owner: 10Dzahn) [21:21:25] (03CR) 1020after4: "nicely done sir." [puppet] - 10https://gerrit.wikimedia.org/r/389794 (https://phabricator.wikimedia.org/T112765) (owner: 10BBlack) [21:21:42] (03CR) 1020after4: "works!" [puppet] - 10https://gerrit.wikimedia.org/r/389783 (owner: 1020after4) [21:23:12] (03CR) 10BBlack: [C: 032] remove phabricator-new hostname [dns] - 10https://gerrit.wikimedia.org/r/389803 (https://phabricator.wikimedia.org/T137928) (owner: 10BBlack) [21:24:11] 10Operations, 10Release-Engineering-Team: Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3742821 (10thcipriani) [21:25:49] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3742833 (10thcipriani) adding @akosiaris and @Joe since they have the most background on Blubber: could one of you upload a new version of blubber to the apt repository/... [21:33:13] Do we periodically get rid of the CI stack traces? I get a 404 on https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/23406/console Just wanted to confirm. [21:33:59] Niharika: i believe so [21:36:04] Niharika: Hi, yep, 15 days i think for some tests [21:36:14] RECOVERY - Check systemd state on notebook1001 is OK: OK - running: The system is fully operational [21:36:17] Aha, thanks! [21:50:26] notebook1001? 
[21:54:24] (03PS1) 10Ottomata: Add exception guard for json parsing in eventlogging mysql filter [puppet] - 10https://gerrit.wikimedia.org/r/389861 (https://phabricator.wikimedia.org/T179625) [21:55:57] (03CR) 10EBernhardson: Deploy MjoLniR with new deploy repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [21:56:09] (03PS3) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [21:56:41] (03CR) 10jerkins-bot: [V: 04-1] Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [21:59:26] (03PS4) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [21:59:58] (03CR) 10jerkins-bot: [V: 04-1] Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [22:02:49] (03PS5) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [22:03:41] (03PS1) 10BBlack: git-ssh.wm.o: reduce to 10m TTL for failover [dns] - 10https://gerrit.wikimedia.org/r/389869 (https://phabricator.wikimedia.org/T164810) [22:12:43] (03PS1) 10BBlack: LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/389871 (https://phabricator.wikimedia.org/T164810) [22:14:32] (03CR) 1020after4: [C: 031] "This is objectively a good thing :)" [dns] - 10https://gerrit.wikimedia.org/r/389869 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [22:15:36] 10Operations, 10Phabricator, 10RelEng-Archive-FY201718-Q1, 10Patch-For-Review: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3742971 (10mmodell) [22:17:19] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3742977 (10mmodell) [22:17:24] 
(03CR) 10Paladox: [C: 031] LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/389871 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [22:17:46] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) https://gerrit.wikimedia.org/r/#/c/389871/ [22:18:14] (03CR) 10Zppix: [C: 031] LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/389871 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [22:20:01] (03PS6) 10Ottomata: [WIP] EventLogging analytics capsule discrepency fixes [puppet] - 10https://gerrit.wikimedia.org/r/389722 (https://phabricator.wikimedia.org/T179625) [22:39:12] !log Reset a global account's email, per T179950 [22:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:36] (03PS1) 1020after4: Narrow the range for this ban as it is affecting users who are not on a zero rated host. [puppet] - 10https://gerrit.wikimedia.org/r/389888 [23:14:03] (03CR) 10jerkins-bot: [V: 04-1] Narrow the range for this ban as it is affecting users who are not on a zero rated host. [puppet] - 10https://gerrit.wikimedia.org/r/389888 (owner: 1020after4) [23:14:54] PROBLEM - cassandra-a SSL 10.192.16.165:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:15:13] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.165 and port 9042: Connection refused [23:16:16] (03PS2) 10Greg Grossmeier: Narrow the range for this ban as it is affecting users who are not on a zero rated host. 
[puppet] - 10https://gerrit.wikimedia.org/r/389888 (https://phabricator.wikimedia.org/T168142) (owner: 1020after4) [23:16:40] (03PS6) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [23:16:42] (03PS1) 10EBernhardson: Support .whl in archiva git-fat link [puppet] - 10https://gerrit.wikimedia.org/r/389889 [23:16:44] (03CR) 10jerkins-bot: [V: 04-1] Narrow the range for this ban as it is affecting users who are not on a zero rated host. [puppet] - 10https://gerrit.wikimedia.org/r/389888 (https://phabricator.wikimedia.org/T168142) (owner: 1020after4) [23:17:48] (03PS3) 1020after4: Narrow the range for this ban [puppet] - 10https://gerrit.wikimedia.org/r/389888 [23:24:28] cassandra fails expected ^ [23:26:21] (03CR) 10Paladox: [C: 031] Narrow the range for this ban [puppet] - 10https://gerrit.wikimedia.org/r/389888 (owner: 1020after4) [23:26:46] (03CR) 10Paladox: [C: 031] Narrow the range for this ban (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389888 (owner: 1020after4) [23:30:36] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3743155 (10RobH) Dell seems to agree, they are dispatching a replacement mainboard. I'll swap and we'll see what happens. 
[23:30:58] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3743157 (10greg) [23:32:45] (03PS4) 1020after4: Narrow the range for this ban [puppet] - 10https://gerrit.wikimedia.org/r/389888 [23:33:18] (03CR) 10Paladox: [C: 031] Narrow the range for this ban [puppet] - 10https://gerrit.wikimedia.org/r/389888 (owner: 1020after4) [23:33:22] (03PS5) 1020after4: Remove unjustified overbroad /8 network from blocklist [puppet] - 10https://gerrit.wikimedia.org/r/389888 [23:35:18] any opsen willing to merge ^ so that we can stop blocking pnorman from phabricator?: [23:35:40] the /8 is a ridiculously broad ban and it affects a lot of presumably innocent networks [23:36:05] (03PS1) 10Ayounsi: Add make_wheels.sh [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389890 [23:36:07] (03PS1) 10Ayounsi: Add fixed requirements for netbox 2.2.4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389891 [23:36:29] (03CR) 10Ayounsi: [V: 032 C: 032] Add make_wheels.sh [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389890 (owner: 10Ayounsi) [23:36:50] (03CR) 10Ayounsi: [V: 032 C: 032] Add fixed requirements for netbox 2.2.4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389891 (owner: 10Ayounsi) [23:42:10] (03PS1) 10Ayounsi: Update submodules to netbox 2.2.4 and matching wheels [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389895 [23:42:31] (03CR) 10Ayounsi: [V: 032 C: 032] Update submodules to netbox 2.2.4 and matching wheels [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/389895 (owner: 10Ayounsi) [23:44:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3743181 (10bd808) [23:55:55] (03CR) 10Pnorman: [C: 031] Remove unjustified overbroad /8 network from blocklist [puppet] - 10https://gerrit.wikimedia.org/r/389888 (owner: 
1020after4) [23:57:56] !log Decommissioning restbase2001-b.codfw.wmnet (T179422) [23:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:04] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422