[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T0000). [00:01:29] (03CR) 10Dzahn: [C: 031] Labs: Remove obsolete code [puppet] - 10https://gerrit.wikimedia.org/r/326312 (owner: 10Tim Landscheidt) [00:01:55] (03CR) 10Dzahn: [C: 031] Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt) [00:04:51] (03PS3) 10Dzahn: Move wikimedia-logo.svg to role module [puppet] - 10https://gerrit.wikimedia.org/r/325729 (owner: 10Tim Landscheidt) [00:04:57] (03CR) 10Dzahn: [C: 032] Move wikimedia-logo.svg to role module [puppet] - 10https://gerrit.wikimedia.org/r/325729 (owner: 10Tim Landscheidt) [00:05:47] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2871318 (10RobH) [00:06:39] bblack: hi! after talking with others in -mobile and elsewhere, we'd like to revive the Varnish solution we'd talked about, if that's OK [00:06:42] (03PS2) 10Andrew Bogott: Add comment headers to check_keystone icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/326991 [00:06:47] Sorry for the back-and-forth! [00:07:04] The task is now public https://phabricator.wikimedia.org/T152602 [00:07:41] In brief, it's not totally clear disabling JS is the right approach for this proxy, and also some in-banner JS didn't do the job, so a server-side solution (where we have indeed verified the UA) seems more certain [00:07:59] I have to afk for about 2.5 hours, but I should get scrollback! thx again for ur help! 
[00:08:15] (03CR) 10Andrew Bogott: [C: 032] Add comment headers to check_keystone icinga plugins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326991 (owner: 10Andrew Bogott) [00:09:09] (03PS2) 10Andrew Bogott: Keystone: Publish credentials for novaobserver account [labs/private] - 10https://gerrit.wikimedia.org/r/326979 (https://phabricator.wikimedia.org/T150092) [00:10:17] (03PS2) 10Dzahn: move install_console from global /files to modules/role/files/ [puppet] - 10https://gerrit.wikimedia.org/r/325460 [00:12:52] !log cobalt (gerrit) re-enabled puppet [00:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:05] (03CR) 10Andrew Bogott: [V: 032 C: 032] Keystone: Publish credentials for novaobserver account [labs/private] - 10https://gerrit.wikimedia.org/r/326979 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [00:18:12] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.6/extensions/GlobalBlocking/includes/GlobalBlocking.class.php: fix T153153 (duration: 01m 00s) [00:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:24] T153153: Warning: method_exists() expects exactly 2 parameters, 1 given in GlobalBlocking.class.php - https://phabricator.wikimedia.org/T153153 [00:20:56] lovely, fix one thing, now it's another: Fatal error: Invalid static property access: ApiBase::messageMap in /srv/mediawiki/php-1.29.0-wmf.6/extensions/GlobalBlocking/includes/GlobalBlocking.class.php on line 90 [00:21:39] !log gerrit restarting to apply fix for diffusion links (T153130) [00:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:51] T153130: Some "(Diffusion)" links missing or broken after upgrade - https://phabricator.wikimedia.org/T153130 [00:22:35] done [00:22:38] paladox: ^ [00:22:39] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - 
https://phabricator.wikimedia.org/T152791#2871415 (10fgiunchedi) I tried repeating a panel per-host (load average) here https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown . Though it i... [00:23:02] mutante thanks, that fixed it [00:23:06] :) [00:23:48] did it? i couldnt confirm just yet [00:24:13] Yep [00:24:14] it worked [00:24:24] https://gerrit.wikimedia.org/r/#/c/326163/4/modules/gerrit/templates/gerrit.config.erb [00:24:31] click on the diffusion links ^^ [00:24:50] ok, i can confirm now [00:24:58] the other one was cached [00:25:31] :) [00:27:01] (03CR) 10Dzahn: [C: 032] move install_console from global /files to modules/role/files/ [puppet] - 10https://gerrit.wikimedia.org/r/325460 (owner: 10Dzahn) [00:27:41] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [00:37:21] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:06:21] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:08:39] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2871514 (10RobH) [01:10:51] RECOVERY - MegaRAID on ms1001 is OK: OK: optimal, 4 logical, 48 physical [01:15:28] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2871517 (10RobH) Please note that restbase-test1001 has an issue detecting one of the disks, but the other two are ready for u... 
[01:15:44] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2871518 (10RobH) 05stalled>03Resolved [01:19:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 798.82 seconds [01:25:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 180.94 seconds [01:32:11] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1632 MB (3% inode=91%) [01:41:22] (03PS1) 10Dzahn: lists/exim: move files from /files to role module [puppet] - 10https://gerrit.wikimedia.org/r/327138 [01:42:21] (03CR) 10Dzahn: [] "not 100% sure yet about "modules/role/files/exim" vs "modules/role/files/lists" etc" [puppet] - 10https://gerrit.wikimedia.org/r/327138 (owner: 10Dzahn) [01:46:24] (03PS2) 10Dzahn: Remove obsolete file misc/geoiplogtag [puppet] - 10https://gerrit.wikimedia.org/r/325583 (owner: 10Tim Landscheidt) [01:48:33] (03CR) 10Dzahn: [C: 032] Remove obsolete file misc/geoiplogtag [puppet] - 10https://gerrit.wikimedia.org/r/325583 (owner: 10Tim Landscheidt) [01:49:05] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2871555 (10Papaul) [01:54:21] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [01:56:21] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [01:59:37] PROBLEM - configured eth on ms-fe2007 is CRITICAL: Return code of 255 is out of bounds [02:00:38] RECOVERY - configured eth on ms-fe2007 is OK: OK - interfaces up [02:02:31] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2871585 (10Papaul) [02:02:37] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.001 second response time on 10.64.0.32 port 9042 [02:03:42] 06Operations, 10ops-codfw: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#2854279 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi you good to take over. [02:05:09] 06Operations, 10ops-codfw, 06DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2871591 (10Papaul) p:05High>03Low [02:41:37] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:37] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:37] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:37] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:47] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:47] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:57] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:58] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:42:07] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:07] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:07] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:17] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:17] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:18] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:18] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:18] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:27] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:27] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:28] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:28] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:42:37] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:37] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [02:42:37] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:37] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [02:42:38] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [02:42:38] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:47] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:47] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [02:42:57] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:57] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:42:57] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [02:43:07] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:43:07] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [02:43:07] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [02:43:08] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:43:08] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [02:43:17] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [02:43:17] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: 
intentional) [02:43:17] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [02:43:17] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:23:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 700.98 seconds [03:27:28] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 293.91 seconds [03:28:17] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:52:17] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:57:17] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [04:09:17] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:11:17] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:20:07] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:21:17] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:38:17] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [04:39:37] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:40:17] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:48:27] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:49:07] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [04:55:17] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [05:07:37] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:17:27] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [05:24:17] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:28:27] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [05:40:06] (03PS3) 10Dzahn: Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [05:41:33] (03PS4) 10Dzahn: mediawiki: Add cron job for PageAssessments maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [05:44:17] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.362 second response time [05:47:15] (03CR) 10Dzahn: [C: 031] Tools: Install php5-readline [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [05:47:17] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [05:49:27] PROBLEM - mailman I/O stats on 
fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2245.30 Read Requests/Sec=4360.70 Write Requests/Sec=9.20 KBytes Read/Sec=19531.60 KBytes_Written/Sec=2969.60 [05:52:17] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [05:59:27] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=142.20 Read Requests/Sec=175.30 Write Requests/Sec=0.90 KBytes Read/Sec=1484.40 KBytes_Written/Sec=250.80 [06:02:37] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:03:17] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:04:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 249 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:09:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 249 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:11:52] (03PS5) 10Dzahn: Add dewiktionary to RESTBase on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: 10Mattflaschen) [06:14:17] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.052 second response time [06:14:43] (03CR) 10Dzahn: [C: 032] Add dewiktionary to RESTBase on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: 10Mattflaschen) [06:17:17] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [06:17:56] (03CR) 10Dzahn: [C: 031] Tools proxy: Restrict to labs networks [puppet] - 
10https://gerrit.wikimedia.org/r/321371 (owner: 10Muehlenhoff) [06:18:49] (03CR) 10Dzahn: [C: 031] role::jsbench: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320547 (owner: 10Muehlenhoff) [06:20:41] (03CR) 10Dzahn: [C: 031] ssh_pybal: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320556 (owner: 10Muehlenhoff) [06:21:43] (03PS3) 10Dzahn: hhvm::admin: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304476 (owner: 10Muehlenhoff) [06:21:55] (03CR) 10Dzahn: [C: 031] hhvm::admin: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304476 (owner: 10Muehlenhoff) [06:25:39] (03CR) 10Dzahn: [C: 031] toollabs: install korean locales [puppet] - 10https://gerrit.wikimedia.org/r/308439 (https://phabricator.wikimedia.org/T130532) (owner: 10Merlijn van Deen) [06:27:16] (03PS2) 10Dzahn: toolserver: Redirect ~mzmcbride/yanker/ to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/308435 (https://phabricator.wikimedia.org/T136924) (owner: 10Merlijn van Deen) [06:28:20] (03CR) 10Dzahn: [C: 032] toolserver: Redirect ~mzmcbride/yanker/ to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/308435 (https://phabricator.wikimedia.org/T136924) (owner: 10Merlijn van Deen) [06:28:27] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:29:11] (03PS2) 10Dzahn: toollabs: install korean locales [puppet] - 10https://gerrit.wikimedia.org/r/308439 (https://phabricator.wikimedia.org/T130532) (owner: 10Merlijn van Deen) [06:29:47] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:30:38] (03CR) 10Dzahn: [C: 032] toollabs: install korean locales [puppet] - 10https://gerrit.wikimedia.org/r/308439 (https://phabricator.wikimedia.org/T130532) (owner: 10Merlijn van Deen) [06:32:17] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:32:37] PROBLEM - 
puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:35:15] !log Stop MySQL db2048 for maintenance - T149553 [06:35:27] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tree] [06:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:28] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [06:37:58] (03CR) 10Dzahn: [C: 031] "https://packages.debian.org/search?keywords=opencv-data&searchon=names&suite=stable&section=all" [puppet] - 10https://gerrit.wikimedia.org/r/303416 (https://phabricator.wikimedia.org/T142321) (owner: 10Merlijn van Deen) [06:38:47] (03PS1) 10Marostegui: db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327145 (https://phabricator.wikimedia.org/T151552) [06:40:13] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327145 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [06:40:18] (03CR) 10jenkins-bot: [] db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327145 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [06:40:48] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327145 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [06:41:03] (03CR) 10Dzahn: [C: 031] "http://pywikibot.org/" [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [06:43:07] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2068 - T151552 (duration: 01m 48s) [06:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:21] T151552: Import S2,S6,S7,m3 and x1 to 
dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [06:44:06] !log Stop replication db2068 for maintenance - T151552 [06:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:37] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:03:27] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:09:17] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:23:20] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2871789 (10Marostegui) >>! In T152761#2869535, @kaldari wrote: > @Marostegui: Apparently, it's taking a long time to complete the... [07:35:27] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [07:37:17] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:44:37] PROBLEM - puppet last run on copper is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:27] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [07:54:47] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:02:17] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:08:07] RECOVERY - Disk space on ruthenium is OK: DISK OK [08:08:13] <_joe_> !log removing parsoids main.log from ruthenium, 39 GB occupied [08:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:17] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.052 second response time [08:10:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [08:11:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4491783 keys, up 43 days 23 hours - replication_delay is 0 [08:17:23] <_joe_> !log restbase not starting on restbase1016 due to a failed deploy; masked the systemd unit and depooled from conftool [08:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:17] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:20:47] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:22:47] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [08:33:49] !log gehel@tin Starting deploy [kartotherian/deploy@3bd1692]: (no message) [08:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:26] !log gehel@tin Finished deploy [kartotherian/deploy@3bd1692]: (no message) (duration: 02m 37s) [08:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:13] !log restarting ferm on elastic2020 [08:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:47] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2020 is OK: OK ferm input default policy is set [08:38:31] (03CR) 10Paladox: [C: 031] Tools: Install php5-readline [puppet] - 10https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519) (owner: 10Tim Landscheidt) [08:39:17] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[restbase] [08:48:35] !log kill stuck notification script on maps-test2001 - T145534 [08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:46] T145534: maps - tilerator notification seems stuck on sorting files - https://phabricator.wikimedia.org/T145534 [08:48:47] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:53:17] (03CR) 10ArielGlenn: [C: 031] "Agree, let's just toss it." 
[dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [09:00:28] !log restarted ntp on elastic2020, was with unknown offset on Icinga since ~24h [09:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:47] RECOVERY - NTP on elastic2020 is OK: NTP OK: Offset -0.0001921355724 secs [09:43:38] (03PS1) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [09:44:23] (03CR) 10jenkins-bot: [V: 04-1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [09:50:37] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:47] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:47] PROBLEM - dhclient process on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:47] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:50:47] PROBLEM - ores uWSGI web app on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:48] PROBLEM - Check systemd state on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:48] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:57] PROBLEM - salt-minion processes on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:57] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:07] PROBLEM - ores on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:08] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:08] PROBLEM - Disk space on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:51:17] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:17] PROBLEM - Check whether ferm is active by checking the default input chain on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:18] PROBLEM - configured eth on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:27] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:27] PROBLEM - puppet last run on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:27] PROBLEM - Check size of conntrack table on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:28] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:28] PROBLEM - MD RAID on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:28] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:37] PROBLEM - DPKG on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:51:57] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:52:48] <_joe_> looking into it [09:53:07] _joe_: can ping, cannot ssh so far, check mgmt I would say ;) [09:53:40] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: name=scb1001.eqiad.wmnet [09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] <_joe_> volans: on it [09:55:00] <_joe_> I can see console, going to take my time to understand what's up [09:55:17] <_joe_> Out of memory: Kill process 8347 (electron) score 300 or sacrifice child [09:55:20] <_joe_> OOM [09:56:37] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5632 bytes in 0.002 second response time [09:56:37] RECOVERY - dhclient process on scb1001 is OK: PROCS OK: 0 processes with command name dhclient [09:56:37] RECOVERY - ores uWSGI web app on scb1001 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [09:56:37] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [09:56:47] RECOVERY - salt-minion processes on scb1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:56:47] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [09:56:47] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:56:57] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 2822 bytes in 0.001 second response time [09:56:57] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.002 second response time [09:56:58] RECOVERY - Disk space on scb1001 is OK: DISK OK [09:57:07] RECOVERY - Check whether ferm is active by checking the default input chain on scb1001 is OK: OK ferm input default policy is set [09:57:07] RECOVERY - configured eth on scb1001 is OK: OK - interfaces up [09:57:17] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [09:57:17] 
RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [09:57:17] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [09:57:17] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [09:57:17] RECOVERY - MD RAID on scb1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:57:17] RECOVERY - Check size of conntrack table on scb1001 is OK: OK: nf_conntrack is 4 % full [09:57:27] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [09:57:27] RECOVERY - DPKG on scb1001 is OK: All packages OK [09:57:37] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [09:57:37] RECOVERY - mathoid endpoints health on scb1001 is OK: All endpoints are healthy [09:57:47] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.031 second response time [09:58:31] <_joe_> ok no traffic means it is now recovering [09:58:37] !log gehel@tin Starting deploy [kartotherian/deploy@abc731d]: (no message) [09:58:43] <_joe_> but pdfrender was using an awful lot of memory [09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:11] <_joe_> also, we're mixing so many things there [09:59:24] !log gehel@tin Finished deploy [kartotherian/deploy@abc731d]: (no message) (duration: 00m 47s) [09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:47] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=scb1001.eqiad.wmnet [10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:31] (03PS3) 10Jcrespo: mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) [10:15:57] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 
seconds. [10:16:47] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [10:20:24] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [10:20:30] (03CR) 10jenkins-bot: [] mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [10:21:08] (03Merged) 10jenkins-bot: mariadb: Depool db1089 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326980 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [10:21:27] (03PS5) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [10:21:47] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::webserver: add hack to allow installing nginx [puppet] - 10https://gerrit.wikimedia.org/r/327164 (https://phabricator.wikimedia.org/T153042) [10:22:03] <_joe_> elukey: ^^ can I ask you for a review? [10:23:58] sure! [10:24:03] !log restbase deploying 0c06fb7 on restbase1016 [10:24:04] <_joe_> it's ugly [10:24:08] <_joe_> I know :P [10:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:09] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2020.codfw.wmnet [10:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 (duration: 00m 47s) [10:27:09] _joe_ so nginx will not override the nginx.conf that you put right ? 
[10:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:18] nginx-full I mean [10:27:53] (I am trying to understand the hack :P) [10:27:59] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2872099 (Gehel) elastic2020 is now repooled. Traffic is still flowing to codfw, but no large shards are allocated on elastic2020 at the momen... [10:28:05] <_joe_> elukey: no, because puppet explicitly tells apt to keep old config files [10:28:17] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [10:28:22] oh ok didn't know it [10:29:07] RECOVERY - Restbase root url on restbase1016 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.011 second response time [10:29:08] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [10:29:27] PROBLEM - Check systemd state on mw2099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
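As background to the 10:28:05 answer: keeping an already-installed config file during a package install is commonly done by passing dpkg's `--force-confold` option through apt. The following is only a minimal sketch of that idea, not the Puppet code from change 327164; the helper name `apt_install_keep_config` is hypothetical, and `nginx-full` is simply the package mentioned in the discussion.

```python
import shlex


def apt_install_keep_config(package):
    """Build (but do not run) an apt-get command that keeps existing config files.

    Hypothetical helper, for illustration only.
    """
    return [
        "apt-get", "install", "-y",
        # --force-confold: keep the currently installed config file
        # instead of replacing it with the package maintainer's version.
        "-o", "Dpkg::Options::=--force-confold",
        package,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before anyone runs it.
    print(shlex.join(apt_install_keep_config("nginx-full")))
```

The command is built as a list and only printed, so the sketch has no side effects on the host.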
[10:29:58] <_joe_> mw2099 is me [10:30:27] RECOVERY - Check systemd state on mw2099 is OK: OK - running: The system is fully operational [10:33:37] (03CR) 10Elukey: [C: 031] "Given the fact that the only suggested fix from the Debian maintainers is to change the nginx config, I think that this change "does its j" [puppet] - 10https://gerrit.wikimedia.org/r/327164 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [10:37:17] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:39:34] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: add hack to allow installing nginx [puppet] - 10https://gerrit.wikimedia.org/r/327164 (https://phabricator.wikimedia.org/T153042) [10:40:47] ah nice port 443 [10:40:53] good :) [10:43:11] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mediawiki::webserver: add hack to allow installing nginx [puppet] - 10https://gerrit.wikimedia.org/r/327164 (https://phabricator.wikimedia.org/T153042) (owner: 10Giuseppe Lavagetto) [10:52:17] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:52:49] (03PS1) 10Giuseppe Lavagetto: mediawiki: add TLS to all canaries in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327169 [10:53:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: add TLS to all canaries in codfw [puppet] - 10https://gerrit.wikimedia.org/r/327169 (owner: 10Giuseppe Lavagetto) [10:54:32] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [10:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:53] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 21s) [10:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] (03CR) 10Alexandros Kosiaris: "Yes. 
given that we are at ununpentium, since a few days ago officially named Moscovium, the element with number 115 out of a total 118, re" [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) (owner: 10Alexandros Kosiaris) [10:58:53] !log alter table on db1089 - enwiki T69223 [10:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:08] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [10:59:32] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1089 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327171 [10:59:47] (03CR) 10Jcrespo: [C: 04-2] "Wait for schema change to finish." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327171 (owner: 10Jcrespo) [11:03:25] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2872177 (10elukey) a:05elukey>03None [11:04:07] (03PS1) 10Jcrespo: mariadb: Depool db1092 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327172 (https://phabricator.wikimedia.org/T69223) [11:04:20] 06Operations, 10ops-eqiad, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2872181 (10elukey) [11:04:30] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1092 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327172 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [11:04:35] (03CR) 10jenkins-bot: [] mariadb: Depool db1092 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327172 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [11:05:10] (03Merged) 10jenkins-bot: mariadb: Depool db1092 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327172 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [11:05:42] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 
13Patch-For-Review, 15User-Elukey: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2872188 (10elukey) a:05elukey>03None [11:05:55] 06Operations, 10Analytics, 15User-Elukey: kafkatee's logrotate/syslog default pkg files needs to be removed - https://phabricator.wikimedia.org/T145490#2872191 (10elukey) a:05elukey>03None [11:06:05] 06Operations, 15User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2872193 (10elukey) a:05elukey>03None [11:06:11] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [11:06:24] 06Operations, 06Performance-Team, 15User-Elukey: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2872195 (10elukey) [11:07:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 48s) [11:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:18] (03CR) 10Alexandros Kosiaris: [C: 032] ssh_pybal: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320556 (owner: 10Muehlenhoff) [11:08:23] (03PS2) 10Alexandros Kosiaris: ssh_pybal: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320556 (owner: 10Muehlenhoff) [11:08:34] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ssh_pybal: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320556 (owner: 10Muehlenhoff) [11:08:55] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [11:08:59] (03PS1) 10EBernhardson: Add libgomp1 to hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/327173 [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:12] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 17s) [11:09:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:13] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:22:12] !log alter table on db1092 - wikidatawiki T69223 [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [11:22:57] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1089 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327171 (owner: 10Jcrespo) [11:23:03] (03CR) 10jenkins-bot: [] Revert "mariadb: Depool db1089 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327171 (owner: 10Jcrespo) [11:23:05] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1089 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327171 [11:24:27] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327176 [11:24:42] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:29:26] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 (duration: 00m 40s) [11:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:30] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [11:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:52] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 22s) [11:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:42] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:52:42] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:57:43] (PS2) Jcrespo: Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] - https://gerrit.wikimedia.org/r/327176 [11:57:58] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] - https://gerrit.wikimedia.org/r/327176 (owner: Jcrespo) [11:58:04] (CR) jenkins-bot: [] Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] - https://gerrit.wikimedia.org/r/327176 (owner: Jcrespo) [11:58:47] (Merged) jenkins-bot: Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] - https://gerrit.wikimedia.org/r/327176 (owner: Jcrespo) [12:01:19] (PS2) Alexandros Kosiaris: Rework network::subnets [puppet] - https://gerrit.wikimedia.org/r/313650 [12:02:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 (duration: 00m 44s) [12:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:32] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:42] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:15:13] Jenkins should hopefully no longer comment like (CR) jenkins-bot: [] Revert "mariadb: Depool db1092 for schema change" [mediawiki-config] (that was spammy and was a bug). 
[12:15:22] fixed in https://gerrit.wikimedia.org/r/#/c/327152/ [12:20:47] (PS4) Niharika29: Deploy scholarships with scap3 [puppet] - https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [12:37:32] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:40:50] (PS1) Alexandros Kosiaris: role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 [12:41:40] (CR) jenkins-bot: [V: -1] role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 (owner: Alexandros Kosiaris) [12:47:57] (PS2) Alexandros Kosiaris: role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 [12:48:57] (CR) jenkins-bot: [V: -1] role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 (owner: Alexandros Kosiaris) [12:59:14] (PS3) Alexandros Kosiaris: role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 [13:03:56] (CR) Niharika29: [] "> One meta point: /srv/deployment/wikimedia is a fine parent directory to put this in; however, you may end up, at some point conflicting " [puppet] - https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: Niharika29) [13:04:13] paladox: ^ [13:04:23] What's causing the [] ? [13:04:33] Niharika hi, that would be v: 0 c: 0 [13:04:36] (PS4) Alexandros Kosiaris: role::postgres::master: Hieraize node scope variable [puppet] - https://gerrit.wikimedia.org/r/327178 [13:04:49] which we block, since by default gerrit 2.13 shows all code values [13:04:58] I see. [13:05:16] which was annoying to some users as you would see c: 0 and v: 0 in red (bright red). [13:05:43] Yeah, that'd be annoying for sure. [13:13:30] yep. 
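The empty `[]` discussed above can be illustrated with a small sketch. It assumes, as the conversation suggests, that zero-valued labels (v: 0, c: 0) are filtered out before the bracketed label list is printed. This is not the bot's actual code (the real fix is in the Gerrit change linked at 12:15:22); the function name `format_labels` and the dict-based input are hypothetical.

```python
def format_labels(labels):
    """Render review labels such as {"V": 2, "C": -1} as "[V: 2 C: -1]".

    Zero-valued labels are dropped. If nothing is left, return an empty
    string rather than a bare "[]".
    """
    shown = [f"{name}: {value}" for name, value in labels.items() if value != 0]
    return f"[{' '.join(shown)}]" if shown else ""


if __name__ == "__main__":
    print(repr(format_labels({"V": 0, "C": 0})))
    print(repr(format_labels({"V": -1, "C": 0})))
```

Returning an empty string for the all-zero case is one way to avoid the stray brackets; a caller would then skip the label segment of the message entirely.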
[13:21:54] (03CR) 10MZMcBride: "Neat, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/308435 (https://phabricator.wikimedia.org/T136924) (owner: 10Merlijn van Deen) [13:25:01] !log Deploy alter table to remove partitions on metawiki.pagelinks - db2068 - T153194 [13:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:15] T153194: Remove partitions from metawiki.pagelinks - https://phabricator.wikimedia.org/T153194 [13:26:40] (03CR) 10Alexandros Kosiaris: [C: 032] role::postgres::master: Hieraize node scope variable [puppet] - 10https://gerrit.wikimedia.org/r/327178 (owner: 10Alexandros Kosiaris) [13:28:35] (03Abandoned) 10Elukey: Add upstream source files [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/327020 (owner: 10Elukey) [13:29:29] (03PS1) 10Alexandros Kosiaris: labsdb1005: Remove comments and postgres role [puppet] - 10https://gerrit.wikimedia.org/r/327180 [13:29:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] labsdb1005: Remove comments and postgres role [puppet] - 10https://gerrit.wikimedia.org/r/327180 (owner: 10Alexandros Kosiaris) [13:30:09] is phabricator unresponsive for anyone else? [13:31:14] hm, it's back [13:31:16] strange [13:32:46] (03PS1) 10Alexandros Kosiaris: if guard the slave in role::postgres::master [puppet] - 10https://gerrit.wikimedia.org/r/327181 [13:33:33] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Augeas[hba_create-replication@-v4] [13:35:11] (03PS2) 10Alexandros Kosiaris: if guard the slave in role::postgres::master [puppet] - 10https://gerrit.wikimedia.org/r/327181 [13:39:22] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:44:18] (03PS3) 10Alexandros Kosiaris: if guard the slave in role::postgres::master [puppet] - 10https://gerrit.wikimedia.org/r/327181 [13:52:35] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [13:52:40] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 04s) [13:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] (03PS4) 10Alexandros Kosiaris: if guard the slave in role::postgres::master [puppet] - 10https://gerrit.wikimedia.org/r/327181 [13:52:50] not true ^ [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:57] scap3 bug [13:53:23] hey what's this catalogue fetch fail on rb1015? [13:53:36] (03PS1) 10Jcrespo: mariadb: Move backup templates to subdir on the right position [puppet] - 10https://gerrit.wikimedia.org/r/327186 [13:57:25] (03PS1) 10Jcrespo: mariadb: Fix quering phabricator databases by querying m3 slave [puppet] - 10https://gerrit.wikimedia.org/r/327187 (https://phabricator.wikimedia.org/T151999) [13:58:36] (03CR) 10Alexandros Kosiaris: [C: 032] if guard the slave in role::postgres::master [puppet] - 10https://gerrit.wikimedia.org/r/327181 (owner: 10Alexandros Kosiaris) [14:00:04] jouncebot: next [14:00:04] In 4 hour(s) and 59 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T1900) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T1400). Please do the needful. [14:00:04] aharoni: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:01:00] hashar: only one patch for eu swat. you? me? 
[14:01:24] hm, looks like aharoni is not around [14:01:26] go for it ? [14:01:32] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:01:34] I am dealing with some python test for zuul/selenium / Vector :} [14:01:58] maybe Nikerabbit can baby sit it [14:01:58] hashar: well, I'll wait for aharoni to show up, that is the deal, right? [14:02:04] yeah [14:02:23] aharoni: we were just talking about you, ready for swat? :) [14:02:24] (03CR) 10Marostegui: [C: 031] mariadb: Move backup templates to subdir on the right position [puppet] - 10https://gerrit.wikimedia.org/r/327186 (owner: 10Jcrespo) [14:03:00] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [14:03:06] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 05s) [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:40] zeljkof: ready! :) [14:04:02] in that case... [14:04:11] I can SWAT today! [14:06:27] (03CR) 10Marostegui: [C: 031] mariadb: Fix quering phabricator databases by querying m3 slave [puppet] - 10https://gerrit.wikimedia.org/r/327187 (https://phabricator.wikimedia.org/T151999) (owner: 10Jcrespo) [14:06:58] \O/ [14:07:22] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:07:25] hashar: I might need a bit of help with swat today, will ping you [14:07:40] I'm around. [14:08:29] aharoni: merging 327170... [14:09:14] (03PS1) 10Giuseppe Lavagetto: puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 [14:09:19] aharoni: can you test 327170 at mwdebug1002, once it is there? 
[14:09:56] (03CR) 10jenkins-bot: [V: 04-1] puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [14:11:01] zeljkof: remind me please: I enable it in the Firefox addon, and then I go to... which site? [14:11:25] aharoni: hm, let me check the docs... I rarely use it too [14:11:32] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [14:11:36] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 04s) [14:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:57] (03PS2) 10Giuseppe Lavagetto: puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 [14:12:15] aharoni: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [14:13:00] I think you just go to whatever site you need to test, the debug header will make sure you reach the canary machine [14:14:15] hashar: ok, a question [14:14:25] hashar: baby sit whom? [14:14:59] hashar: I usually deploy commits in operations/mediawiki-config [14:15:36] Nikerabbit: the patch that aharoni wanted deployed in this swat window, but he was a minute or two late, so we were thinking what to do if he does not show up [14:15:56] aha [14:16:06] hashar: how do I deploy a patch from mediawiki/extensions/ContentTranslation? [14:16:24] zeljkof: ping me on Google chat? :) [14:16:33] Anyway, I think I'm ready to test. [14:16:42] But it's not there yet AFAIK. [14:16:50] aharoni: just a sec, to log in [14:16:53] which branch did you deploy it? [14:17:02] Also AFAIK, ContentTranslation is not different from any other extension. [14:17:04] is that branch on any wikipedia yet excluding test? 
[14:17:31] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [14:17:35] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 04s) [14:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:20] aharoni, Nikerabbit: well, I rarely deploy any extensions, looks like, usually just config changes, still new to deploy [14:19:17] scap dir should do it, no? [14:19:24] aharoni: 327170 just got merged, proceeding... [14:19:26] anyway, I was thinking how to test it... that seems impossible to me [14:19:47] (it is working on en.wikipedia.beta.wmflabs.org though) [14:19:56] Nikerabbit: well, not sure what dir it is in, that is the problem :) [14:21:01] hashar: ok, the question [14:21:14] the docs: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Fetching_Patches [14:21:50] for 327170, instead of [14:21:51] you@tin:/srv/mediawiki-staging $ git fetch [14:21:55] I should do [14:22:17] zfilipin@tin:/srv/mediawiki-staging/php-1.29.0-wmf.6/extensions/ContentTranslation$ git fetch [14:22:23] yep [14:22:27] hashar: correct? ^ [14:22:49] Nikerabbit: thanks, I remember doing it a while back, but not for a few months, so I was not sure [14:23:28] /srv/mediawiki-staging is a clone of operations/mediawiki-config [14:23:44] mediawiki core/extensions are under the php-XX-wmf.Y [14:23:52] "git fetch" worked fine [14:24:04] (03PS1) 10Jcrespo: mariadb: Remove all paralelism from misc backup generation [puppet] - 10https://gerrit.wikimedia.org/r/327193 (https://phabricator.wikimedia.org/T134977) [14:24:12] but "git log -p HEAD..@{u}" says "fatal: HEAD does not point to a branch" [14:25:37] "git status" says "HEAD detached at 7ce1304" [14:25:41] uh oh [14:25:45] did I do something wrong' [14:26:14] ? [14:26:25] hashar: ^ [14:27:45] Nikerabbit: ^ [14:28:14] zeljkof: which path are you in? 
[14:28:25] zfilipin@tin:/srv/mediawiki-staging/php-1.29.0-wmf.6/extensions/ContentTranslation$ [14:28:28] that indicates you are in a detached head / i.e. not in a branch [14:28:38] and a detached head does not have a remote tracking branch set. Hence {u} is irrelevant [14:28:59] git log HEAD..origin/wmf/1.29.0-wmf.6 [14:29:12] or look at the status of that mw version [14:29:19] In /srv/mediawiki-staging/php-1.29.0-wmf.6 : git status [14:30:17] ok, that looks ok [14:30:27] ok, rebasing then [14:30:51] then [14:30:58] you just have to fetch from the mediawiki root [14:31:03] eg /srv/mediawiki-staging/php-1.29.0-wmf.6 [14:31:14] which would have the commit that bumps the ContentTranslation submodule [14:31:57] ok, so back to php-... folder and then "git fetch" [14:32:07] yeah [14:32:19] that is mediawiki/core @ the wmf branch [14:32:28] you will get a change such as * a7f44f0 - (origin/wmf/1.29.0-wmf.6) Update git submodules (Wed Dec 14 14:17:18 2016 +0000) [14:32:34] that is Gerrit automatically creating a commit to bump the submodule in mediawiki/core [14:32:43] when a change for said extension has been merged [14:32:49] so git fetch [14:32:49] review [14:32:51] git rebase [14:32:58] then check the status of the submodule: [14:33:05] git status extensions/ContentTranslation [14:33:08] you will see new commits [14:33:20] update the submodule: git submodule update extensions/ContentTranslation [14:33:26] that will autorebase to the latest version [14:33:45] then scap sync-dir php-1.29.0-wmf.6/extensions/ContentTranslation [14:35:34] (03CR) 10Volans: [C: 04-1] "In general LGTM, some minor comment inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [14:36:26] hashar: ok, thanks, could you please document that on deployers page? [14:36:39] ;) [14:36:51] I could try, but I'm not sure I would do a good job [14:36:58] .. 
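The extension deployment steps described above can be collected into one ordered list. This is a dry-run sketch only: it builds the command strings quoted in the conversation and runs nothing. The function name `extension_deploy_steps` is hypothetical, the `git log` review step is placed in the core checkout as an assumption, and the message argument that `scap sync-dir` is given in practice is left out.

```python
def extension_deploy_steps(version, extension):
    """Return the shell commands, in order, for syncing one extension update.

    Dry run only: nothing is executed. Paths follow the layout mentioned
    in the chat (/srv/mediawiki-staging/php-<version>).
    """
    core = f"/srv/mediawiki-staging/php-{version}"
    ext = f"extensions/{extension}"
    return [
        f"cd {core}",                             # mediawiki/core at the wmf branch
        "git fetch",                              # brings in the "Update git submodules" commit
        f"git log HEAD..origin/wmf/{version}",    # review what is incoming (assumed placement)
        "git rebase",
        f"git status {ext}",                      # should now show new commits
        f"git submodule update {ext}",            # move the submodule to the bumped commit
        f"scap sync-dir php-{version}/{ext}",     # sync the extension directory
    ]


if __name__ == "__main__":
    for step in extension_deploy_steps("1.29.0-wmf.6", "ContentTranslation"):
        print(step)
```

Fetching from the core checkout rather than from inside the extension directory matters because the submodule is checked out with a detached HEAD, which has no upstream tracking branch for `@{u}` to resolve.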
[14:37:20] I am 1001% sure that is described in a section such as "how to deploy an extension change" [14:37:22] or something like that [14:37:49] Nikerabbit, aharoni: I should push 327170 to cluster? or did you say that it can be tested at mwdebug1002? [14:38:07] hashar: really? I guess I have missed that [14:38:15] * zeljkof is looking up the extensions docs [14:38:40] zeljkof: unless -wmf.6 is deployed to some wikipedia it cannot be tested with mwdebug I believe [14:38:59] in fact hard to test at all before the train runs [14:39:04] !log Stop replication db2012 (m3) for maintenance - T151552 [14:39:15] guess that got figured out on beta already [14:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:17] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [14:39:26] yeah beta looks good [14:39:37] Nikerabbit: according to https://tools.wmflabs.org/versions/, wmf.6 is at group 0 [14:39:46] (what Nikerabbit said) [14:39:56] (03PS3) 10Elukey: [WIP] Yandex ClickHouse puppetization [puppet] - 10https://gerrit.wikimedia.org/r/325797 (https://phabricator.wikimedia.org/T150343) [14:39:57] I have to go now...-> [14:40:03] I'm here [14:40:06] hashar, Nikerabbit, aharoni: ok, in that case, deploying to the universe [14:40:23] PROBLEM - puppet last run on labstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:40:47] Operations, Labs, Labs-Infrastructure, Reading-Web-Backlog, and 3 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2872577 (Dzahn) [14:40:49] Operations, DNS, Traffic, Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2872576 (Dzahn) [14:41:22] (CR) Giuseppe Lavagetto: [] puppet-ecdsacert: various tweaks (4 comments) [puppet] - https://gerrit.wikimedia.org/r/327189 (owner: Giuseppe Lavagetto) [14:42:54] (CR) Giuseppe Lavagetto: [] puppet-ecdsacert: various tweaks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/327189 (owner: Giuseppe Lavagetto) [14:42:55] Nikerabbit: go for it. I'll test later today with the train. [14:42:59] zeljkof: ^ [14:43:10] Operations, Cassandra, hardware-requests, Services (blocked), Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2872578 (GWicke) @Robh: Excellent, thanks for the update! 
[14:43:39] !log zfilipin@tin Synchronized php-1.29.0-wmf.6/extensions/ContentTranslation: SWAT: [[gerrit:327170|Fix header on Special:CX when translating]] (duration: 00m 43s) [14:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] aharoni: deployed [14:44:42] you should be able to test it at group 0 [14:44:53] https://tools.wmflabs.org/versions/ [14:45:43] looks like that is all [14:45:52] !log finished EU SWAT [14:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] (03PS3) 10Giuseppe Lavagetto: puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 [14:51:09] (03CR) 10jenkins-bot: [V: 04-1] puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [14:55:54] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [14:55:54] (03PS4) 10Giuseppe Lavagetto: puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 [14:55:58] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 04s) [14:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] (03CR) 10Volans: [C: 04-1] puppet-ecdsacert: various tweaks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [15:03:17] (03PS6) 10Ema: varnishxcps: use varnishncsa to read log entries from the VSM [puppet] - 10https://gerrit.wikimedia.org/r/326965 (https://phabricator.wikimedia.org/T151643) [15:03:34] !log mobrovac@tin Starting deploy [trending-edits/deploy@c3bc30a]: (no message) [15:03:38] !log mobrovac@tin Finished deploy [trending-edits/deploy@c3bc30a]: (no message) (duration: 00m 04s) [15:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:11] (03CR) 10Alexandros Kosiaris: [C: 031] mariadb: Remove all paralelism from misc backup generation [puppet] - 10https://gerrit.wikimedia.org/r/327193 (https://phabricator.wikimedia.org/T134977) (owner: 10Jcrespo) [15:08:23] RECOVERY - puppet last run on labstore1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:08:57] (03PS1) 10Ema: debug_proxy: bump proxy_read_timeout [puppet] - 10https://gerrit.wikimedia.org/r/327204 (https://phabricator.wikimedia.org/T152895) [15:11:01] (03PS2) 10Ema: debug_proxy: bump proxy_read_timeout [puppet] - 10https://gerrit.wikimedia.org/r/327204 (https://phabricator.wikimedia.org/T152895) [15:13:56] (03CR) 10Andrew Bogott: [C: 031] delete wikitech.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [15:17:54] (03PS5) 10Giuseppe Lavagetto: puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 [15:18:32] (03CR) 10Alex Monk: [C: 031] "domain has never worked, does not currently work, and is not likely to work in the near future. and is seemingly redundant." [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [15:19:02] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [15:20:03] (03PS1) 10Jcrespo: dbhosts: update core dbs to include db1095 and new labsdbs [software] - 10https://gerrit.wikimedia.org/r/327209 [15:20:06] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-ecdsacert: various tweaks [puppet] - 10https://gerrit.wikimedia.org/r/327189 (owner: 10Giuseppe Lavagetto) [15:23:22] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-wildcardsign] [15:24:20] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/4884/hassaleh.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/327204 (https://phabricator.wikimedia.org/T152895) (owner: 10Ema) [15:27:01] (03PS3) 10Ema: debug_proxy: bump proxy_read_timeout [puppet] - 10https://gerrit.wikimedia.org/r/327204 (https://phabricator.wikimedia.org/T152895) [15:27:12] (03CR) 10Ema: [V: 032 C: 032] debug_proxy: bump proxy_read_timeout [puppet] - 10https://gerrit.wikimedia.org/r/327204 (https://phabricator.wikimedia.org/T152895) (owner: 10Ema) [15:29:51] !log mobrovac@tin Starting deploy [trending-edits/deploy@c04e9d1]: (no message) [15:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] !log mobrovac@tin Finished deploy [trending-edits/deploy@c04e9d1]: (no message) (duration: 00m 21s) [15:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:22] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:47:23] (03PS1) 10Eevans: enable instance restbase1016-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327218 (https://phabricator.wikimedia.org/T151086) [15:48:42] (03CR) 10Eevans: [C: 031] "Ready!" [puppet] - 10https://gerrit.wikimedia.org/r/327218 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:49:41] !log mobrovac@tin Starting deploy [trending-edits/deploy@c04e9d1]: (no message) [15:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:00] !log mobrovac@tin Finished deploy [trending-edits/deploy@c04e9d1]: (no message) (duration: 00m 19s) [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:14] (03CR) 10Marostegui: [C: 031] "I assume we will include dbstore2001 once it has all the shards? 
So far it has s1,s2,s3,s4,s5,s6" [software] - 10https://gerrit.wikimedia.org/r/327209 (owner: 10Jcrespo) [16:05:08] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for wdqs2003 Bug:T152644 [dns] - 10https://gerrit.wikimedia.org/r/327222 (https://phabricator.wikimedia.org/T152644) [16:05:31] (03CR) 10Jcrespo: [C: 032] mariadb: Move backup templates to subdir on the right position [puppet] - 10https://gerrit.wikimedia.org/r/327186 (owner: 10Jcrespo) [16:05:40] (03PS2) 10Jcrespo: mariadb: Move backup templates to subdir on the right position [puppet] - 10https://gerrit.wikimedia.org/r/327186 [16:11:11] (03CR) 10Jcrespo: [C: 032] "dbstore2001 is already there. Remember that our "script" doesn't fail fatally when a tables or db doesn't exist, it just ignores it and tr" [software] - 10https://gerrit.wikimedia.org/r/327209 (owner: 10Jcrespo) [16:14:49] (03CR) 10Hashar: [] "recheck" [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [16:15:02] (03CR) 10Hashar: [] "Be bold ? 
:)" [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [16:15:13] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2872809 (10hashar) a:03hashar [16:15:41] (03PS1) 10Papaul: DNS: ADD production DNS for wdqs2003 Bug:T152644 [dns] - 10https://gerrit.wikimedia.org/r/327223 (https://phabricator.wikimedia.org/T152644) [16:17:02] !log tcp mysql addition in core fw for labsdb1009/10/11 from labs instances T140452 [16:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:14] T140452: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452 [16:18:29] <_joe_> !log re-generating internal certs for mw clusters [16:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:37] (03PS2) 10Jcrespo: mariadb: Fix quering phabricator databases by querying m3 slave [puppet] - 10https://gerrit.wikimedia.org/r/327187 (https://phabricator.wikimedia.org/T151999) [16:20:04] chasemp, what is the firewall like? [16:21:29] jynus: mirrors existing labsdbs which I'm imagining after we do some like-for-like testing down the road all of that is removed in favor of teh haproxy end points only [16:21:56] is it the "tcp dpt:mysql" one? [16:22:23] jynus: this is at the inf layer so it's 'labs-instance-in4 term labsdb-tcp4' not iptables [16:23:25] wait, is this fw on the db or somewhere else? [16:24:23] I will have to have a look at that soon [16:24:29] to tune it [16:24:29] somewhere else, there is an acl on the core routers from labs instances that whitelists labsdbs, I added the new ones for now. 
separately the fw on the dbs themselves can be used to block/allow [16:24:35] ok ok [16:25:05] (03PS1) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/327225 [16:25:06] (03PS1) 10Papaul: DHCP: ADD DHCP entries for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327226 (https://phabricator.wikimedia.org/T152644) [16:25:15] actually, we should do that for the proxies only [16:25:29] but it is ok to have it like that [16:25:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/327225 (owner: 10Giuseppe Lavagetto) [16:25:32] for admin reasons [16:25:37] (03PS2) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/327225 [16:25:42] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/327225 (owner: 10Giuseppe Lavagetto) [16:25:48] and have the fine tune on the machines, I think [16:26:11] I will have a look at that today or tomorrow [16:26:17] agreed, once we put things in service I'm matching the existing so we can reason about migration cases and sort it out [16:26:38] <_joe_> jynus: ok to merge your change too? [16:26:41] should be a '.' 
there after service to make any sense [16:27:01] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2872862 (10Papaul) [16:27:03] <_joe_> you submitted it 20 minutes ago, so I guess it is [16:27:14] I wanted to merge more at the same time [16:27:26] but the ones merged can go already [16:27:36] it is just that the rebases are slow [16:27:47] <_joe_> ok [16:27:53] (03CR) 10Jcrespo: [C: 032] mariadb: Fix quering phabricator databases by querying m3 slave [puppet] - 10https://gerrit.wikimedia.org/r/327187 (https://phabricator.wikimedia.org/T151999) (owner: 10Jcrespo) [16:27:59] (03PS3) 10Jcrespo: mariadb: Fix quering phabricator databases by querying m3 slave [puppet] - 10https://gerrit.wikimedia.org/r/327187 (https://phabricator.wikimedia.org/T151999) [16:28:19] they are not changes in real time [16:29:07] (03PS2) 10Jcrespo: mariadb: Remove all paralelism from misc backup generation [puppet] - 10https://gerrit.wikimedia.org/r/327193 (https://phabricator.wikimedia.org/T134977) [16:30:13] 06Operations, 10DBA, 13Patch-For-Review: Throttle mysql backups on dbstore1001 in order to not saturate the node - https://phabricator.wikimedia.org/T134977#2872886 (10jcrespo) a:03jcrespo Putting this in progress so it is in the radar, to check next week if it worked. 
[16:30:28] (03CR) 10Jcrespo: [C: 032] mariadb: Remove all paralelism from misc backup generation [puppet] - 10https://gerrit.wikimedia.org/r/327193 (https://phabricator.wikimedia.org/T134977) (owner: 10Jcrespo) [16:33:46] (03PS1) 10Papaul: Add partman recipe for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327227 (https://phabricator.wikimedia.org/T152644) [16:34:30] 06Operations, 10DBA, 10MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2872905 (10Marostegui) I have extended this just in case and in order to avoid bothering our US folks with a page. [16:35:33] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:37:27] (03PS1) 10Andrew Bogott: toollabs: Include openstack::clientlib on web hosts [puppet] - 10https://gerrit.wikimedia.org/r/327228 [16:40:24] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix SANs for internal certs [puppet] - 10https://gerrit.wikimedia.org/r/327229 [16:40:37] 06Operations, 07Availability: Set databases as read-only or switchover to secondary datacenter - https://phabricator.wikimedia.org/T138810#2872925 (10jcrespo) [16:42:33] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: fix SANs for internal certs [puppet] - 10https://gerrit.wikimedia.org/r/327229 (owner: 10Giuseppe Lavagetto) [16:42:50] (03PS2) 10Eevans: enable instance restbase1016-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327218 (https://phabricator.wikimedia.org/T151086) [16:43:03] <_joe_> jynus: again, I need to merge [16:43:06] <_joe_> can I? [16:43:22] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:44:13] yes, sorry [16:44:33] trying to do too many things at the same time [16:44:41] <_joe_> heh [16:44:56] and when it is not a breaking change, I lose interest [16:45:11] because I have other 20 breaking changes at the same tiem [16:48:15] !log Stop replication db2068 for maintenance - T151552 [16:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:28] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [16:58:22] (03PS2) 10Andrew Bogott: toollabs: Include openstack::clientlib on web hosts [puppet] - 10https://gerrit.wikimedia.org/r/327228 [16:58:23] (03PS1) 10Andrew Bogott: Add novaconfig to labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/327230 (https://phabricator.wikimedia.org/T150092) [17:00:12] (03PS1) 10Andrew Bogott: Add labs.yaml to labs/private [labs/private] - 10https://gerrit.wikimedia.org/r/327231 (https://phabricator.wikimedia.org/T150092) [17:01:45] (03PS3) 10Andrew Bogott: toollabs: Include openstack::clientlib on web hosts [puppet] - 10https://gerrit.wikimedia.org/r/327228 [17:01:47] (03PS1) 10Andrew Bogott: Labs hiera: Include private labs.yaml in hiera search [puppet] - 10https://gerrit.wikimedia.org/r/327232 (https://phabricator.wikimedia.org/T150092) [17:06:50] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: wdqs2003 switch port configuration - https://phabricator.wikimedia.org/T153094#2873019 (10RobH) 05Open>03Resolved network port enabled, description set, and put in the internal vlan. [17:06:53] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2873021 (10RobH) [17:09:32] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:14:43] (03CR) 10RobH: [C: 032] DNS: Add mgmt DNS entries for wdqs2003 Bug:T152644 [dns] - 10https://gerrit.wikimedia.org/r/327222 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:16:48] (03PS2) 10RobH: DNS: ADD production DNS for wdqs2003 Bug:T152644 [dns] - 10https://gerrit.wikimedia.org/r/327223 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:17:49] (03CR) 10RobH: [C: 032] DNS: ADD production DNS for wdqs2003 Bug:T152644 [dns] - 10https://gerrit.wikimedia.org/r/327223 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:19:44] (03PS2) 10RobH: Add partman recipe for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327227 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:20:06] aww, i miss gerrit telling me verified now [17:20:13] i had it for an hour yesterday and now im spoiled! [17:22:57] (03PS1) 10Yuvipanda: tools: Automount ldap.yaml too onto containers [puppet] - 10https://gerrit.wikimedia.org/r/327235 [17:23:44] (03CR) 10RobH: [C: 032] Add partman recipe for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327227 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:23:53] (03CR) 10jenkins-bot: [V: 04-1] tools: Automount ldap.yaml too onto containers [puppet] - 10https://gerrit.wikimedia.org/r/327235 (owner: 10Yuvipanda) [17:24:00] (03PS2) 10RobH: DHCP: ADD DHCP entries for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327226 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:24:13] (03CR) 10RobH: [C: 032] DHCP: ADD DHCP entries for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327226 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) [17:24:37] (03CR) 10RobH: [V: 032 C: 032] DHCP: ADD DHCP entries for wdqs2003 Bug:T152644 [puppet] - 10https://gerrit.wikimedia.org/r/327226 (https://phabricator.wikimedia.org/T152644) (owner: 10Papaul) 
[17:27:48] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2873094 (10Mvolz) [17:28:54] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2873098 (10RobH) All of the install_module and dns changes that @papaul put in gerrit for review have been merged live on cluster, @papaul can continue wit... [17:29:09] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2873104 (10RobH) [17:31:39] (03PS3) 10Filippo Giunchedi: enable instance restbase1016-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327218 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [17:33:32] !log OS install on wdqs2003 [17:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:15] I am fixing dbstore1001 and es2001 [17:35:25] (03PS1) 10Jcrespo: mariadb: Fix mistaken template puppet url [puppet] - 10https://gerrit.wikimedia.org/r/327238 [17:35:42] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Fix mistaken template puppet url [puppet] - 10https://gerrit.wikimedia.org/r/327238 (owner: 10Jcrespo) [17:35:56] (03PS2) 10Jcrespo: mariadb: Fix mistaken template puppet url [puppet] - 10https://gerrit.wikimedia.org/r/327238 [17:36:49] (03CR) 10Filippo Giunchedi: [C: 032] enable instance restbase1016-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327218 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [17:37:32] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:38:25] (03PS3) 10Jcrespo: mariadb: Fix mistaken template puppet url [puppet] - 10https://gerrit.wikimedia.org/r/327238 [17:39:12] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Fix mistaken template puppet url [puppet] - 
10https://gerrit.wikimedia.org/r/327238 (owner: 10Jcrespo) [17:40:22] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:41:38] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-Elukey: Port apache httpd metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147316#2873122 (10elukey) a:03elukey [17:42:02] (03PS1) 10Elukey: Add the prometheus-apache-exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) [17:51:12] akosiaris: thoughs on my comment on https://gerrit.wikimedia.org/r/#/c/323079 ? I'm asking because I'd like to go ahead with it today/tomorrow [17:51:42] (03PS1) 10Faidon Liambotis: aptrepo: add Docker's apt repo to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/327241 [17:51:44] (03PS1) 10Faidon Liambotis: docker: cleanup dockerproject's apt repository [puppet] - 10https://gerrit.wikimedia.org/r/327242 [17:51:46] (03PS1) 10Faidon Liambotis: docker: cleanup the custom apt repository stanzas [puppet] - 10https://gerrit.wikimedia.org/r/327243 [17:53:59] godog: well given your response, I would say it probably is better to have puppet fail if the hiera variable is not defined [17:54:24] (03CR) 10Faidon Liambotis: [] "LGTM, but I'd prefer doing this whole thing in an entirely different way, cf. topic:docker-apt." 
[puppet] - 10https://gerrit.wikimedia.org/r/321485 (owner: 10Dduvall) [17:55:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add hhvm_exporter role and class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [17:57:21] (03PS1) 10Jcrespo: mariadb: even more small fixes for the misc backup scripts [puppet] - 10https://gerrit.wikimedia.org/r/327245 (https://phabricator.wikimedia.org/T151999) [17:57:59] akosiaris: heh though the variable won't be defined in labs, I guess we could default to something sane tho [17:59:09] (03CR) 10Jcrespo: [C: 032] mariadb: even more small fixes for the misc backup scripts [puppet] - 10https://gerrit.wikimedia.org/r/327245 (https://phabricator.wikimedia.org/T151999) (owner: 10Jcrespo) [17:59:40] (03CR) 10Rush: [C: 031] "This (and the subsequent cleanup) makes a lot of sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/327241 (owner: 10Faidon Liambotis) [17:59:58] godog: well there is the labs wide hiera file labs.yaml where we could define that variable [18:00:06] assuming it makes sense [18:00:09] gtg, meeting [18:01:46] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:01:46] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:47] PROBLEM - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.33 and port 9042: Connection refused [18:02:35] akosiaris: ok thanks! 
I'll look into that now [18:03:02] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.33 and port 9042: Connection refused Filippo Giunchedi bootstrapping [18:17:22] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2873220 (10jcrespo) MySQLs wit no SSL ``` $ sudo salt -C 'G@cluster:mysql' cmd.run 'mysql --skip-ssl -e "SELECT @@ssl_ca"' | grep -c 'NULL' 14 ``` MySQL with expired TLS cert: ``` $ sudo... [18:18:05] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2873222 (10jcrespo) [18:18:21] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2203662 (10jcrespo) [18:18:30] (03PS2) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [18:19:36] (03CR) 1020after4: [C: 032] No need to import ValueError, it's built in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326199 (owner: 10Chad) [18:20:26] (03Merged) 10jenkins-bot: No need to import ValueError, it's built in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326199 (owner: 10Chad) [18:20:52] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2873250 (10jcrespo) Pending hosts: ``` db1063.eqiad.wmnet db1054.eqiad.wmnet db1067.eqiad.wmnet db1036.eqiad.wmnet db1015.eqiad.wmnet db1021.eqiad.wmnet db1022.eqiad.wmnet db205... 
[18:21:18] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2873253 (10jcrespo) [18:22:03] !log twentyafterfour@tin Synchronized scap/plugins/patch.py: sync patch.py (duration: 00m 40s) [18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:35] (03PS4) 10Filippo Giunchedi: Add hhvm_exporter role and class [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) [18:26:42] (03CR) 10Alex Monk: [C: 031] lists/exim: move files from /files to role module [puppet] - 10https://gerrit.wikimedia.org/r/327138 (owner: 10Dzahn) [18:27:44] (03CR) 10Filippo Giunchedi: [] Add hhvm_exporter role and class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323079 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [18:28:04] (03PS1) 10Florianschmidtwelzow: Enable sitenotice banners for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327253 (https://phabricator.wikimedia.org/T152826) [18:28:27] akosiaris: attempted a fix but afaics without a sane default in labs the role can't be used out of the box without further tweaks :( [18:29:46] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [18:34:49] (03PS1) 10Thcipriani: Bump scap version to 3.4.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/327256 (https://phabricator.wikimedia.org/T127762) [18:34:58] godog: thanks for the merge! [18:35:19] urandom: np! 
[18:38:24] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 4 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2873344 (10Aklapper) [18:42:44] (03PS1) 10Eevans: enable instance restbase1016-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327260 (https://phabricator.wikimedia.org/T151086) [18:43:38] (03CR) 10Eevans: [C: 04-1] "Not yet ready." [puppet] - 10https://gerrit.wikimedia.org/r/327260 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [18:46:32] (03PS3) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [18:47:06] PROBLEM - Juniper alarms on asw-a-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [18:47:10] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2873383 (10Papaul) [18:47:51] (03CR) 10Yuvipanda: [C: 04-1] "This approach has a few issues:" [puppet] - 10https://gerrit.wikimedia.org/r/327228 (owner: 10Andrew Bogott) [18:48:00] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2855908 (10Papaul) a:05Papaul>03Gehel @Gehel you can take over. [18:48:15] (03CR) 10Yuvipanda: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/327235/ for how we can expose a data file (with just credentials) to all containers." [puppet] - 10https://gerrit.wikimedia.org/r/327228 (owner: 10Andrew Bogott) [18:48:25] papaul: great! Thanks! 
cc SMalyshev [18:48:52] gehel: yw [18:48:59] godog: heh, check out iowait on 1016: https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-system?from=now-24h&to=now&panelId=13&fullscreen [18:51:21] urandom: hehehe neat [18:53:41] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2873403 (10Deskana) >>! In T110171#2870147, @Dzahn wrote: > I know this is a general Phabricator workflow thing but i n... [18:53:47] (03PS1) 10Dzahn: add tendril[12]001, v4 and v6 IPs [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) [18:59:45] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2873409 (10Cmjohnson) It appears the LFF to SFF adapter was the culprit for the disk not presenting itself. Replaced the adapter and all disk show up. During the... [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T1900). Please do the needful. [19:00:04] James_F, ejegg, and yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:26] * James_F waves. [19:00:29] standing by [19:01:14] I can SWAT today [19:01:18] Cool. 
[19:01:19] here [19:01:29] patch cherry-picked to wmf.5 and wmf.6, up for review: https://gerrit.wikimedia.org/r/327251 https://gerrit.wikimedia.org/r/327252 [19:01:41] (03PS3) 10Thcipriani: Provide the visual editor wikitext mode Beta Feature to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321993 (owner: 10Jforrester) [19:01:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321993 (owner: 10Jforrester) [19:02:43] verified that it does fix the issues on beta cluster [19:02:52] (03Merged) 10jenkins-bot: Provide the visual editor wikitext mode Beta Feature to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321993 (owner: 10Jforrester) [19:03:04] images below the fold appear: https://googleweblight.com/?lite_url=https://en.wikipedia.beta.wmflabs.org/wiki/Barack_Obama [19:04:07] (03PS2) 10Dzahn: delete wikitech.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) [19:04:15] James_F: config change is live on mwdebug1002, check please [19:05:36] (03CR) 10Dzahn: [C: 032] "thanks for the +1's everybody" [dns] - 10https://gerrit.wikimedia.org/r/327022 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [19:07:18] thcipriani: One second, double-checking an odd bug which I think is debug-only. [19:07:29] ack, thanks [19:07:56] (03PS4) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [19:08:03] 06Operations, 06Labs, 10Labs-Infrastructure, 06Reading-Web-Backlog, and 3 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2873427 (10Dzahn) [radon:~] $ host wikitech.m.wikimedia.org Host wikitech.m.wikimedia.org not found: 3(NXDOMAIN) This... 
[19:08:34] 06Operations, 10DNS, 10Traffic, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2873430 (10Dzahn) [19:08:37] 06Operations, 06Labs, 10Labs-Infrastructure, 06Reading-Web-Backlog, and 3 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2873428 (10Dzahn) 05Open>03Resolved a:03Dzahn [19:10:02] thcipriani: Go ahead, but we will have a follow-up patch. [19:10:09] James_F: ok, going live. [19:12:09] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321993|Provide the visual editor wikitext mode Beta Feature to all]] (duration: 00m 42s) [19:12:15] ^ James_F live everywhere [19:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:27] * thcipriani waits on jenkins [19:12:32] thcipriani: Thanks! [19:14:19] thcipriani: hey, i added a last-minute patch to the swat, if that's okay. [19:14:36] MatmaRex: sure, I'll try to get there :) [19:15:30] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port gdnsd statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147426#2873453 (10fgiunchedi) Pasting a conversation in `#wikimedia-traffic` re: status codes and dashboarding ``` 19:04 in terms of... 
[19:15:53] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [19:16:29] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2873458 (10RobH) [19:17:05] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2873461 (10Cmjohnson) [19:17:09] (03PS2) 10Andrew Bogott: Add novaconfig to labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/327230 (https://phabricator.wikimedia.org/T150092) [19:17:17] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add labs.yaml to labs/private [labs/private] - 10https://gerrit.wikimedia.org/r/327231 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:17:20] James_F: visualeditor typo update is live on mwdebug1002, check please [19:17:20] 06Operations, 10DNS, 10Traffic, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2873463 (10Krenair) [19:17:59] thcipriani: Looking. [19:19:18] thcipriani: Yeah, LGTM. [19:19:25] James_F: cool, going live. [19:19:53] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:22:31] !log thcipriani@tin Synchronized php-1.29.0-wmf.6/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: SWAT: [[gerrit:327258|Follow-up Ic1f1de26: Fix typo in edit tab selector]] (duration: 00m 49s) [19:22:39] ^ James_F live everywhere [19:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:21] thcipriani: Thanks! 
[19:25:00] ejegg: your changes for both .5 and .6 are live on mwdebug1002, check please [19:26:06] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2873547 (10Dzahn) a:03Dzahn [19:26:47] (03CR) 10Andrew Bogott: [C: 032] Add novaconfig to labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/327230 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:27:02] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2839539 (10Dzahn) we are waiting for https://gerrit.wikimedia.org/r/#/c/324797/ here, traffic team asked to wait a couple days because they were in the mi... [19:27:21] (03PS2) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [19:27:27] checking [19:27:51] (03PS2) 10Andrew Bogott: Labs hiera: Include private labs.yaml in hiera search [puppet] - 10https://gerrit.wikimedia.org/r/327232 (https://phabricator.wikimedia.org/T150092) [19:28:03] (03CR) 10Madhuvishy: [] WIP tools: job to copytruncate logs in place (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326153 (owner: 10Rush) [19:29:39] thcipriani: I see the change, and nothing's breaking [19:29:55] will have to wait for wide deploy to see the fix via google's proxy [19:30:15] ejegg: okie doke. Going live. 
Will start with wmf.6 and then do wmf.5 [19:30:24] (03CR) 10Andrew Bogott: [C: 032] Labs hiera: Include private labs.yaml in hiera search [puppet] - 10https://gerrit.wikimedia.org/r/327232 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:32:45] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2873558 (10fgiunchedi) @hashar re: the jenkins stats in ganglia in P4571 are they all/some in graphite already? [19:32:47] !log thcipriani@tin Synchronized php-1.29.0-wmf.6/resources/src/startup.js: SWAT: [[gerrit:327252|Add googleweblight to JS blacklist]] T152602 (duration: 00m 41s) [19:32:53] (03PS2) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [19:32:55] (03PS1) 10Yuvipanda: labsdb: Add 'status' field to labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/327271 [19:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:58] T152602: Spurious Amazon clicks / Banners on googleweblight.com - https://phabricator.wikimedia.org/T152602 [19:34:16] 06Operations, 06Labs, 10Labs-Infrastructure: Silver anomalies - https://phabricator.wikimedia.org/T151486#2873604 (10Dzahn) [19:34:20] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/resources/src/startup.js: SWAT: [[gerrit:327251|Add googleweblight to JS blacklist]] T152602 (duration: 00m 39s) [19:34:23] (03PS2) 10Yuvipanda: labsdb: Add 'status' field to labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/327271 [19:34:24] ^ ejegg live all the places [19:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:33] (03CR) 10Yuvipanda: [V: 032 C: 032] "Already applied" [puppet] - 10https://gerrit.wikimedia.org/r/327271 (owner: 10Yuvipanda) [19:35:01] 06Operations, 06Labs, 10Labs-Infrastructure: silver: /dev/md2 mounted twice - 
https://phabricator.wikimedia.org/T151489#2873610 (10Dzahn) [19:35:05] (03CR) 10jenkins-bot: [V: 04-1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [19:35:06] thcipriani: awesome, checking through the proxy [19:35:43] (03PS5) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [19:37:07] hmm, still nuking images below the fold. [19:37:29] Guessing it's just RL stuff being cached [19:38:25] 06Operations, 10ops-eqiad, 13Patch-For-Review: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2873663 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi [19:38:48] yurik: blerg. Looks like wmf.6 is fixed as far as submodule bumping automagically goes, but we'll have to do a manual submodule bump for wmf.5 [19:39:13] thcipriani, lets do 6 first? also, could you add https://gerrit.wikimedia.org/r/#/c/327274/ to it too? [19:39:31] i will add it to the list [19:39:43] legal has been asking for that [19:40:24] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:40:40] hrm. That's a full scap worth of stuff. Will take a longer than the remainder of this window and will run into train. Can we do that one evening SWAT? [19:41:55] thcipriani, ok [19:41:59] thanks [19:42:06] creating the submodule bump for 5 now [19:42:10] thx [19:42:47] * yurik glad .6 is auto-doing it again! 
[19:44:07] yurik: could you +1 https://gerrit.wikimedia.org/r/#/c/327276/ if it looks right to you [19:44:12] or +2 ¯\_(ツ)_/¯ [19:44:24] * yurik looks [19:44:48] 06Operations: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#2823217 (10Dzahn) > We should instead puppetize this so that those kind of hosts have a special configuration in Icinga so that the check names have some sort of identifier like TEST INSTANCE or DECOM to clearly re... [19:45:33] thanks thcipriani! [19:46:18] yurik: change is live for wmf.6 on mwdebug1002, check please (if there's anything to check there) [19:47:52] sec [19:48:12] (03CR) 10Filippo Giunchedi: [] "LGTM overall, minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327240 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [19:48:12] ok [19:48:53] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:50:11] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Initial debianization [debs/prometheus-redis-exporter] - 10https://gerrit.wikimedia.org/r/325471 (owner: 10Filippo Giunchedi) [19:50:26] thcipriani, go for it [19:50:38] yurik: ok, going live for wmf.6 [19:52:35] !log thcipriani@tin Synchronized php-1.29.0-wmf.6/extensions/JsonConfig/includes/JCLuaLibrary.php: SWAT: [[gerrit:327246|Reindex tabular data array for easier lua access]] T152809 (duration: 00m 40s) [19:52:46] ^ yurik live on wmf.6 [19:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:48] T152809: mw.ext.data.get() returns 0-indexed tables - https://phabricator.wikimedia.org/T152809 [19:53:02] MatmaRex: still around? 
[19:53:22] 06Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#2873762 (10Dzahn) p:05Normal>03Low lowering priority because meanwhile we have a goal to remove Ganglia [19:53:26] thcipriani: hello [19:53:59] thcipriani: mine's a (low impact) security patch, so i submitted it as a draft [19:54:20] thcipriani: i think it's simplest if you checkout it and sync, then we make it public and merge [19:54:20] MatmaRex: hi. I +2'd your change just now, but it didn't show up here :) [19:54:47] thcipriani: it's a draft (semi-secret) [19:55:17] (03PS1) 1020after4: all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327277 [19:55:19] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327277 (owner: 1020after4) [19:55:23] so it probably won't merge until i "publish" it [19:55:40] (03CR) 1020after4: [C: 032] "see T153184" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327277 (owner: 1020after4) [19:55:45] gotcha, OK, I'll fetch it down to tin and sync [19:55:57] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327277 (owner: 1020after4) [19:57:01] MatmaRex: it's on mwdebug1002, if you want to check anything before it goes live [19:57:25] yeah, i can verify [19:57:39] thcipriani: works as expected [19:57:43] MatmaRex: ok, going live [19:58:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 40s) [19:58:59] ^ MatmaRex live everywhere [19:59:02] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2374163 (10Dzahn) fwiw, install2001 looks alright as of today, CPU usage is very low, and the smokeping graph for bast2001 is a flat line https://smokeping.wikimedia.org/?target=codfw.Hosts.bast2001 unsure wh... 
[19:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:11] thank you! [19:59:13] (03Draft2) 10Bartosz Dziewoński: Set $wgAbuseFilterParserClass='AbuseFilterParser' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327269 (https://phabricator.wikimedia.org/T153217) [19:59:23] thcipriani: done with swat? I need to revert to wmf5 according to T153184 [19:59:23] T153184: s3 database resource usage and contention increased 2-10x times - https://phabricator.wikimedia.org/T153184 [19:59:28] thcipriani: ^ should probably get that merged [19:59:47] twentyafterfour: not quite yet [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T2000). [20:01:15] (03PS3) 10Bartosz Dziewoński: Set $wgAbuseFilterParserClass='AbuseFilterParser' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327269 (https://phabricator.wikimedia.org/T153217) [20:01:34] that might have to be +2'd again? 
twentyafterfour just merged a commit [20:01:40] thcipriani: ^ [20:01:53] (03CR) 10Thcipriani: [] Set $wgAbuseFilterParserClass='AbuseFilterParser' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327269 (https://phabricator.wikimedia.org/T153217) (owner: 10Bartosz Dziewoński) [20:01:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327269 (https://phabricator.wikimedia.org/T153217) (owner: 10Bartosz Dziewoński) [20:02:02] let's try that [20:02:28] yurik: also, your change is live for wmf.5 on mwdebug1002, check please [20:02:43] yeah I didn't realize swat was still going and I started to rollback the train due to T153184 [20:02:46] twentyafterfour: sorry :( I will poke you when all's clear [20:02:58] (03Merged) 10jenkins-bot: Set $wgAbuseFilterParserClass='AbuseFilterParser' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327269 (https://phabricator.wikimedia.org/T153217) (owner: 10Bartosz Dziewoński) [20:03:21] (03PS1) 10Dzahn: remove wikisource.cz [dns] - 10https://gerrit.wikimedia.org/r/327279 (https://phabricator.wikimedia.org/T137105) [20:04:18] i think i was the last one in swat? [20:04:46] eh, because wmf.5 doesn't bump submodules, I squeezed you in while I was waiting on JsonConfig submodule bump for wmf.5 [20:04:47] so as long as you don't accidentally undo https://gerrit.wikimedia.org/r/327269 , it's probably ok to do the train now. thcipriani twentyafterfour [20:04:56] oh [20:05:11] ugh. 
indeed [20:05:14] anyway, thanks :) [20:05:21] waiting on yurik to give wmf.5 JsonConfig the go ahead and then it'll be all clear [20:05:24] thanks MatmaRex [20:05:27] (03PS1) 10Dzahn: remove wikipedia.org.br [dns] - 10https://gerrit.wikimedia.org/r/327280 (https://phabricator.wikimedia.org/T137105) [20:06:35] please keep https://gerrit.wikimedia.org/r/327251 up too (just SWATted to .5 and .6) [20:07:07] (03PS1) 10Dzahn: remove wikimediacommons.eu [dns] - 10https://gerrit.wikimedia.org/r/327281 (https://phabricator.wikimedia.org/T137105) [20:07:20] ejegg: yup, that change should be live everywhere :) [20:07:50] ty [20:08:08] thcipriani, i already did i think? [20:08:28] yurik: we did wmf.6, but not wmf.5 yet [20:08:41] thcipriani, is 5 on debug1002? [20:08:45] (03CR) 10Paladox: [C: 031] remove wikisource.cz [dns] - 10https://gerrit.wikimedia.org/r/327279 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [20:08:46] yup [20:09:04] (03CR) 10Paladox: [C: 031] remove wikipedia.org.br [dns] - 10https://gerrit.wikimedia.org/r/327280 (https://phabricator.wikimedia.org/T137105) (owner: 10Dzahn) [20:09:23] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:09:24] thcipriani, looks good for me [20:09:31] yurik: cool, thanks, going live [20:09:41] sorry didn't notice the first ping [20:11:21] np, this SWAT got a little weird :) [20:11:33] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/extensions/JsonConfig/includes/JCLuaLibrary.php: SWAT: [[gerrit:327246|Reindex tabular data array for easier lua access]] T152809 (duration: 00m 41s) [20:11:39] ^ yurik live everywhere [20:11:42] twentyafterfour: all clear [20:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:47] T152809: mw.ext.data.get() returns 0-indexed tables - https://phabricator.wikimedia.org/T152809 [20:11:49] thx [20:13:06] (03PS1) 10Filippo Giunchedi: prometheus: use key/value for gdnsd rcodes 
[puppet] - 10https://gerrit.wikimedia.org/r/327282 (https://phabricator.wikimedia.org/T147426) [20:13:16] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.5 refs T153184 [20:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] T153184: s3 database resource usage and contention increased 2-10x times - https://phabricator.wikimedia.org/T153184 [20:20:02] ok so the rollback to wmf.5 doesn't appear to be helping the s3 database load... T153184 [20:20:03] T153184: s3 database resource usage and contention increased 2-10x times - https://phabricator.wikimedia.org/T153184 [20:29:28] thcipriani: ack (re: train) twentyafterfour I'm going to merge https://gerrit.wikimedia.org/r/#/c/327256/ for scap 3.4.2-1 FYI [20:31:23] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:23] godog: awesome! [20:31:39] twentyafterfour: ^ fine with you if that happens now? [20:32:31] sure [20:33:01] (03CR) 10Filippo Giunchedi: [C: 032] Bump scap version to 3.4.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/327256 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [20:33:04] (03CR) 10EBernhardson: [C: 031] [cirrus] Reduce regex/default timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326989 (owner: 10DCausse) [20:33:07] (03PS2) 10Filippo Giunchedi: Bump scap version to 3.4.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/327256 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [20:33:18] I'm not sure what to do with the train now.. I guess I should go ahead with it as the rollback didn't seem to help database load [20:46:33] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:50:30] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#2873968 (10fgiunchedi) [20:50:42] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#2873981 (10fgiunchedi) p:05Triage>03Normal [20:50:47] (03PS1) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [20:51:38] (03CR) 10jenkins-bot: [V: 04-1] Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [20:53:01] (03PS2) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [20:55:55] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2874018 (10hashar) The Jenkins metrics from P4571 can all be dropped. They are irrelevant to our setup: The number of jobs per status, w... [20:56:55] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#2874023 (10RobH) [20:57:19] (03CR) 10Thcipriani: [] "> Do you recommend using /srv/deployment/scholarships instead?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [20:57:19] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2806742 (10RobH) [20:57:41] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2806742 (10RobH) 05Open>03Resolved All systems are online and ready for use. The power issue is being tracked via sub-task, resolving this setup task. Services... [20:57:43] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2874042 (10RobH) [20:58:47] (03PS3) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [20:59:18] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161214T2100). 
[21:00:14] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327294 [21:00:16] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327294 (owner: 1020after4) [21:00:51] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327294 (owner: 1020after4) [21:01:10] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.6 [21:01:21] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327296 [21:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:23] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327296 (owner: 1020after4) [21:02:10] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327296 (owner: 1020after4) [21:02:55] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.6 [21:03:35] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2874069 (10Eevans) Thanks @RobH ! 
[21:04:45] (03PS4) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [21:05:48] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [21:05:48] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [21:06:28] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:06:39] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:08:00] ^ this is because wikitech is down I believe and it uses the api [21:09:21] reverting wmf6 on group1 [21:09:33] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327304 [21:09:35] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327304 (owner: 1020after4) [21:10:22] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327304 (owner: 1020after4) [21:10:39] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.5 [21:10:45] !log mholloway-shell@tin Starting deploy [mobileapps/deploy@ae1d3c2]: Update mobileapps to f5d9d86 [21:10:48] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (70840 200000s) [21:10:48] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (70845 200000s) [21:11:28] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd 
is active [21:11:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exportd is active [21:12:23] conclusion: SemanticForms rename which landed in wmf.6 broke wikitech, rolled back. Thanks Krenair for finding the cause so quickly [21:12:50] alright [21:13:17] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@ae1d3c2]: Update mobileapps to f5d9d86 (duration: 02m 32s) [21:13:18] twentyafterfour, I think we can do a simple change of wikitech.php to include the new extension instead [21:14:31] twentyafterfour what was the error please? [21:14:38] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:14:40] !log arlolra@tin Starting deploy [parsoid/deploy@b8fe41f]: Updating Parsoid to 60ee19ac [21:14:52] why do you want it paladox? [21:15:20] Krenair no, just would like to know the error to see if i can fix it. [21:15:48] paladox, we're fixing it [21:15:54] ok [21:16:31] (03PS1) 10Reedy: Replace WikiPage::getText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327306 [21:17:07] (03CR) 10Chad: [C: 032] Replace WikiPage::getText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327306 (owner: 10Reedy) [21:17:14] Krenair: I've got a patch incoming [21:17:18] paladox: ^ [21:17:29] ok [21:17:32] thanks [21:17:42] (03Merged) 10jenkins-bot: Replace WikiPage::getText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327306 (owner: 10Reedy) [21:17:42] Just use the old branch? [21:17:53] oh, I'm already syncing twentyafterfour [21:17:54] It landed? 
[21:18:10] (03PS1) 1020after4: SemanticForms -> PageForms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327307 [21:18:17] !log krenair@tin Synchronized wmf-config/wikitech.php: trying pageforms (duration: 00m 39s) [21:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:37] Um [21:18:41] https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json shows SemanticForms 3.7? [21:18:56] https://github.com/wikimedia/mediawiki-tools-release/commit/72ca89c3db84bb758d39adb950e6427f05761c46 [21:19:01] ugh [21:19:03] !log demon@tin Synchronized w/robots.php: Unbreak (duration: 00m 40s) [21:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:16] https://github.com/wikimedia/mediawiki-extensions-SemanticForms/tree/3.7 [21:19:29] Though, using the same branch, we can use PageForms [21:19:39] And stop branching SemanticForms [21:19:47] Reedy, so wait [21:19:54] you think we use the PageForms repo [21:19:56] 3.7 branch [21:19:58] No, we're not [21:20:05] Did someone forget to update make-wmf-branch before running it? [21:20:05] we -should- [21:20:18] Because we emptied master of SemanticForms [21:20:21] And updated the tools to match [21:20:27] Reedy: make-wmf-branch has a check to prevent running it without pulling [21:20:29] If the tool wasn't updated locally [21:21:00] I'm pretty sure release tools was up to date when I branched yesterday [21:21:06] ah [21:21:16] so the branch we use should have been unaffected? 
[21:21:22] Yes [21:21:24] !log krenair@tin Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 40s) [21:21:25] But we were using master [21:21:29] for some reason [21:21:30] I stopped that [21:21:33] Because it's scary [21:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:50] (03CR) 10Paladox: [C: 031] SemanticForms -> PageForms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327307 (owner: 1020after4) [21:22:00] If we stop using SemanticForms completely, we can remove it from CI too [21:23:01] SemanticForms @ ae7b01c [21:23:06] Reedy: does make-wmf-branch even work with tags? [21:23:16] https://github.com/wikimedia/mediawiki-extensions-SemanticForms/commit/ae7b01c [21:23:20] twentyafterfour: Apparently not? [21:23:25] seems like it doesn't [21:23:25] It should. [21:23:27] The problem is it's not using the tag [21:23:31] it's just used master [21:23:58] Which is why I suggested it might've been run out of date [21:23:59] it expects a branch [21:24:19] woah woah woah [21:24:26] I had forgotten about this [21:24:30] https://gerrit.wikimedia.org/r/#/c/317279/ [21:24:49] Krenair: Yes, which is why I didn't want to just go to PageForms master [21:24:52] !log arlolra@tin Finished deploy [parsoid/deploy@b8fe41f]: Updating Parsoid to 60ee19ac (duration: 10m 12s) [21:24:57] Hence, stick to SMW 3.7 branch [21:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:04] Then we could move to the same branch in PageForms [21:25:15] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2874235 (10hashar) a:05hashar>03None deployment-phab01 got fixed... 
[21:25:18] To save CI [21:25:45] that +2 ACL extends to all branches [21:25:57] Yeah, but I presume 3.7 wouldn't be touched [21:26:02] Anyway, we weren't moving to PageForms [21:26:07] Why we've been branching it.. I'm not sure [21:27:00] !log Parsoid updated to version 60ee19ac (T119265, T104523, T104662) [21:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:15] T104523: Parsoid infinite recursion due to template loop involving - https://phabricator.wikimedia.org/T104523 [21:27:15] T104662: Parsoid allows nested ref tags - https://phabricator.wikimedia.org/T104662 [21:27:15] T119265: More metadata in Parsoid output - https://phabricator.wikimedia.org/T119265 [21:27:51] Fix the ref that .6 is pointing at, and just deploy that? [21:28:04] Krenair: That is not how we grant owner, either. You add to a group. [21:28:12] I really should remove that single-user-group plugin. [21:28:16] yes [21:28:17] It was default at some point. [21:29:18] single-user ACLs are a terrible idea [21:29:18] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2874283 (10mmodell) I had to upgrade mariadb manually on phab01 [21:29:31] I should probably fix mediawiki/* acl so projects can't override wmf/* acls. [21:30:44] Oh, I do [21:30:46] I use BLOCK [21:30:49] * ostriches whews [21:32:48] PROBLEM - Disk space on elastic1031 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 65025 MB (12% inode=99%) [21:33:24] Krenair: So is wikitech still broken? [21:33:37] not right now [21:34:01] Using PageForms? 
[21:34:13] it's not using PageForms right now afaik [21:34:16] I filed https://phabricator.wikimedia.org/T153257#2874251 for make-wmf-branch failing with tags [21:34:22] Krenair: https://gerrit.wikimedia.org/r/#/c/327360 [21:34:47] Reedy: Um, tell me how it's worked for other configured things then? We certainly haven't been using master of some of those... [21:35:00] ostriches: Well, either a git pull was missing, or it's broken [21:35:04] I didn't run it, so I don't know [21:35:19] Aren't the rest branches not tags? [21:35:35] ostriches, does that deny people in the mediawiki group who also have the right from the wmf-deployment group? [21:35:37] ostriches: Like I say, it is still pointing at master [21:35:40] So it might be the first [21:36:00] Krenair: No, using BLOCK in the *exact same config stanza* allows ALLOW to override it [21:36:07] BLOCK will prevent any inherited config from overriding [21:36:09] ok [21:37:48] RECOVERY - Disk space on elastic1031 is OK: DISK OK [21:37:53] (03CR) 10Jcrespo: [] "Can we call them dbmonitor or sqlmonitor, or anything more generic? I am not sure tendril as it is now will survive; we may change it in t" [dns] - 10https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: 10Dzahn) [21:41:54] I'm sure I pulled [21:42:10] heh [21:43:24] https://gerrit.wikimedia.org/r/327362 Push SemanticForms back to 3.7 branch [21:43:59] * twentyafterfour just checked bash history, indeed I did `cd release; git pull` before make-wmf-branch [21:45:11] twentyafterfour: Do you have the scrollback? Any errors that just continued? [21:45:25] Why was wmf.6 reverted? 
[21:45:37] hoo: s3 db issues [21:45:50] the s3 db issues didn't turn out to be related [21:45:52] (03PS5) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [21:45:57] but then it was reverted because of wikitech breakage [21:46:01] hm [21:46:17] I'm always confused when such things are logged neither in SAL nor in the gerrit commit messages [21:46:38] 20:13 twentyafterfour@tin: rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.5 refs T153184 [21:46:38] T153184: s3 database resource usage and contention increased 2-10x times - https://phabricator.wikimedia.org/T153184 [21:46:48] https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:18] yeah I always try to log related tasks [21:47:25] sometimes I forget :-/ [21:47:34] Reedy: hm… but that was before the wmf.6 deploy [21:47:40] that's why I didn't notice [21:48:10] the last one was wikitech breakage which didn't have a task filed afaik [21:48:32] Not according to SAL at least [21:48:39] or the deployment tracker bug [21:49:03] Anyway, I see that Wikidata is not an issue here, so vanishing for now [21:49:03] hoo|away: Fixing broken is more important than an immediate SAL [21:49:08] Reedy: True [21:49:08] :P [21:49:19] But doing a follow-up SAL is helpful [21:49:20] but yeah, there should be an appropriate log after [21:49:30] I'll log it [21:49:47] wait what's the conclusion about semanticforms? [21:50:09] thanks for fixing wikitech :) [21:51:29] !log wmf.6 deploy broke wikitech, group1 temporarily reverted to wmf.5 while we fix semanticforms extension. 
[21:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:01] beh, missing the services deployment window by a bit, need a few extra mins [21:53:12] semanticforms legacy version, joy [21:53:16] (03PS6) 10Andrew Bogott: Keystone: refactor observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/327290 (https://phabricator.wikimedia.org/T150092) [21:56:35] twentyafterfour: I love that dry-run isn't exposed [21:59:51] What forms do we actually still use on Wikitech? [22:00:41] reedy: https://phabricator.wikimedia.org/P4619 [22:00:51] from my scrollback [22:01:22] ostriches: doesn't the openstackmanager use semantic forms somehow? [22:01:34] It uses the semantic page data stuff [22:01:40] For the status pages, but those aren't forms.... [22:01:55] hmm I don't really know [22:02:23] The forms for access requests [22:03:07] * Reedy is running a dry run for .7 [22:03:08] Hardly worth the pain of installing an extension :p [22:03:13] !log reedy@tin Synchronized php-1.29.0-wmf.6/extensions/SemanticForms: Bring back tag of 3.7 (duration: 00m 44s) [22:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:02] ostriches, it's the process that every tools user has to go through [22:04:20] bd808, ^ [22:04:24] and that's dumb af :) [22:04:28] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle] [22:04:30] lol [22:04:39] It's less now due to Horizon, right? [22:04:49] no [22:05:14] horizon is for projectadmins. [22:05:17] nothing to do with tools users. [22:05:26] I didn't say it had anything to do with tool users [22:05:32] I was meaning other usages of SF and such [22:05:33] and SMW [22:05:37] hm [22:05:40] not sure. 
[22:05:45] I thought it removed some [22:05:47] we still make the nova resource pages [22:06:37] ok https://gerrit.wikimedia.org/r/#/c/327362/1 merged finally... [22:06:44] Reedy: I'm gonna deploy it [22:06:53] twentyafterfour: look up? :P [22:07:00] I did at 22:03:14 UTC: D [22:07:01] :D [22:07:04] oh [22:07:22] I didn't put anything back to .6 though [22:07:32] ok [22:07:41] that was my next question ;) [22:08:38] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:08:52] !log promote group1 to 1.29.0-wmf.6, now with less wikitech breakage (hopefully) [22:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:23] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327370 [22:09:25] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327370 (owner: 1020after4) [22:10:03] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327370 (owner: 1020after4) [22:11:11] twentyafterfour: ostriches found the bug [22:11:14] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.6 [22:11:20] Failed to log message to wiki. Somebody should check the error logs. [22:11:26] https://phabricator.wikimedia.org/P4620 [22:11:51] stashbot: that's weird, wikitech doesn't appear broken :P [22:11:51] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [22:12:04] !log wikitech appears unbroken [22:12:06] Failed to log message to wiki. Somebody should check the error logs. 
[22:12:09] wtf [22:12:46] [7b6e73447edbcf187f68d3ed] 2016-12-14 22:12:37: Fatal exception of type MWException [22:14:06] [db861a5f503c130f273b7289] 2016-12-14 22:13:55: Fatal exception of type MWException [22:14:40] 2016-12-14 22:12:37 [7b6e73447edbcf187f68d3ed] silver labswiki 1.29.0-wmf.6 exception ERROR: [7b6e73447edbcf187f68d3ed] /wiki/Tool:Stashbot MWException from line 335 of /srv/mediawiki/php-1.29.0-wmf.6/includes/MagicWord.php: Error: invalid magic word 'default_form' {"exception_id":"7b6e73447edbcf187f68d3ed"} [22:14:43] twentyafterfour: Localisation cache ok? [22:15:05] ohhh [22:15:09] ostriches: it is dumb, and I'm going to fix it but haven't gotten there yet. I think there are something like 3 forms used on wikitech [22:15:35] just when i was about to ask about next week swat / config changes windows [22:15:37] meh [22:15:46] what's the status of the next week? [22:15:51] no deploys [22:15:58] yurik: Trump isn't president yet [22:16:33] ostriches, no deploys at all, or no train? config changes also no? [22:16:37] !log updating l10n after reverting SemanticForms submodule [22:16:39] Failed to log message to wiki. Somebody should check the error logs. [22:16:43] oh duh [22:16:45] lol [22:16:49] tlol [22:16:57] yurik: No train, no swat, no config, no fun :) [22:17:10] * yurik leaves for the north pole [22:17:15] Standard exceptions for site outages, etc. [22:17:26] twentyafterfour: it will be in SAL I think so the !logs aren't lost [22:17:38] oh nice [22:17:38] nah, no means no, let the world wait until after new years :D [22:17:45] https://tools.wmflabs.org/sal/ [22:18:02] bd808: is there even a reason to update the wiki then? [22:18:05] * twentyafterfour shrugs [22:18:18] hysterical raisins? [22:18:23] lol [22:18:28] the same bot does both these days at least [22:19:01] bd808: yeah, I really like what you did with the sal tool [22:19:12] ugh, l10n-update is SLOW [22:20:14] wait... 
tin has 6 cores and we're only using 4 [22:20:32] * twentyafterfour makes scap use all of the cores [22:20:39] I think it's purposely not all the cores [22:20:46] no cores for you, scap gets all of them [22:20:49] lol [22:20:49] lol [22:21:03] it's N-2 I think on purpose [22:21:23] * twentyafterfour puts in a request for more cores [22:21:28] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 41235 MB (3% inode=98%) [22:21:35] wikitech seems to be broken: https://wikitech.wikimedia.org/wiki/Deployments [22:21:35] what it really needs is an SSD [22:21:36] why optimize code when you can throw hardware at the problem? [22:21:40] Apologies if you're all aware of this already [22:21:44] bd808: no doubt! [22:21:52] Deskana: yup. we are working on it [22:21:54] Deskana: working on it [22:21:58] Thanks :-) [22:22:16] Didn't we have it on an SSD at some point? [22:22:37] we tried to do it in a ram disk and ran out of space [22:22:49] I can't think of a reason to use anything other than ssd these days, the price isn't that much higher [22:23:12] that depends greatly on the size of disk you want [22:23:19] We like our rust spinning [22:23:37] server SSD and laptop SSD are very different things [22:23:55] bd808: but servers can use laptop ssds [22:24:36] enterprise ssds might be worth the money for certain really critical data but for what amount to temp files, I'd think consumer grade would be just fine [22:24:37] Not that it matters [22:24:58] If there's a decent reason to have an SSD, Ops will let you have an SSD [22:25:14] robh tends not to be too mean ;) [22:25:37] /dev/mapper/tin--vg-srv 409G 49G 340G 13% /srv [22:25:52] looks like we could do with a 120gb ssd and there would be plenty of room to spare [22:25:53] i guess you know wikitech is down ? 
[22:26:09] matanya: it's not entirely down...working on it it'll be all back to normal shortly [22:26:24] twentyafterfour: Wonder if the server has any spare disk slots [22:26:29] waiting on l10nupdate, which is almost finished now [22:26:34] Open some ops (procurement) tasks? [22:26:38] [a54c414987a10c320331c8cf] 2016-12-14 22:25:34: Fatal exception of type MWException is somewhat down :) [22:26:46] matanya: yeah we are aware [22:26:51] it's only some pages [22:27:02] ok, cool, sorry for the noise [22:27:09] matanya: no worries :) [22:27:40] there was a ticket about wikitech running low on disk [22:27:43] is it just that now? [22:27:50] or did a drive really break [22:28:08] mutante: drive break? no, it's an extension issue [22:28:18] actually it's an l10n issue right now [22:28:31] ok, ignore me, i am talking about silver, the host that wikitech runs on [22:28:32] mutante: drive discussion was related to l10nupdate being slow due to being IOPS bound [22:28:40] gotcha, ok [22:29:03] it stats 11 billion files [22:29:06] bd808: considering that 4 cores of l10nupdate are using 100% cpu, it may not be completely io bound [22:29:41] are they active or in IOWAIT though? [22:29:42] part of the process is definitely io but part seems to be cpu bound as well [22:30:06] top doesn't do a good job of showing iowait [22:30:19] hmm yeah that may be the case [22:30:35] does all this mean i can no longer do Graphoid service scap3 ? [22:31:06] i cannot even look at the schedule :) [22:31:37] yurik: it should be fixed shortly [22:31:45] syncing l10n now [22:31:48] fun fun [22:32:16] yurik: if you need to deploy something then it's ok with me if you go ahead [22:32:18] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:32:27] bd808: iptop is your friend [22:32:28] twentyafterfour, oki, its a service [22:32:45] matanya: sysadmins are my friend :) [22:33:02] *iotop [22:33:11] that too :) [22:33:55] twentyafterfour: there is an issue with echo/flow and icons :https://www.mediawiki.org/w/load.php?modules=ext.echo.emailicons&image=thanks&lang=en&format=rasterized [22:34:11] i filled it, but it might be a blocker [22:34:31] that's a fatal for me too :/ [22:34:49] hmm [22:35:05] "File '/Thanks/ThankYou.png' does not exist" [22:35:06] !log yurik@tin Starting deploy [graphoid/deploy@6900a8f]: (no message) [22:35:06] twentyafterfour: ref : https://phabricator.wikimedia.org/T153261 [22:35:08] Failed to log message to wiki. Somebody should check the error logs. [22:35:28] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:35:44] !log yurik@tin Finished deploy [graphoid/deploy@6900a8f]: (no message) (duration: 00m 39s) [22:35:47] Failed to log message to wiki. Somebody should check the error logs. 
[22:35:50] if you look in fatal monitor you will see some hiding between all the spam [22:37:38] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:39:10] (PS1) Mattflaschen: Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 [22:39:47] (CR) jenkins-bot: [V: -1] Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 (owner: Mattflaschen) [22:40:51] * yurik is done causing chaos and suffering, and switches back to flowers and ponies [22:41:30] yurik: ponies create a lot of shit [22:42:31] (PS2) Mattflaschen: Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 [22:43:11] (CR) jenkins-bot: [V: -1] Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 (owner: Mattflaschen) [22:43:28] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:44:25] wooo finally [22:44:37] wikitech appears to be back [22:44:50] !log l10n rebuilt and wikitech is back [22:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:32] https://www.mediawiki.org/w/load.php?modules=ext.echo.emailicons&image=thanks&lang=en&format=rasterized is still a fatal though [22:45:38] I have no idea where to begin fixing that [22:46:27] we'll probably want to backfill SAL [22:46:34] Heh. 
[22:46:42] hmm [22:46:55] 2016-12-14 22:46:30 [WFHLxgpAMFQAAHJJk0QAAABN] mw1249 mediawikiwiki 1.29.0-wmf.6 exception ERROR: [WFHLxgpAMFQAAHJJk0QAAABN] /w/load.php?modules=ext.echo.emailicons&image=thanks&lang=en&format=rasterized MWException from line 225 of /srv/mediawiki/php-1.29.0-wmf.6/includes/resourceloader/ResourceLoaderImage.php: File '/Thanks/ThankYou.png' does not exist {"exception_id":"WFHLxgpAMFQAAHJJk0QAAABN"} [22:46:57] RoanKattouw: ^^ [22:47:26] !log twentyafterfour@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 16m 05s) [22:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:47] that 16m felt like an hour :-/ [22:47:48] (PS3) Mattflaschen: Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 [22:49:23] legoktm: Yup, reported in -collab too, investigating [22:49:35] T153261: "so should I now revert again because of this or is this a tolerable bug for now as long as it blocks group2?" [22:49:36] T153261: icons missing in flow and echo - https://phabricator.wikimedia.org/T153261 [22:49:51] RoanKattouw: is this unbreak now or just a group2 blocker? [22:51:43] (PS1) Chad: Support python 2/3 octals, not just python2 [mediawiki-config] - https://gerrit.wikimedia.org/r/327383 [22:51:45] (PS1) Chad: scap patch: Remove unused print_function import [mediawiki-config] - https://gerrit.wikimedia.org/r/327384 [22:52:29] twentyafterfour: I suspect UBN but lemme check [22:56:06] twentyafterfour: Hmm somehow there isn't any observable breakage [22:56:13] heh, i was relocating away from office so missed the ssd discussion [22:56:15] ooh wait I see it'll only be broken for emails [22:56:25] twentyafterfour: non enterprise ssds died on us in less than 3 years, we tried using them in the past ;] [22:56:55] so we use enterprise grade intel s3610 line ssds at present. 
[22:57:26] twentyafterfour: Not UBN-worthy but I will get on fixing it right away [22:57:34] We've also investigated the cost difference to shifting all systems to SSD, it's not quite there yet, but every system request does specifically request if ssd or hdd is needed =] [22:58:02] All that's broken is icons in emails [22:58:30] robh: so basically, if we can prove a need for SSDs, we can (probably) get them? [22:58:33] And if you're using gmail or another provider that caches icons, then only in emails that were sent while the breakage occurred I think [22:58:47] Reedy: That is my understanding yep! [22:59:07] if the iops warrant it, SSD is cheaper than SAS on average these days [22:59:21] so if sata doesn't cut it, then we tend to skip right to ssd. [22:59:31] Operations, DBA, MediaWiki-Database: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2874723 (kaldari) @Marostegui: It's taking much longer than I expected. One thing that I've learned is that I'm terrible at esti... [23:01:04] Operations, Performance-Team, scap, HHVM, and 2 others: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#2874727 (greg) [23:01:18] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:01:44] Reedy, your ponies may create a lot of shit, my ponies bring cute graphs [23:01:56] yurik: I'll put some evidence onto Commons [23:02:09] yurik: I'd offer to bring some to the dev summit, but I don't think i'll get through customs [23:02:18] :-P [23:04:05] robh: I'm surprised about the SSDs. Was it a write-heavy use like database that killed them in < 3 years? [23:04:39] even consumer SSDs have good enough longevity (on paper) for something like tin [23:05:36] I thought SSDs have limitations like you can't keep rewriting to it? 
I saw a bug from spotify that caused some ssd to degrade very quickly due to heavy writing. [23:09:29] yeah, repeatedly writing on the same block(s) isn't good [23:09:32] but TRIM and stuff [23:09:42] Yep [23:11:03] (PS1) Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [23:11:28] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:11:45] twentyafterfour: was cp servers using m25 intels [23:12:04] so moderate to high usage [23:12:35] we've also seen failures of the samsungs in restbase due to them just not being quite up to the task compared to the intels (but that point is not under consensus!) [23:13:05] (CR) jenkins-bot: [V: -1] hiera override to skip base icinga for test/decom hosts [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn) [23:16:38] twentyafterfour: I have a fix ( https://gerrit.wikimedia.org/r/327387 ) , currently badgering my team members to get someone to review it. Once it's merged, could you cherry-pick + deploy? I probably shouldn't from this high-latency airplane wifi that drops every now and then [23:16:59] RoanKattouw: of course [23:19:16] Operations, ops-eqiad, netops: asw-a2-eqiad PEM 0 not powered - https://phabricator.wikimedia.org/T153273#2874808 (faidon) [23:24:48] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:34:51] twentyafterfour: The patch finally merged [23:34:59] RoanKattouw: just noticed, I'm on it [23:35:15] !log 327387 merged, deploying refs T153261 [23:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:30] T153261: Icons in Echo emails broken - https://phabricator.wikimedia.org/T153261 [23:36:21] thanks RoanKattouw [23:37:05] !log sent an email to the owners of the biggest home directories on stat1002 [23:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:35] (PS2) Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [23:40:49] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:41:52] Operations, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#2874869 (Dzahn) ^ Here's an approach to make it simpler and just skip the whole base::monitoring part if set in Hiera. [23:45:23] (PS6) Paladox: phabricator: allow mirroring from git.legoktm.com into Diffusion [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [23:53:48] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:58:57] (CR) Catrope: [C: 1] Beta Cluster: Make it easy to run Flow scripts only on enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/327377 (owner: Mattflaschen)