[01:04:22] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509930259 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3783647 keys, up 4 minutes 16 seconds - replication_delay is 1509930259
[01:05:11] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509930304 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3786223 keys, up 5 minutes 1 seconds - replication_delay is 1509930304
[01:05:11] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509930304 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3789839 keys, up 5 minutes 1 seconds - replication_delay is 1509930304
[01:08:11] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3782703 keys, up 8 minutes 4 seconds - replication_delay is 0
[01:08:11] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3780809 keys, up 8 minutes 4 seconds - replication_delay is 0
[01:08:22] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3778245 keys, up 8 minutes 14 seconds - replication_delay is 0
[02:37:46] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 07m 17s)
[02:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:42] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Nov 6 02:44:41 UTC 2017 (duration 6m 55s)
[02:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 779.55 seconds
[03:54:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication
lag: 135.92 seconds
[04:03:35] (PS1) BryanDavis: bd808 home: Add mw helper script and fix ~/bin perms [puppet] - https://gerrit.wikimedia.org/r/389408
[04:04:04] (CR) jerkins-bot: [V: -1] bd808 home: Add mw helper script and fix ~/bin perms [puppet] - https://gerrit.wikimedia.org/r/389408 (owner: BryanDavis)
[04:29:26] (CR) BryanDavis: "Apparently the docker test doesn't use the tox.ini and its exclusions of files from flake8 testing?" [puppet] - https://gerrit.wikimedia.org/r/389408 (owner: BryanDavis)
[04:47:47] (CR) BryanDavis: Rakefile: split in modules, refactor git interaction (1 comment) [puppet] - https://gerrit.wikimedia.org/r/382165 (owner: Giuseppe Lavagetto)
[04:50:19] (CR) BryanDavis: "> Apparently the docker test doesn't use the tox.ini and its" [puppet] - https://gerrit.wikimedia.org/r/389408 (owner: BryanDavis)
[05:32:21] (PS1) BryanDavis: dynamicproxy: Add vhost to access.log [puppet] - https://gerrit.wikimedia.org/r/389409 (https://phabricator.wikimedia.org/T178963)
[06:16:57] !log Deploy alter table on s4 codfw master (with no replication) - T174569
[06:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:04] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:19:25] !log Optimize pagelinks and templatelinks on s6 master (db1061) - T174509
[06:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:31] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[06:24:03] (PS1) Marostegui: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/389410 (https://phabricator.wikimedia.org/T178359)
[06:26:07] anyone from ops with OTRS access?
[06:26:53] well maybe dev stuff
[06:28:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/389410 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:30:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/389410 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:30:37] (CR) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/389410 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:32:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101 - T178359 (duration: 00m 48s)
[06:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:07] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:38:38] !log Stop MySQL on db1101 to copy its content to db1103.s4 - T178359
[06:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:44] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:42:33] (CR) Marostegui: [C: 2] db-eqiad.php: Add db2084:3314 and db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/388425 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:42:37] (PS2) Marostegui: db-eqiad.php: Add db2084:3314 and db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/388425 (https://phabricator.wikimedia.org/T178359)
[06:46:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2084 as multi-instance core host on eqiad file T178553 T178359 (duration: 00m 46s)
[06:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:17] T178553: Support multi-instance hosts on mediawiki-config - https://phabricator.wikimedia.org/T178553
[06:46:17] T178359: Support multi-instance on core hosts -
https://phabricator.wikimedia.org/T178359
[06:46:35] (CR) jenkins-bot: db-eqiad.php: Add db2084:3314 and db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/388425 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:08:19] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359)
[07:09:54] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359)
[07:20:40] (PS2) Dzahn: Restrict HTTP access for racktables [puppet] - https://gerrit.wikimedia.org/r/388461 (owner: Muehlenhoff)
[07:24:50] (CR) Dzahn: [C: 2] Restrict HTTP access for racktables [puppet] - https://gerrit.wikimedia.org/r/388461 (owner: Muehlenhoff)
[07:27:03] (CR) Dzahn: "https://racktables.wikimedia.org/ works" [puppet] - https://gerrit.wikimedia.org/r/388461 (owner: Muehlenhoff)
[07:29:57] (CR) Dzahn: "i think this will break it on labs?" [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[07:37:01] (CR) Dzahn: [C: 2] role::microsites::peopleweb: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388508 (owner: Muehlenhoff)
[07:38:39] (PS2) Dzahn: role::microsites::peopleweb: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388508 (owner: Muehlenhoff)
[07:41:05] (CR) Muehlenhoff: "labs hosts don't enable base::firewall by default and if someone actually sets that up, $CACHE_MISC can e.g. grant the entire WMCS subnet " [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[07:43:57] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734308 (Dzahn) > access to the servers where our services run on (e.g.
hafnium for webperf/navtiming, and tungsten for xhgui - Note: host names to change per T158837). Access is...
[07:44:42] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3736611 (Dzahn)
[07:47:11] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3736613 (Dzahn) Though... Also see T179317#3731710 "perf-roots grants full root access to nearly half the servers in production" @Muehlenhoff
[07:50:40] (CR) Dzahn: [C: 1] admin: add sharvaniharan to researchers group [puppet] - https://gerrit.wikimedia.org/r/388545 (https://phabricator.wikimedia.org/T179611) (owner: Herron)
[07:50:59] (PS3) Dzahn: role::microsites::peopleweb: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388508 (owner: Muehlenhoff)
[08:01:44] Operations, Goal, Patch-For-Review, User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3736633 (MoritzMuehlenhoff) p:Triage>High
[08:02:08] Operations, ops-esams, DC-Ops, netops: cr2-esams temperature warning - https://phabricator.wikimedia.org/T176816#3736634 (MoritzMuehlenhoff) p:Triage>Normal
[08:03:05] Operations, Mail, Surveys: Qualtrics email-LDAP issue - https://phabricator.wikimedia.org/T176666#3736636 (MoritzMuehlenhoff) p:Triage>Normal
[08:03:28] Operations, Scap, Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3736639 (MoritzMuehlenhoff) p:Triage>Normal
[08:03:44] Operations: Backport firejail 0.9.52 for use on Wikimedia appservers - https://phabricator.wikimedia.org/T179022#3736640 (MoritzMuehlenhoff) p:Triage>Normal
[08:04:04] (CR) Dzahn: "https://people.wikimedia.org works" [puppet] -
https://gerrit.wikimedia.org/r/388508 (owner: Muehlenhoff)
[08:04:51] (CR) Dzahn: [C: 2] "ok, cool, yep" [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[08:09:15] (PS2) Dzahn: profile::planet::venus: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[08:12:52] (PS3) Dzahn: profile::planet::venus: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[08:13:22] (CR) Dzahn: [V: 2 C: 2] profile::planet::venus: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[08:14:47] Operations, ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3736656 (MoritzMuehlenhoff) p:Triage>Normal
[08:15:28] Operations, ops-codfw: check mw2176 power supply redundancy - https://phabricator.wikimedia.org/T177639#3736657 (MoritzMuehlenhoff) p:Triage>Normal a:Papaul
[08:15:49] Operations, ops-codfw: check mw2160 power supply redundancy - https://phabricator.wikimedia.org/T177638#3736659 (MoritzMuehlenhoff) p:Triage>Normal a:Papaul
[08:16:39] (CR) Dzahn: "applied - https://en.planet.wikimedia.org/ works" [puppet] - https://gerrit.wikimedia.org/r/388509 (owner: Muehlenhoff)
[08:17:16] Operations, ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3736666 (MoritzMuehlenhoff)
[08:17:18] Operations, ops-ulsfo: check cp4007 power supply redundancy - https://phabricator.wikimedia.org/T177624#3736662 (MoritzMuehlenhoff) Open>declined This server is decommissioned via T176366, so closing the task.
[08:17:27] (CR) Dzahn: [C: 2] role::wikimania_scholarships: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388507 (owner: Muehlenhoff)
[08:17:58] Operations, ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3664980 (MoritzMuehlenhoff)
[08:18:01] Operations, ops-ulsfo: check cp4008 power supply redundancy - https://phabricator.wikimedia.org/T177625#3736668 (MoritzMuehlenhoff) Open>declined This server is decommissioned via T176366, so closing the task.
[08:19:16] Operations, ops-ulsfo: check lvs4002 power supply redundancy - https://phabricator.wikimedia.org/T177623#3736674 (MoritzMuehlenhoff) p:Triage>Normal
[08:19:24] Operations, ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3736675 (MoritzMuehlenhoff) p:Triage>Normal
[08:19:31] (PS2) Dzahn: role::wikimania_scholarships: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388507 (owner: Muehlenhoff)
[08:19:54] Operations: Revisit Pybal depool thresholds for app servers - https://phabricator.wikimedia.org/T178799#3736676 (MoritzMuehlenhoff) p:Triage>Normal
[08:20:02] (PS3) Dzahn: role::wikimania_scholarships: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388507 (owner: Muehlenhoff)
[08:20:30] Operations: Revisit Pybal depool thresholds for app servers - https://phabricator.wikimedia.org/T178799#3703098 (MoritzMuehlenhoff)
[08:20:52] Operations, HHVM, User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#3736679 (MoritzMuehlenhoff) p:Triage>High a:MoritzMuehlenhoff
[08:21:08] Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#3736681 (MoritzMuehlenhoff) p:Triage>Normal a:MoritzMuehlenhoff
[08:26:55] (CR) Dzahn: "applied
https://scholarships.wikimedia.org/ works" [puppet] - https://gerrit.wikimedia.org/r/388507 (owner: Muehlenhoff)
[08:27:38] (CR) Giuseppe Lavagetto: [C: 1] Backends: add support to external backends plugins [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans)
[08:30:44] (PS2) Dzahn: Scap: add codfw host to the list of librenms hosts [puppet] - https://gerrit.wikimedia.org/r/388462 (owner: Ayounsi)
[08:32:31] (CR) Dzahn: [C: 2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/388462 (owner: Ayounsi)
[08:34:41] (CR) Dzahn: "i don't know about vagrant roles, but why do we still have a role that is called "deprecated" in the name?" [puppet] - https://gerrit.wikimedia.org/r/389295 (owner: Paladox)
[08:37:21] (CR) Dzahn: "re-add me when it's getting closer, need to clean gerrit queues" [software/servermon] - https://gerrit.wikimedia.org/r/362600 (owner: Paladox)
[08:38:50] (CR) Dzahn: [C: 1] "i'll leave this one to analytics but should be good" [puppet] - https://gerrit.wikimedia.org/r/388510 (owner: Muehlenhoff)
[08:39:43] (CR) Dzahn: [C: 1] otrs: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388506 (owner: Muehlenhoff)
[08:40:12] (CR) Phuedx: [C: 1] "This LGTM. I'll rebase, merge, and get it on the deployment host." [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[08:41:08] (CR) Dzahn: "thanks Eddie!
since that is a mw-core change it might take a while, please re-add me to this once it's not "DNM" anymore" [puppet] - https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: EddieGP)
[08:41:17] (CR) Phuedx: [C: 1] "Hashar:" [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[08:42:30] (PS1) Marostegui: db-eqiad.php: Repool db1101 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389424
[08:43:53] Operations, ops-eqiad, hardware-requests, Patch-For-Review, Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3736715 (Dzahn) I see that puppet is disabled on this host since 2 weeks, is this necessary for the temp test?
[08:45:53] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1101 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389424 (owner: Marostegui)
[08:47:05] (Merged) jenkins-bot: db-eqiad.php: Repool db1101 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389424 (owner: Marostegui)
[08:47:14] (CR) jenkins-bot: db-eqiad.php: Repool db1101 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389424 (owner: Marostegui)
[08:48:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101 with low weight after maintenance - T178359 (duration: 00m 47s)
[08:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:20] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:54:14] (CR) Dzahn: [C: 1] role::servermon::wmf: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388512 (owner: Muehlenhoff)
[08:54:37] (PS1) Marostegui: tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - https://gerrit.wikimedia.org/r/389425 (https://phabricator.wikimedia.org/T179614)
[08:54:56] (PS2)
Marostegui: tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - https://gerrit.wikimedia.org/r/389425 (https://phabricator.wikimedia.org/T179614)
[08:56:58] (CR) Dzahn: [C: 1] "waiting for a deployment slot defined by Chad :)" [puppet] - https://gerrit.wikimedia.org/r/366910 (owner: Paladox)
[08:57:41] (CR) Marostegui: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler02/8637/" [puppet] - https://gerrit.wikimedia.org/r/389425 (https://phabricator.wikimedia.org/T179614) (owner: Marostegui)
[08:58:37] (CR) Dzahn: [C: -1] "per Hashar and "stretch in CI will be done via Docker container and we are not going to use puppet anymore"" [puppet] - https://gerrit.wikimedia.org/r/386889 (owner: Paladox)
[08:58:40] (CR) Elukey: [C: 1] statistics::sites::pivot: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388510 (owner: Muehlenhoff)
[09:02:33] (PS6) Dzahn: mediawiki: Add explicit dependency on ghostscript [puppet] - https://gerrit.wikimedia.org/r/313963 (owner: Muehlenhoff)
[09:03:27] (CR) Dzahn: [C: 1] "anything that blocked a merge?
already had 2 x +1" [puppet] - https://gerrit.wikimedia.org/r/313963 (owner: Muehlenhoff)
[09:03:32] (PS7) Dzahn: mediawiki: Add explicit dependency on ghostscript [puppet] - https://gerrit.wikimedia.org/r/313963 (owner: Muehlenhoff)
[09:05:57] (PS1) Giuseppe Lavagetto: jobrunner: make refreshlinks jobs low-priority [puppet] - https://gerrit.wikimedia.org/r/389427 (https://phabricator.wikimedia.org/T173710)
[09:06:11] <_joe_> elukey: ^^
[09:06:47] (CR) Marostegui: [C: 2] tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - https://gerrit.wikimedia.org/r/389425 (https://phabricator.wikimedia.org/T179614) (owner: Marostegui)
[09:07:00] (CR) Elukey: [C: 1] jobrunner: make refreshlinks jobs low-priority [puppet] - https://gerrit.wikimedia.org/r/389427 (https://phabricator.wikimedia.org/T173710) (owner: Giuseppe Lavagetto)
[09:07:53] !log removed monitoring/RAID packages from stretch-wikimedia/thirdparty, now in thirdparty/hwraid
[09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:45] Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement authentication/authorization in Kubernetes clusters - https://phabricator.wikimedia.org/T177393#3736730 (akosiaris)
[09:11:07] (CR) Filippo Giunchedi: [C: 1] Fix code commenting out installer apt lines with new repository layout [puppet] - https://gerrit.wikimedia.org/r/388427 (owner: Muehlenhoff)
[09:11:10] Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement authentication/authorization in Kubernetes clusters - https://phabricator.wikimedia.org/T177393#3657094 (akosiaris) p:Triage>Normal Authn wise, current consensus is to keep going with tokenauth authentication method, unt...
[09:13:17] (PS1) Marostegui: db-eqiad.php: Increase db1101 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389428
[09:15:21] (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1101 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389428 (owner: Marostegui)
[09:15:34] (PS2) Giuseppe Lavagetto: jobrunner: make refreshlinks jobs low-priority [puppet] - https://gerrit.wikimedia.org/r/389427 (https://phabricator.wikimedia.org/T173710)
[09:16:28] (CR) Giuseppe Lavagetto: [C: 2] jobrunner: make refreshlinks jobs low-priority [puppet] - https://gerrit.wikimedia.org/r/389427 (https://phabricator.wikimedia.org/T173710) (owner: Giuseppe Lavagetto)
[09:16:35] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:16:37] (Merged) jenkins-bot: db-eqiad.php: Increase db1101 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389428 (owner: Marostegui)
[09:16:47] (CR) jenkins-bot: db-eqiad.php: Increase db1101 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389428 (owner: Marostegui)
[09:17:56] (CR) Alexandros Kosiaris: [C: 2] role::servermon::wmf: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388512 (owner: Muehlenhoff)
[09:17:58] (PS3) Marostegui: db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359)
[09:18:00] (PS2) Alexandros Kosiaris: role::servermon::wmf: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388512 (owner: Muehlenhoff)
[09:18:02] (CR) Alexandros Kosiaris: [V: 2 C: 2] role::servermon::wmf: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388512 (owner: Muehlenhoff)
[09:18:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1101 more weight after maintenance - T178359 (duration: 00m 46s)
[09:18:18]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:19] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:18:50] (PS2) Alexandros Kosiaris: otrs: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388506 (owner: Muehlenhoff)
[09:18:52] (PS1) Elukey: role::druid::*::worker: instrument historical to log metrics [puppet] - https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458)
[09:18:54] (CR) Alexandros Kosiaris: [V: 2 C: 2] otrs: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388506 (owner: Muehlenhoff)
[09:23:06] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:24:19] (PS3) Phuedx: [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[09:24:22] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:24:45] (PS2) Alexandros Kosiaris: etherpad: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388513 (owner: Muehlenhoff)
[09:24:45] ^ gonna merge that beta-cluster only change and update the deployment host
[09:24:51] (CR) Alexandros Kosiaris: [V: 2 C: 2] etherpad: Restrict access [puppet] - https://gerrit.wikimedia.org/r/388513 (owner: Muehlenhoff)
[09:25:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2085 as multi-instance core host on eqiad file T178553 T178359 (duration: 00m 46s)
[09:25:44] (PS1) Dzahn: piwik: restrict access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389430
[09:25:46] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:49] T178553: Support multi-instance hosts on mediawiki-config - https://phabricator.wikimedia.org/T178553
[09:25:49] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:26:19] (CR) jerkins-bot: [V: -1] piwik: restrict access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389430 (owner: Dzahn)
[09:26:24] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2085 in s3 and s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/389413 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:26:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2085 as recentchanges multi-instance host on s3 and s5 T178553 T178359 (duration: 00m 46s)
[09:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:09] (PS2) Dzahn: piwik: restrict access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389430
[09:29:31] (PS1) Dzahn: iegreview: restrict access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389431
[09:31:26] (PS1) Dzahn: ci: restrict http access to cache_misc [puppet] - https://gerrit.wikimedia.org/r/389432
[09:34:01] (PS2) Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458)
[09:34:47] (PS6) Volans: Backends: add support to external backends plugins [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342)
[09:36:30] (CR) Filippo Giunchedi: "> Looks good to me overall.
Is it difficult to adjust counters later" [puppet] - https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: Filippo Giunchedi)
[09:37:45] <_joe_> !log manually running htmlCacheUpdate for commonswiki and ruwiki on terbium, T173710
[09:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:55] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710
[09:38:57] (PS3) Muehlenhoff: Fix code commenting out installer apt lines with new repository layout [puppet] - https://gerrit.wikimedia.org/r/388427
[09:39:14] (PS2) ArielGlenn: rsync xml/sql dumps on an ongoing basis to fallback dumps nfs server and to web server from primary dump nfs host [puppet] - https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893)
[09:39:33] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db2088 in s1 and s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/389434 (https://phabricator.wikimedia.org/T178359)
[09:39:49] (CR) jerkins-bot: [V: -1] rsync xml/sql dumps on an ongoing basis to fallback dumps nfs server and to web server from primary dump nfs host [puppet] - https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) (owner: ArielGlenn)
[09:41:37] (CR) Muehlenhoff: [C: 2] Fix code commenting out installer apt lines with new repository layout [puppet] - https://gerrit.wikimedia.org/r/388427 (owner: Muehlenhoff)
[09:42:47] (CR) Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/8640/" [puppet] - https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[09:44:11] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[09:44:38] (CR) Phuedx: [C: 2] [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[09:44:54] (PS1) Marostegui: mariadb: Enable notifications on db2085 and db2088 [puppet] - https://gerrit.wikimedia.org/r/389435 (https://phabricator.wikimedia.org/T178359)
[09:46:06] (CR) Volans: [C: 2] Backends: add support to external backends plugins [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans)
[09:46:24] (Merged) jenkins-bot: [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[09:46:33] (CR) jenkins-bot: [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: Jdlrobson)
[09:47:33] (PS3) ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893)
[09:48:12] (CR) jerkins-bot: [V: -1] rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) (owner: ArielGlenn)
[09:49:44] (Merged) jenkins-bot: Backends: add support to external backends plugins [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans)
[09:50:39] (PS2) Marostegui: mariadb: Enable notifications on db2085 and db2088 [puppet] - https://gerrit.wikimedia.org/r/389435 (https://phabricator.wikimedia.org/T178359)
[09:50:41] (CR) jenkins-bot: Backends: add
support to external backends plugins [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans)
[09:52:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[09:52:16] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/389434 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:52:21] (PS4) ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893)
[09:52:41] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[09:52:50] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db2088 in s1 and s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/389434 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:54:06] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db2088 in s1 and s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/389434 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:55:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2088 as recentchanges multi-instance host on s1 and s2 T178359 (duration: 00m 46s)
[09:55:32] PROBLEM - DPKG on ms-fe2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:35] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:55:48] (CR) Marostegui: [C: 2] mariadb: Enable notifications on db2085 and db2088 [puppet] - https://gerrit.wikimedia.org/r/389435 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:56:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool
db2088 as recentchanges multi-instance host on s1 and s2 T178359 (duration: 00m 46s)
[09:56:25] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db2088 in s1 and s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/389434 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:32] RECOVERY - DPKG on ms-fe2006 is OK: All packages OK
[09:56:36] that's me ^
[09:57:19] !log roll-upgrade swift to 2.10.2 in codfw and roll-restart for kernel upgrade - T177739
[09:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:24] T177739: Integrate stretch 9.2 point release - https://phabricator.wikimedia.org/T177739
[09:58:27] (Draft2) Jayprakash12345: Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/389433
[09:59:29] (PS3) Jayprakash12345: Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807)
[10:01:27] jouncebot: refresh
[10:01:30] zeljkof: ^^
[10:01:32] I refreshed my knowledge about deployments.
[10:01:35] jouncebot: next [10:01:36] In 3 hour(s) and 58 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T1400) [10:02:40] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389437 [10:02:42] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [10:03:42] PROBLEM - Host ms-fe2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:04:01] RECOVERY - Host ms-fe2006 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [10:04:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [10:04:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389437 (owner: 10Marostegui) [10:05:26] !log uploading linux 4.9.51-1~bpo8+1 for jessie-wikimedia to apt.wikimedia.org [10:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:58] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389437 (owner: 10Marostegui) [10:06:08] oh neat [10:06:29] someone's already updated the deployment host with the bc-only change [10:06:29] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389437 (owner: 10Marostegui) [10:06:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1101 more weight after maintenance - T178359 (duration: 00m 46s) [10:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:01] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:09:11] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 
failures [10:15:20] PROBLEM - DPKG on ms-fe2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:16:20] RECOVERY - DPKG on ms-fe2007 is OK: All packages OK [10:27:37] (03PS3) 10Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) [10:29:57] (03PS1) 10Volans: wmf-auto-reimage: fix wait_puppet_run, use utcnow [puppet] - 10https://gerrit.wikimedia.org/r/389440 [10:34:50] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9 Let linux-meta depend on the latest linux-meta-4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389441 [10:36:50] PROBLEM - DPKG on ms-be2020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:03] (03CR) 10MarcoAurelio: [C: 04-1] "Namespace numbering and naming is reserved for Extension:Translate. Please do not merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [10:37:09] PROBLEM - DPKG on ms-be2023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:09] PROBLEM - DPKG on ms-be2025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:19] heh that's me, sorry about the spam [10:37:19] PROBLEM - DPKG on ms-be2017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:29] PROBLEM - DPKG on ms-be2022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:30] PROBLEM - DPKG on ms-be2024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:39] PROBLEM - DPKG on ms-be2019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:37:50] RECOVERY - DPKG on ms-be2020 is OK: All packages OK [10:37:57] (03PS1) 10Marostegui: Revert "tools.my.cnf.erb: Enable innodb_large_prefix" [puppet] - 10https://gerrit.wikimedia.org/r/389442 [10:38:02] (03PS2) 10Marostegui: Revert "tools.my.cnf.erb: Enable innodb_large_prefix" [puppet] - 10https://gerrit.wikimedia.org/r/389442 [10:38:20] 
RECOVERY - DPKG on ms-be2017 is OK: All packages OK [10:38:35] (03CR) 10Marostegui: [C: 032] Revert "tools.my.cnf.erb: Enable innodb_large_prefix" [puppet] - 10https://gerrit.wikimedia.org/r/389442 (owner: 10Marostegui) [10:38:39] RECOVERY - DPKG on ms-be2019 is OK: All packages OK [10:39:09] RECOVERY - DPKG on ms-be2023 is OK: All packages OK [10:39:09] RECOVERY - DPKG on ms-be2025 is OK: All packages OK [10:39:29] RECOVERY - DPKG on ms-be2022 is OK: All packages OK [10:39:30] RECOVERY - DPKG on ms-be2024 is OK: All packages OK [10:40:59] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:41:10] PROBLEM - Check systemd state on ms-be2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:41:10] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:42:15] (03CR) 10Jayprakash12345: "So Translation: is reserved for Extension:Translate, then Is there need to change Both Number and Name. Or The Task is invaild." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [10:42:21] (03CR) 10Elukey: [C: 031] wmf-auto-reimage: fix wait_puppet_run, use utcnow [puppet] - 10https://gerrit.wikimedia.org/r/389440 (owner: 10Volans) [10:43:43] PROBLEM - Check systemd state on ms-be2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:43:52] (03CR) 10Muehlenhoff: [C: 032] Bump meta package for new ABI in 4.9 Let linux-meta depend on the latest linux-meta-4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389441 (owner: 10Muehlenhoff) [10:44:54] (03PS4) 10Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) [10:46:45] (03PS1) 10Marostegui: db-eqiad.php: Restore db1101 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389447 (https://phabricator.wikimedia.org/T178359) [10:49:53] (03CR) 10Dzahn: [C: 032] iegreview: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389431 (owner: 10Dzahn) [10:50:26] (03PS1) 10Alexandros Kosiaris: Switch to docker-registry.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/389448 [10:51:36] (03PS5) 10Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) [10:52:12] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational [10:52:13] (03PS2) 10Volans: wmf-auto-reimage: fix wait_puppet_run, use utcnow [puppet] - 10https://gerrit.wikimedia.org/r/389440 [10:52:22] RECOVERY - Check systemd state on ms-be2025 is OK: OK - running: The system is fully operational [10:52:23] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational [10:52:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1101 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389447 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:52:52] RECOVERY - Check systemd state on ms-be2022 is OK: OK - running: The system is fully operational [10:52:52] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix wait_puppet_run, use utcnow [puppet] - 10https://gerrit.wikimedia.org/r/389440 (owner: 10Volans) [10:52:56] 
(03CR) 10Giuseppe Lavagetto: [C: 031] Switch to docker-registry.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/389448 (owner: 10Alexandros Kosiaris) [10:53:19] (03PS6) 10Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) [10:53:47] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1101 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389447 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:56:27] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1101 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389447 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:57:22] PROBLEM - DPKG on ms-be2028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:57:33] PROBLEM - DPKG on ms-be2027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:57:43] PROBLEM - DPKG on ms-be2026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:58:36] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [10:58:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1101 original weight after maintenance - T178359 (duration: 00m 46s) [10:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:59:33] RECOVERY - DPKG on ms-be2027 is OK: All packages OK [10:59:43] RECOVERY - DPKG on ms-be2026 is OK: All packages OK [11:00:23] RECOVERY - DPKG on ms-be2028 is OK: All packages OK [11:01:23] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:01:33] PROBLEM - DPKG on ms-be2036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:01:42] PROBLEM - Check systemd state on ms-be2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:03] PROBLEM - DPKG on ms-be2039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:03:04] (03PS2) 10Dzahn: iegreview: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389431 [11:03:22] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:53] PROBLEM - DPKG on ms-be2034 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:04:08] (03PS3) 10Dzahn: iegreview: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389431 [11:05:12] RECOVERY - DPKG on ms-be2039 is OK: All packages OK [11:05:41] (03PS1) 10Muehlenhoff: Reimport of 1.13 to git and bump version [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389453 [11:05:42] RECOVERY - DPKG on ms-be2036 is OK: All packages OK [11:06:09] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db2087 to s6 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) [11:06:53] RECOVERY - DPKG on ms-be2034 is OK: All packages OK [11:06:53] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:07:22] PROBLEM - DPKG on ms-be2032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:07:32] PROBLEM - DPKG on ms-be2037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:07:40] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Add db2087 to s6 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) [11:07:42] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:08:08] (03CR) 10Dzahn: "https://iegreview.wikimedia.org/credits still working (krypon.eqiad.wmnet)" [puppet] - 10https://gerrit.wikimedia.org/r/389431 (owner: 10Dzahn) [11:09:12] PROBLEM - DPKG on ms-be2029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:09:23] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:11:26] (03CR) 10Muehlenhoff: [C: 032] Reimport of 1.13 to git and bump version [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389453 (owner: 10Muehlenhoff) [11:11:26] RECOVERY - DPKG on ms-be2032 is OK: All packages OK [11:11:36] RECOVERY - DPKG on ms-be2037 is OK: All packages OK [11:12:25] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:12:46] thanks volans :) [11:12:54] (03PS1) 10Marostegui: db2087.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/389455 (https://phabricator.wikimedia.org/T178359) [11:12:58] yw :) [11:13:00] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db2087 to s6 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:13:16] RECOVERY - DPKG on ms-be2029 is OK: All packages OK [11:13:26] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:13:27] PROBLEM - DPKG on ms-be2033 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:13:56] PROBLEM - DPKG on ms-be2030 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:14:04] (03PS1) 10Filippo Giunchedi: nrpe: don't check broken packages while dpkg is running [puppet] - 10https://gerrit.wikimedia.org/r/389456 [11:14:13] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2087 to s6 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:14:17] (03CR) 10Marostegui: [C: 032] db2087.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/389455 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:14:19] the fix for the above spam [11:15:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2087 as recentchanges multi-instance host on s6 and s7 T178359 (duration: 00m 48s) [11:15:37] PROBLEM - DPKG on ms-be2031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:43] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [11:16:27] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[openssl],Package[python-swift],Package[swift],Package[swift-container] [11:16:27] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db2087 to s6 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389454 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:16:31] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2087 as recentchanges multi-instance host on s6 and s7 T178359 (duration: 00m 47s) [11:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:46] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[swift-container],Package[swift-account],Package[swift-object] [11:17:27] RECOVERY - DPKG on ms-be2033 is OK: All packages OK [11:17:47] (03PS1) 10Dzahn: phabricator: drop ferm rule to open port 443 [puppet] - 10https://gerrit.wikimedia.org/r/389457 [11:17:56] RECOVERY - DPKG on ms-be2030 is OK: All packages OK [11:18:37] RECOVERY - DPKG on ms-be2031 is OK: All packages OK [11:19:27] (03PS1) 10Dzahn: phabricator: limit http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389459 [11:19:29] !log Compress InnoDB on db1103.s2 - T178359 [11:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:36] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:20:06] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:21:26] RECOVERY - puppet last run on ms-be2029 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:21:36] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:21:46] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:22:43] (03PS1) 10Dzahn: wikibase: restrict http access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389461 [11:23:39] (03PS1) 10Marostegui: s2.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/389462 (https://phabricator.wikimedia.org/T178359) [11:24:51] (03CR) 10Marostegui: [C: 032] s2.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/389462 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:25:17] PROBLEM - DPKG on ms-be2035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:18] PROBLEM - DPKG on ms-be2038 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:37] (03Merged) 10jenkins-bot: s2.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/389462 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:27:44] (03CR) 10Dzahn: [C: 032] "not in prod yet - labs only so far" [puppet] - 10https://gerrit.wikimedia.org/r/389461 (owner: 10Dzahn) [11:29:17] RECOVERY - DPKG on ms-be2035 is OK: All packages OK [11:29:18] RECOVERY - DPKG on ms-be2038 is OK: All packages OK [11:29:30] (03PS2) 10Alexandros Kosiaris: Switch to docker-registry.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/389448 [11:29:38] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch to docker-registry.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/389448 (owner: 10Alexandros Kosiaris) [11:30:57] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational [11:31:06] RECOVERY - Check systemd state on ms-be2027 is OK: OK - running: The system is fully operational [11:31:07] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational [11:31:27] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational [11:31:35] (03CR) 10Alexandros 
Kosiaris: [C: 031] nrpe: don't check broken packages while dpkg is running [puppet] - 10https://gerrit.wikimedia.org/r/389456 (owner: 10Filippo Giunchedi) [11:31:37] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational [11:31:38] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational [11:31:38] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [11:31:38] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational [11:31:38] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational [11:31:47] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational [11:36:03] (03PS4) 10Volans: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) [11:36:27] (03PS4) 10Volans: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) [11:39:35] hmm upgrading linux-meta is failing with [11:39:36] The following packages have unmet dependencies: [11:39:36] linux-meta : Depends: linux-meta-4.9 (= 1.13) but 1.14 is to be installed [11:39:36] E: Unable to correct problems, you have held broken packages. 
[11:44:05] !log Deploy alter table on s6 codfw master (with replication, so this will generate lag on codfw) - T174569 [11:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:11] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [11:45:32] !log oblivian@tin Started deploy [jobrunner/jobrunner@a20d043]: (no justification provided) [11:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:54] <_joe_> ouch, that wasn't intended to happen, a typo [11:45:59] <_joe_> I stopped the deploy [11:46:08] (03PS1) 10Muehlenhoff: Use (= ${binary:Version}) in linux-meta dependencies [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389467 [11:47:08] paladox: thanks, just pushed a fix [11:47:14] thanks :) [11:47:24] (03CR) 10Muehlenhoff: [C: 032] Use (= ${binary:Version}) in linux-meta dependencies [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/389467 (owner: 10Muehlenhoff) [11:50:16] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:36] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:56] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:06] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:16] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:26] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:51] <_joe_> uh? [11:52:16] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:52:16] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:52:26] <|404> looks like a lot is going to come [11:52:36] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:52:37] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:52:38] <|404> moritzm: ^ [11:52:46] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:06] seems puppetdb again [11:53:12] yeah [11:53:21] manual puppet run on rdb1001 worked fine [11:53:26] PROBLEM - puppet last run on diadem is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:30] [Mon Nov 6 11:47:14 2017] Out of memory: Kill process 3012 (java) score 368 or sacrifice child [11:53:36] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:50] akosiaris: FYI ^^^ [11:54:06] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:17] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:47] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:47] PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:55:16] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:04] sigh... 
[11:56:23] at least we are upgrading this quarter.. maybe we will get lucky [11:57:37] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:57:51] yeah [12:03:02] (03PS2) 10Filippo Giunchedi: nrpe: don't check broken packages while dpkg is running [puppet] - 10https://gerrit.wikimedia.org/r/389456 [12:05:28] (03CR) 10Filippo Giunchedi: [C: 032] nrpe: don't check broken packages while dpkg is running [puppet] - 10https://gerrit.wikimedia.org/r/389456 (owner: 10Filippo Giunchedi) [12:06:33] (03CR) 10Muehlenhoff: nrpe: don't check broken packages while dpkg is running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389456 (owner: 10Filippo Giunchedi) [12:11:09] (03CR) 10Muehlenhoff: ci: restrict http access to cache_misc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389432 (owner: 10Dzahn) [12:12:33] (03CR) 10Filippo Giunchedi: nrpe: don't check broken packages while dpkg is running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389456 (owner: 10Filippo Giunchedi) [12:17:38] RECOVERY - puppet last run on db2090 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:18:37] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:19:07] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:19:18] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:19:47] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:19:47] RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:17] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:20:17] 
RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:20:37] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:20:39] (03CR) 10Elukey: [C: 031] piwik: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389430 (owner: 10Dzahn) [12:20:57] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:21:07] RECOVERY - puppet last run on ganeti2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:21:17] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:21:27] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:22:17] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:22:18] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:22:47] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:22:51] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3737107 (10Dispenser) **7 days**. In 7 days the IP information will start disappearing. 
[12:23:27] RECOVERY - puppet last run on diadem is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:25:08] (03PS5) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [12:27:44] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3737108 (10Aklapper) https://phabricator.wikimedia.org/T174342#3559407 already lists the IP ranges I'd say. Link seems to be https://meta.wikimedia.org/wiki/Steward_requests/Ch... [12:29:17] !log installing imagemagick security updates [12:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:41] (03PS6) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [12:39:05] (03PS1) 10Muehlenhoff: Add library hint for imagemagick [puppet] - 10https://gerrit.wikimedia.org/r/389472 [12:39:52] (03PS1) 10Giuseppe Lavagetto: scap_source: also execute scap deploy --init [puppet] - 10https://gerrit.wikimedia.org/r/389473 [12:40:29] (03CR) 10Muehlenhoff: [C: 032] Add library hint for imagemagick [puppet] - 10https://gerrit.wikimedia.org/r/389472 (owner: 10Muehlenhoff) [12:43:03] !log mobrovac@tin Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restarting - T174916 [12:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:11] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [12:45:28] (03PS1) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 [12:46:29] (03CR) 10Herron: [C: 032] admin: add sharvaniharan to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/388545 (https://phabricator.wikimedia.org/T179611) (owner: 10Herron) [12:46:36]
(03PS2) 10Herron: admin: add sharvaniharan to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/388545 (https://phabricator.wikimedia.org/T179611) [12:49:55] (03PS7) 10Elukey: role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) [12:50:35] (03PS3) 10Addshore: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [12:51:44] (03CR) 10Elukey: [C: 032] role::druid::*::worker: instrument historical to log more metrics [puppet] - 10https://gerrit.wikimedia.org/r/389429 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [12:52:51] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-EventLogging, 10Patch-For-Review: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3737173 (10herron) 05Open>03Resolved a:03herron Hi @Sharvaniharan, you have been added to group `res... [12:55:20] !log restarting apache on netmon to pick up openssl update [12:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:58] cumin [12:56:24] (03PS7) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [12:58:48] !log installing openssl security updates [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:26] marostegui: trying to prove something? :-P I'm here and reading channels... so doesn't prove anything ;) [12:59:45] volans: hahaha, just trying.. 
[13:03:22] !log rolling restart of druid historical daemons on druid100[1-6] to apply https://gerrit.wikimedia.org/r/#/c/389429 [13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:04] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3737184 (10ArielGlenn) 05Open>03Resolved This is done now. While there's still moving the cron misc dump jobs to write to the filesystem on the dumpsdata... [13:07:06] 10Operations, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3737186 (10ArielGlenn) [13:08:13] 10Operations, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (10ArielGlenn) 05Open>03Resolved Closing this. There's more to be done as far as misc dump cron jobs running on these new hosts, but we're well past the basic setu... [13:24:30] (03PS3) 10Dzahn: piwik: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389430 [13:26:21] (03CR) 10Dzahn: [C: 032] piwik: restrict access to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/389430 (owner: 10Dzahn) [13:28:17] (03CR) 10Dzahn: "applied on bohrium.eqiad.wmnet - https://piwik.wikimedia.org/ works" [puppet] - 10https://gerrit.wikimedia.org/r/389430 (owner: 10Dzahn) [13:28:56] 10Operations, 10Dumps-Generation: fix up datasets uid - https://phabricator.wikimedia.org/T113467#3737258 (10ArielGlenn) The replacement user is the dumpsgen user, which has a uid < 999. I'm updating specific jobs one at a time, because a chown -R on everything will break some jobs. We have folks that rsync t... 
[13:30:35] (03PS1) 10Marostegui: tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - 10https://gerrit.wikimedia.org/r/389477 (https://phabricator.wikimedia.org/T179614) [13:38:44] (03PS1) 10Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) [13:39:17] (03CR) 10jerkins-bot: [V: 04-1] puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T1400). [14:00:04] Pchelolo, mobrovac, and addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] I'm here [14:00:25] I can swat today [14:00:51] o/ [14:00:58] zeljkof: I'll do my patch (it's last) [14:01:14] addshore: want to do it now while I review other patches? [14:01:27] nah, as it will probably take me some time to do that one :) [14:01:37] !log awight@tin Started deploy [ores/deploy@29905e5]: celery 4 -> ores* (non-production) [14:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:51] zeljkof: for my 2 patches a strict order is required: 388399 must go before 388079 [14:01:56] !log full restart of wdqs eqiad / codfw for multiple upgrades [14:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:03] Pchelolo: ok [14:02:33] addshore: will ping you when I am done [14:02:38] thanks! [14:02:53] Pchelolo: do you want to deploy your patches? 
I can do it, just asking :) [14:03:26] mine has to go after Pchelolo's [14:03:29] zeljkof: I don't think I'm a deployer for mediawiki [14:03:40] i can do them all if you are lazy zeljkof :) [14:04:04] once the first is synced, i'll remove my -1 on the second on [14:04:05] one [14:04:06] (03PS2) 10BBlack: cache_text: reduce applayer timeouts to reasonable values [puppet] - 10https://gerrit.wikimedia.org/r/387225 (https://phabricator.wikimedia.org/T179156) [14:04:19] mobrovac: I can swat, but I always ask :) do you want to swat? [14:05:10] sure, but i'd like your +2 zeljkof for the third one when the time comes (so that i don't self-+2) [14:05:29] (03PS1) 10Arturo Borrero Gonzalez: base: labs: unattended upgrades for wikimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [14:05:43] mobrovac: sure [14:06:07] mobrovac: swat is yours then, ping me when you need me [14:06:14] kk thnx [14:06:45] (03PS2) 10Arturo Borrero Gonzalez: base: labs: unattended upgrades for wikimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) [14:10:07] !jouncebot: next [14:10:32] (03CR) 10Zfilipin: [C: 031] JobQueue: Use EventBus for all "hearted" jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) (owner: 10Mobrovac) [14:11:12] jouncebot: next [14:11:13] In 3 hour(s) and 48 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T1800) [14:12:33] (03CR) 10BBlack: [C: 032] cache_text: reduce applayer timeouts to reasonable values [puppet] - 10https://gerrit.wikimedia.org/r/387225 (https://phabricator.wikimedia.org/T179156) (owner: 10BBlack) [14:13:16] !log mobrovac@tin Synchronized php-1.31.0-wmf.6/extensions/EventBus/EventBus.php: EventBus - Improve logging for T150106 (duration: 00m 48s) [14:13:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:26] T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch - https://phabricator.wikimedia.org/T150106 [14:13:39] Pchelolo: first one synced ^, will go with the config now and then we can check [14:14:03] I'm not sure how we would check though.. [14:14:07] (03CR) 10Mobrovac: [C: 032] [Logging] Enable logstash logging for EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) (owner: 10Ppchelko) [14:14:14] we can't really provoke an exception [14:14:27] indeed [14:14:33] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1003.eqiad.wmnet [14:14:37] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1003.eqiad.wmnet [14:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:53] Pchelolo: but we can wait 10 mins or so after it's synced to make sure nothing explodes [14:15:04] ya sure [14:15:14] I'm looking at the logs [14:15:20] (03Merged) 10jenkins-bot: [Logging] Enable logstash logging for EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) (owner: 10Ppchelko) [14:16:05] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:26] ^ that was me, extending downtime... 
[14:16:58] (03CR) 10jenkins-bot: [Logging] Enable logstash logging for EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) (owner: 10Ppchelko) [14:18:05] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:18:05] (03PS2) 10BBlack: cache_text: reduce inter-cache backend timeouts as well [puppet] - 10https://gerrit.wikimedia.org/r/387228 (https://phabricator.wikimedia.org/T179156) [14:18:42] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: EventBus: Enable logging - T150106 (duration: 00m 47s) [14:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:49] T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch - https://phabricator.wikimedia.org/T150106 [14:19:00] Pchelolo: ok, logs enabled ^, let's wait and monitor a bit before continuing [14:19:26] yup, looking at logstash [14:19:43] (03PS1) 10Ema: logstash: split logstash_syslog hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/389483 [14:20:31] (03CR) 10Gehel: [C: 031] "Great! Much nicer!" [puppet] - 10https://gerrit.wikimedia.org/r/389483 (owner: 10Ema) [14:20:36] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:20:58] !log awight@tin Finished deploy [ores/deploy@29905e5]: celery 4 -> ores* (non-production) (duration: 19m 22s) [14:21:03] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1003.eqiad.wmnet [14:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:14] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet [14:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:22] (03CR) 10Ema: "pcc seems fine https://puppet-compiler.wmflabs.org/compiler02/8646/" [puppet] - 10https://gerrit.wikimedia.org/r/389483 (owner: 10Ema) [14:23:25] 10Operations, 10ops-eqiad, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3737395 (10Cmjohnson) I have connected the Ripe atlas anchor to iron if you want to load the image. [14:23:47] _joe_: does https://gerrit.wikimedia.org/r/#/c/389483/ seem ok to you? 
[14:24:26] (03PS5) 10Filippo Giunchedi: smart: add ensure metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) [14:24:28] (03PS1) 10Filippo Giunchedi: smart: enable SMART health collection in esams [puppet] - 10https://gerrit.wikimedia.org/r/389484 (https://phabricator.wikimedia.org/T86552) [14:24:30] (03PS1) 10Filippo Giunchedi: smart: enable SMART health collection in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) [14:26:30] <_joe_> ema: uhm, yes [14:26:53] <_joe_> ema: I just found a couple things that should be fixed across our hiera defs [14:27:04] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1004.eqiad.wmnet [14:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:11] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1005.eqiad.wmnet [14:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:33] (03CR) 10Giuseppe Lavagetto: [C: 031] logstash: split logstash_syslog hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/389483 (owner: 10Ema) [14:29:11] _joe_: nice, what in particular? Can I go ahead and merge meanwhile or should I wait for the changes you had in mind? [14:30:15] #j #dockerhub [14:30:19] blerghhg [14:30:22] !log awight@tin Started deploy [ores/deploy@29905e5]: restart services on ores* (non-production) [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:44] <_joe_> ema: no go on merging [14:30:54] (03CR) 10Ema: [C: 032] logstash: split logstash_syslog hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/389483 (owner: 10Ema) [14:31:48] <_joe_> I'm just thinking we're using duplicated definitions everywhere :/ [14:32:16] about logstash, definitely! I have a cleanup of that in my backlog... 
[14:32:25] <_joe_> cool [14:32:52] not sure when I'm going to actually get it done... [14:33:15] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1005.eqiad.wmnet [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:11] (03PS3) 10BBlack: cache_text: reduce inter-cache backend timeouts as well [puppet] - 10https://gerrit.wikimedia.org/r/387228 (https://phabricator.wikimedia.org/T179156) [14:34:19] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2001.codfw.wmnet [14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:03] zeljkof: hows it is looking? [14:35:10] !log awight@tin Finished deploy [ores/deploy@29905e5]: restart services on ores* (non-production) (duration: 04m 48s) [14:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] ahh wait, mobrovac has taken over! [14:35:25] zeljkof: can i get a +2 on https://gerrit.wikimedia.org/r/#/c/388139/ ? [14:35:34] addshore: ^ this will be the last one, then you can go [14:35:41] mobrovac: on it [14:35:54] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) (owner: 10Mobrovac) [14:36:33] _joe_: going to sync the switch of those 3 jobs to cp-jq now ^ [14:36:45] <_joe_> mobrovac: ok [14:37:03] <_joe_> we're still executing them on the "normal" jq though, right? 
[14:37:06] (03Merged) 10jenkins-bot: JobQueue: Use EventBus for all "hearted" jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) (owner: 10Mobrovac) [14:37:15] (03CR) 10jenkins-bot: JobQueue: Use EventBus for all "hearted" jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) (owner: 10Mobrovac) [14:38:44] (03PS1) 10Herron: puppet: add conditional for puppetmaster rack path [puppet] - 10https://gerrit.wikimedia.org/r/389490 (https://phabricator.wikimedia.org/T179720) [14:39:19] (03CR) 10jerkins-bot: [V: 04-1] puppet: add conditional for puppetmaster rack path [puppet] - 10https://gerrit.wikimedia.org/r/389490 (https://phabricator.wikimedia.org/T179720) (owner: 10Herron) [14:39:57] !log awight@tin Started deploy [ores/deploy@29905e5]: Force deployment on ores* (non-production) [14:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:39] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2001.codfw.wmnet [14:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:46] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2002.codfw.wmnet [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:18] (03CR) 10WMDE-Fisch: [C: 031] "Revert revert!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [14:42:46] addshore: just 5 more mins, sorry about the hold-up [14:44:23] (03PS8) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [14:45:28] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2002.codfw.wmnet [14:45:29] ack! 
[14:45:33] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2003.codfw.wmnet [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] !log awight@tin Finished deploy [ores/deploy@29905e5]: Force deployment on ores* (non-production) (duration: 05m 43s) [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:42] !log ppchelko@tin Started deploy [cpjobqueue/deploy@e93feba]: (no justification provided) [14:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:23] !log ppchelko@tin Started deploy [cpjobqueue/deploy@e93feba]: Start processing all 'hearted' jobs T175210 [14:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:29] T175210: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210 [14:50:07] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@e93feba]: Start processing all 'hearted' jobs T175210 (duration: 00m 44s) [14:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:20] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch MessageIndexRebuildJob, flaggedrevs_CacheUpdate and deleteLinks jobs to the EventBus infrastructure - T175210 (duration: 00m 46s) [14:50:21] (03Draft1) 10Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - 10https://gerrit.wikimedia.org/r/389492 [14:50:23] (03PS2) 10Paladox: planet: Doin't set http_proxy or https_proxy if on labs [puppet] - 10https://gerrit.wikimedia.org/r/389492 [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:50:45] addshore: we're done, passing the torch on to you 
[14:50:50] (03PS5) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [14:50:55] ack! thanks zeljkof and mobrovac [14:51:07] (03PS4) 10Addshore: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [14:51:10] (03CR) 10Addshore: [C: 032] Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [14:51:20] (03CR) 10jerkins-bot: [V: 04-1] cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [14:51:47] (03CR) 10Ema: cache: send varnish logs to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [14:52:25] (03Merged) 10jenkins-bot: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [14:52:38] (03CR) 10jenkins-bot: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [14:52:51] (03PS1) 10Awight: Reduce the number of Celery workers for ORES stress testing [puppet] - 10https://gerrit.wikimedia.org/r/389493 (https://phabricator.wikimedia.org/T169246) [14:53:10] (03PS6) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [14:53:48] wow nice! 
--^ [14:57:19] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3737484 (10awight) @akosiaris FYI, I'm backing off from our attempt to use 480 workers. The change is ready for review in https://ge... [14:57:50] !log reboot of relforge for kernel + jvm upgrade [14:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:24] (03PS9) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [14:59:23] gehel: patch updated for your reviewing pleasure! https://gerrit.wikimedia.org/r/#/c/388482/ [15:00:10] (03CR) 10Gehel: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [15:00:16] ema: looks great! [15:00:43] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:387978|Set wgWikiDiff2MovedParagraphDetectionCutoff for group0]] PT 1/3 (duration: 00m 47s) [15:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:34] (03PS2) 10Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) [15:01:50] !log addshore@tin Synchronized wmf-config/CommonSettings.php: SWAT [[gerrit:387978|Set wgWikiDiff2MovedParagraphDetectionCutoff for group0]] PT 2/3 (duration: 00m 46s) [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:24] (03CR) 10jerkins-bot: [V: 04-1] puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [15:03:22] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT 
[[gerrit:387978|Set wgWikiDiff2MovedParagraphDetectionCutoff for group0]] PT 3/3 (duration: 00m 46s) [15:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:03] zeljkof: thats SWAT all done! :) [15:04:13] nice :) [15:04:51] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3737490 (10Addshore) [15:05:20] 10Operations, 10Cloud-Services, 10Developer-Relations: Use the term "developer account" for Wikimedia LDAP accounts - https://phabricator.wikimedia.org/T179461#3725481 (10faidon) [15:05:53] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3674094 (10Addshore) Deployed the config change to group0. Again when i pulled the config onto mwdebug1002 only i couldnt make the feat... [15:06:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] puppet: conditionally pin packages to appropriate repo for puppet 4 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [15:09:11] (03CR) 10BryanDavis: "I think this will cause Puppet resource conflicts on hosts which apply ::contint::packages::labs. 
This is an exact duplication of the code" [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [15:12:51] !log ppchelko@tin Started deploy [cpjobqueue/deploy@96f55c6]: USe correct regex for job selection [15:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:20] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@96f55c6]: USe correct regex for job selection (duration: 00m 29s) [15:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:56] (03CR) 10BryanDavis: "> i don't know about vagrant roles, but why do we still have a role" [puppet] - 10https://gerrit.wikimedia.org/r/389295 (owner: 10Paladox) [15:16:30] (03CR) 10Filippo Giunchedi: cache: send varnish logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [15:18:18] (03PS10) 10ArielGlenn: rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) [15:19:18] (03CR) 10ArielGlenn: [C: 032] rsync xml/sql dumps on an ongoing basis from primary dump nf host [puppet] - 10https://gerrit.wikimedia.org/r/389025 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [15:23:36] ignore whines about puppet on dumpsdata1001 please, that would be me [15:24:07] (03PS7) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [15:26:25] (03PS4) 10Herron: puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) [15:26:53] (03CR) 10Alexandros Kosiaris: [C: 032] Reduce the number of Celery workers for ORES stress testing [puppet] - 10https://gerrit.wikimedia.org/r/389493 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) 
[15:26:56] !log reboot of maps-test cluster for jvm and kernel upgrades [15:26:56] (03PS2) 10Alexandros Kosiaris: Reduce the number of Celery workers for ORES stress testing [puppet] - 10https://gerrit.wikimedia.org/r/389493 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [15:26:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Reduce the number of Celery workers for ORES stress testing [puppet] - 10https://gerrit.wikimedia.org/r/389493 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [15:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:39] (03PS8) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [15:30:24] (03CR) 10Ema: cache: send varnish logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [15:30:58] akosiaris: thanks! [15:31:11] !log awight@tin Started deploy [ores/deploy@29905e5]: restart services on ores* (non-production) [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:23] 10Operations, 10Traffic: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3737533 (10BBlack) Copying in some commentary I was accidentally putting in the wrong ticket (the private purchasing one) for the new globalsign certs over the past few days: >>! In T178831#3731564, @BBlack wro... 
[15:36:35] (03CR) 10Herron: [C: 032] puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [15:36:39] (03PS5) 10Herron: puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) [15:36:56] (03PS1) 10BBlack: TLS: switch US sites to 2017 globalsign cert [puppet] - 10https://gerrit.wikimedia.org/r/389501 (https://phabricator.wikimedia.org/T178173) [15:38:03] awight: done. I 've just ran puppet on all of ores1XXX boxes [15:38:09] moritzm: hello can i take down mw2176 and mw2160 for troubleshooting ? [15:38:09] awight: you 're welcome [15:38:17] (03PS4) 10BBlack: cache_text: reduce inter-cache backend timeouts as well [puppet] - 10https://gerrit.wikimedia.org/r/387228 (https://phabricator.wikimedia.org/T179156) [15:39:02] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3737557 (10mobrovac) [15:39:15] (03CR) 10BBlack: [C: 032] cache_text: reduce inter-cache backend timeouts as well [puppet] - 10https://gerrit.wikimedia.org/r/387228 (https://phabricator.wikimedia.org/T179156) (owner: 10BBlack) [15:41:38] akosiaris: My hopes for the next week or so are that we can stabilize something at least as performance as scb*, and switch over to using the new cluster. There are too many blockers to try to get maximum performance for the hardware, so I’m planning to work on that tuning, bugfixing, and optimization incrementally. LMK if that sounds good to you. [15:43:23] 10Operations, 10Cloud-Services, 10Developer-Relations: Use the term "developer account" for Wikimedia LDAP accounts - https://phabricator.wikimedia.org/T179461#3737562 (10bd808) >>! In T179461#3731159, @Tgr wrote: > (Well-actually sidetrack: AIUI we basically implemented an OpenID Connect lookalike on top of... 
[15:43:40] (03CR) 10Filippo Giunchedi: [C: 031] cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [15:43:56] !log awight@tin Finished deploy [ores/deploy@29905e5]: restart services on ores* (non-production) (duration: 12m 45s) [15:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:27] awight: fine by me. This looks like it's going to take a lot of time and it's probably better that we put newer hardware in use that have it there doing nothing [15:44:37] cool. [15:45:02] akosiaris: Meanwhile, I’m having repeated failures to deploy to ores1008 and 1009. I can ask releng, but thought I might mention it to you first. [15:45:31] what kind of failures ? [15:46:21] akosiaris: Hard to say, but maybe “fatal: reference is not a tree: 26ecbfa5181ed640e860f583afafefe642e2bc09”. Happily deployed to the other boxes, though. [15:46:27] Disk space is OK. [15:46:51] that sounds like scap problems [15:47:21] can I try to reproduce ? [15:47:27] I guess a deploy can't harm, right ? [15:47:31] please do! [15:47:48] fwiw, I’ve been having to use scap -r HEAD for some reason. [15:48:44] maybe it defaults to master ? [15:49:03] and given the CELERY_4 branch it just does not do what you want it to ? [15:49:16] !log akosiaris@tin Started deploy [ores/deploy@29905e5]: testing deploy [15:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] !log akosiaris@tin Started deploy [ores/deploy@29905e5]: testing deploy [15:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:33] akosiaris: It’s been acting up even when I’m on master… strange business. Other aspects of the deployment are fragile, releng is aware. For example, the repos are too darn big and sometimes it just won’t fetch. [15:50:54] ah yes... 
the pickled objects [15:51:00] 100M+ sometimes [15:51:25] yeah it’s 2GB or so just of the compressed .git [15:52:38] Another thing that’s acting up sometimes is that /srv/deployment/ores/venv/lib/python3.4/site-packages/ isn’t rebuilt correctly. I’m doing an evil thing where I upgrade pip in order to use newish wheels (multilinux, specifically). [15:52:52] These are just notes so you don’t think you’re the crazy one :D [15:54:21] it's indeed taking a long time ... [15:56:59] I want to ask questions, but I kinda don't want to know the answers. I'm gonna pretend I didn't see a reference above to multi-gigabyte git repos :P [15:57:23] bblack: oh no, you made the mistake, now you are going to pay for it [15:57:27] sooooooooooooooooooooo...... [15:57:37] bblack, the answer is git-lfs isn't ready for us yet [15:57:45] ores is shipping via git the machine learning models [15:57:50] and having versioned assets is really useful [15:57:52] which are picked python objects IIRC [15:57:54] 10Operations, 10ops-codfw: check mw2176 power supply redundancy - https://phabricator.wikimedia.org/T177639#3737580 (10Papaul) @MoritzAccountTest The system is showing that PS1 Present/failure. I need the system put in maintenance mode for me to troubleshoot. Thanks. {F10643894} [15:57:59] pickled* [15:58:04] and they are huge :-) [15:58:22] lol /me offers bblack some ice [15:58:39] pickled python objects I assume are just some serializing of a giant in-memory data structure that was the result of the machine learning? [15:58:49] bblack, yup [15:58:50] yes [15:58:55] (03PS1) 10ArielGlenn: fixups for the dumps generator rsync service [puppet] - 10https://gerrit.wikimedia.org/r/389506 [15:58:59] 10Operations, 10ops-codfw: check mw2160 power supply redundancy - https://phabricator.wikimedia.org/T177638#3737582 (10Papaul) @MoritzAccountTest same as mw2176 The system is showing that PS1 Present/failure. I need the system put in maintenance mode for me to troubleshoot. Thanks. 
{F10643894} [15:59:09] halfak: btw, remind me why git-fat was rejected [15:59:12] There’s an XML standard for serialized models, but I don’t think it’ll save us any storage. [15:59:17] (could they be compressed?) [15:59:18] no using it in labs [15:59:20] :\ [15:59:28] bblack, yes, a bit. We've started to do that. [15:59:29] git-fat doesn't work on labs ? sigh [15:59:39] ^ right [15:59:44] akosiaris: github supports git-lfs natively, and gerrit has a plugin that we might be able to integrate. [16:00:01] (03CR) 10jerkins-bot: [V: 04-1] fixups for the dumps generator rsync service [puppet] - 10https://gerrit.wikimedia.org/r/389506 (owner: 10ArielGlenn) [16:01:13] hm... it's been quite so time since I used git-lfs... it was relatively crude back then [16:01:20] I am guessing it has improved [16:02:35] (03PS2) 10ArielGlenn: fixups for the dumps generator rsync service [puppet] - 10https://gerrit.wikimedia.org/r/389506 [16:04:40] That’s a good point—I haven’t tried git-lfs to see what the everyday experience will be like. halfak: have you tried it? [16:06:16] akosiaris: ores1001 was deployed with an old (master) venv, donno how that happened. [16:06:32] awight: not my doing... I am still waiting [16:06:40] haha perfect [16:07:39] akosiaris: uh oh—you’re deploying experimental code to actual production. [16:07:49] what ? [16:07:56] damn... cancelling [16:07:57] sorry that was the main thing I should have mentioned. -l “ores100*” is needed. [16:08:17] ok, emergency redeploying of master then [16:08:23] I can do it [16:08:26] shall I? [16:08:28] yes please [16:08:31] kk [16:08:33] thanks! 
[16:09:01] (03PS2) 10Marostegui: tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - 10https://gerrit.wikimedia.org/r/389477 (https://phabricator.wikimedia.org/T179614) [16:09:05] !log awight@tin Started deploy [ores/deploy@82a13ae]: Redeploy ORES master to scb* [16:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:41] akosiaris: I was going to say, it’s really awkward to be doing semi-experimental deployments to ores*… [16:09:52] (03CR) 10Marostegui: [C: 032] tools.my.cnf.erb: Enable innodb_large_prefix [puppet] - 10https://gerrit.wikimedia.org/r/389477 (https://phabricator.wikimedia.org/T179614) (owner: 10Marostegui) [16:09:57] Totally my fault for not mentioning the “-l” [16:10:50] awight: yeah, I can't say we ever said in scap requirements it should easily support running multiple versions of software on different clusters [16:10:56] it does, but it's awkward [16:11:19] and wasn't strongly required since we never had up to now that use case [16:11:43] Maybe we were talking about the environment vs group thing, I’m not sure if that would help. But like I was saying a minute ago, I’m very interested in getting onto the new hardware, for keeps. 
[16:13:18] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow disabling proxy masquerade_all [puppet] - 10https://gerrit.wikimedia.org/r/389509 [16:15:02] (03PS2) 10BBlack: TLS: switch US sites to 2017 globalsign cert [puppet] - 10https://gerrit.wikimedia.org/r/389501 (https://phabricator.wikimedia.org/T178173) [16:16:31] (03PS9) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [16:16:33] (03PS2) 10Alexandros Kosiaris: kubernetes: Allow disabling proxy masquerade_all [puppet] - 10https://gerrit.wikimedia.org/r/389509 [16:16:40] (03CR) 10Ema: [V: 032 C: 032] cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [16:16:47] (03CR) 10Ayounsi: [C: 032] uwsgi: fix dependency for stretch [puppet] - 10https://gerrit.wikimedia.org/r/388750 (owner: 10Ayounsi) [16:16:52] (03PS2) 10Ayounsi: uwsgi: fix dependency for stretch [puppet] - 10https://gerrit.wikimedia.org/r/388750 [16:18:24] (03PS1) 10Umherirrender: Install php5.5-iconv, php5.5-xml and php5.5-zip [puppet] - 10https://gerrit.wikimedia.org/r/389512 (https://phabricator.wikimedia.org/T179772) [16:19:38] RECOVERY - IPMI Sensor Status on mw2176 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:20:52] (03PS1) 10Ema: Revert "cache: send varnish logs to logstash" [puppet] - 10https://gerrit.wikimedia.org/r/389514 [16:21:02] (03CR) 10Ema: [V: 032 C: 032] Revert "cache: send varnish logs to logstash" [puppet] - 10https://gerrit.wikimedia.org/r/389514 (owner: 10Ema) [16:21:58] PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:21:58] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:22:08] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:22:22] that's me, change reverted ^ [16:22:48] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:24:18] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:24:28] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:24:49] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:25:09] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:25:16] !log awight@tin Finished deploy [ores/deploy@82a13ae]: Redeploy ORES master to scb* (duration: 16m 10s) [16:25:18] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:00] (03PS3) 10BBlack: TLS: switch US sites to 2017 globalsign cert [puppet] - 10https://gerrit.wikimedia.org/r/389501 (https://phabricator.wikimedia.org/T178173) [16:26:00] akosiaris: ORES on scb* is back to normal, feel free to experiment on ores* if you wish [16:26:07] awight: ok thanks ! [16:26:10] sorry for the mistake [16:26:17] * awight resets deadfall trap [16:26:21] (03CR) 10BBlack: [C: 032] TLS: switch US sites to 2017 globalsign cert [puppet] - 10https://gerrit.wikimedia.org/r/389501 (https://phabricator.wikimedia.org/T178173) (owner: 10BBlack) [16:26:56] !log Switching unified TLS certificate in the US [16:26:58] RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:26:58] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:08] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:18] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:27:19] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/80-varnish.conf] [16:27:48] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:28:55] (03PS3) 10Ayounsi: uwsgi: fix dependency for stretch [puppet] - 10https://gerrit.wikimedia.org/r/388750 [16:29:18] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:29:48] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:30:08] PROBLEM - Host mw2160 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:18] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:32:09] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:32:13] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Allow disabling proxy masquerade_all [puppet] - 10https://gerrit.wikimedia.org/r/389509 (owner: 10Alexandros Kosiaris) [16:32:17] (03PS3) 10Alexandros Kosiaris: kubernetes: Allow disabling proxy masquerade_all [puppet] - 10https://gerrit.wikimedia.org/r/389509 [16:32:19] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:32:19] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Allow disabling proxy masquerade_all [puppet] - 10https://gerrit.wikimedia.org/r/389509 (owner: 10Alexandros Kosiaris) [16:32:58] RECOVERY - Host mw2160 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:33:53] (03PS1) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/389515 (https://phabricator.wikimedia.org/T63782) [16:34:18] PROBLEM - Host mw2176 is DOWN: PING CRITICAL - Packet loss = 100% [16:40:22] <_joe_> what's up with those hosts in codfw? 
[16:40:48] RECOVERY - Host mw2176 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:41:25] <_joe_> oh, reboots [16:45:52] (03Abandoned) 10Chad: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [16:45:56] (03PS1) 10Ema: 5.1.3-1wm2: backport 'record-prefix support for varnishncsa' [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 [16:46:02] _joe_: hw analysis by papaul, there's two hosts with non-redundant power supplies [16:47:57] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3737719 (10Gilles) [16:47:58] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3737717 (10Gilles) 05stalled>03Open Yes, it was to avoid "random" network requests happening. Our test is over, you can resume ta... [16:48:01] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm2: backport 'record-prefix support for varnishncsa' [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/389516 (owner: 10Ema) [16:49:38] RECOVERY - IPMI Sensor Status on mw2160 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:49:44] 10Operations, 10ops-codfw: check mw2176 power supply redundancy - https://phabricator.wikimedia.org/T177639#3737729 (10Papaul) - Swapped PS1 and PS2 - Upgrade firmware Firmware Version = 2.32.31.30 Firmware Build = 03 Last Firmware Update = 11/06/2017 15:15:53 no more PSU error {F10644... 
[16:50:08] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:51:11] !log Optimize pagelinks and templatelinks on s2 master - db1054 - T174509 [16:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:18] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [16:51:55] (03CR) 10Chad: "Lgtm, one minor-ish question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389473 (owner: 10Giuseppe Lavagetto) [16:53:24] 10Operations, 10ops-codfw: check mw2160 power supply redundancy - https://phabricator.wikimedia.org/T177638#3737754 (10Papaul) same fix as in T177639 [16:54:28] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:00:43] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3737765 (10akosiaris) [17:00:45] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Implement authentication/authorization in Kubernetes clusters - https://phabricator.wikimedia.org/T177393#3737763 (10akosiaris) 05Open>03Resolved a:03akosiaris [17:04:41] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3737774 (10Jgreen) [17:04:43] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3737772 (10Jgreen) 05Open>03Resolved This is done, all codfw hosts have bonded ethernet now. [17:09:27] akosiaris: I’ll go ahead and play with ores* some more, unless you wanted to do anything more? 
[17:20:55] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3737824 (10Tobi_WMDE_SW) Woohh.. first test looking good on test.wikipedia.org https://test.wikipedia.org/w/index.php?title=Test_page_l... [17:21:37] zhuyifei1999_: o/ [17:21:50] yeah? [17:22:23] hi! if you have a bit of time I'd like to ask you some questions about https://commons.wikimedia.org/wiki/User:FlickreviewR_2 and related [17:22:56] sure [17:24:21] thanks! So we are working on reducing the htmlCacheUpdate and refreshLinks for T173710, there seems to be a huge backlog for commons and ruwiki [17:24:21] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 [17:25:10] ok [17:25:12] I am wondering if the bot might be related to the queue growth (not blaming the bot, it is an issue with the underlying job queues architecture) [17:25:30] but if so it might help a bit to stop it and wait for the queue to recover [17:25:46] there was a recent change in the review templates [17:26:06] <_joe_> elukey: I don't think it's that bot's fault after all, most changes I see flying are wikidata changes from October 25th [17:26:19] https://commons.wikimedia.org/w/index.php?title=User:FlickreviewR/reviewed-pass&action=history [17:26:43] I'm not sure how much impact that has [17:27:19] (that's the thing I can think of that might have made the queue grow large) [17:27:27] awight: feel free, I am in a meeting currently [17:28:41] _joe_ sure, I was following up on the fact that the dbs are seeing a lot of changes for pagelinks more or less related to what the bot does, so I thought to ask :) [17:28:56] if you want I could stop the bot for a day to observe the impact, but the flickr review queue could get backlogged [17:30:03] (03CR) 10Alexandros Kosiaris: [C: 031] puppet: add conditional for puppetmaster rack path [puppet] - 10https://gerrit.wikimedia.org/r/389490 
(https://phabricator.wikimedia.org/T179720) (owner: 10Herron) [17:31:55] zhuyifei1999_: it might be enough even for a couple of hours if possible [17:32:08] PROBLEM - puppet last run on labcontrol1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:59] !log awight@tin Started deploy [ores/deploy@29905e5]: restart services on ores* (non-production) [17:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] ok I blocked it 6 hours https://commons.wikimedia.org/wiki/Special:Contributions/FlickreviewR_2 [17:33:28] zhuyifei1999_: thanks a lot! [17:33:53] (easier than logging into toolforge then killing the job & removing crontab :P) [17:33:54] np [17:34:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] puppet: conditionally pin packages to appropriate repo for puppet 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: 10Herron) [17:48:10] 10Operations, 10ops-codfw: check mw2176 power supply redundancy - https://phabricator.wikimedia.org/T177639#3665292 (10RobH) Papaul, Please advise, did the error clear after the #1 and #2 power supplies were swapped, or not until after firmware update? 
[17:48:31] !log T179083: Restarting Cassandra instances on restbase2001.codfw.wmnet [17:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:38] T179083: Cassandra 3.11.0 schema creation seems unreliable - https://phabricator.wikimedia.org/T179083 [17:50:17] !log awight@tin Finished deploy [ores/deploy@29905e5]: restart services on ores* (non-production) (duration: 17m 18s) [17:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:08] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:51:18] ^^^ that's me [17:51:38] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [17:51:40] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3738004 (10RobH) a:03RobH I'll old yeller this server. 
[17:52:08] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-08-17 16:11:39 +0000 (expires in 283 days) [17:52:38] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [17:56:58] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [17:56:59] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:57:58] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042 [17:57:59] RECOVERY - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-b valid until 2018-08-17 16:11:40 +0000 (expires in 283 days) [18:00:05] gehel: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:28] WMF sites really slow for anyone else? 
[18:01:07] starting around 10 minutes ago, every WMF site is hanging, and if it does load it's missing stylesheets [18:01:20] 10Operations, 10Ops-Access-Requests, 10Performance-Team: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3738069 (10herron) [18:01:29] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.164 and port 9042: Connection refused [18:01:59] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:02:08] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:02:55] ~50% packet loss when I ping en.wikipedia.org [18:02:59] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2018-08-17 16:11:42 +0000 (expires in 283 days) [18:03:29] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [18:04:07] musikanimal: Which cache server are you hitting? [18:04:26] how do I figure that out? [18:04:43] musikanimal: it would help a lot if you could share a traceroute from your host to en.wikipedia.org [18:04:45] hmm, pinging isn't working either [18:05:10] sorry, happy to do that if you tell me how! [18:05:28] tools.wmflabs.org also isn't loading, which I think are on different servers, no? 
[18:05:49] well, probably a network issue if pinging is having problems too [18:05:57] musikanimal: if you are running a linux box it should be sufficient to run 'traceroute en.wikipedia.org' [18:06:06] it might be that your ISP is having troubles [18:06:08] yeah I thought maybe it was my internet but I tried pinging/loading other websites and all is fine [18:07:25] !log gehel@tin Started deploy [wdqs/wdqs@fa8bb12]: WDQS GUI and Blazegraph update [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:41] https://www.irccloud.com/pastebin/pe4q4npi/ [18:08:30] okay actually maybe this is my ISP [18:08:56] bing.com isn't loading either, not that I ever use it [18:09:15] but Google/Facebook/Twitter/etc are fine [18:09:19] !log gehel@tin Finished deploy [wdqs/wdqs@fa8bb12]: WDQS GUI and Blazegraph update (duration: 01m 53s) [18:09:21] ignore me! [18:09:21] i'm getting 0% packet loss for google.com but >50% for craigslist.org [18:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:46] actually reno.craigslist.org is what i tested, for the record [18:10:05] https://www.irccloud.com/pastebin/OaLh06qg/ [18:11:44] 0% packet loss on reno.craigslist.org for me [18:11:59] yeah, CL is better for me now too, just tried again [18:12:08] en.wikipedia.org still bad [18:12:16] okay so not just me! [18:12:24] what about bing.com ? [18:12:25] SMalyshev: wdqs deployment completed, tests are green [18:12:48] bing.com seems fine here [18:13:01] hmm weird [18:13:19] there's possible issues related to Telia [18:13:28] I'm assuming we have different "availability zones", which is probably part of the reason we're getting different results [18:13:29] can you get an mtr? [18:13:45] "mtr --report-wide --aslookup en.wikipedia.org" [18:13:51] gehel: updater too? 
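(Editor's aside on the packet-loss figures quoted above: a minimal sketch of pulling the loss percentage out of a ping summary so several destinations can be compared side by side. It assumes the summary line format printed by Linux iputils ping; the sample text and the helper `packet_loss_percent` are made up for illustration and are not taken from the pastebins in the log.)

```python
import re


def packet_loss_percent(ping_output):
    """Extract the packet-loss percentage from ping's summary line, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


# Made-up sample in the style of Linux iputils ping output.
sample = (
    "--- en.wikipedia.org ping statistics ---\n"
    "20 packets transmitted, 10 received, 50% packet loss, time 19031ms\n"
)

print(packet_loss_percent(sample))
```

A per-hop view such as the mtr report requested above is still the better tool for locating where along the path the loss starts; this only summarises end-to-end loss.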
[18:14:00] we need to update the updater probably [18:14:05] SMalyshev: restart in progress [18:14:44] bblack: I'm about to deactivate telia in ulsfo, waiting for that user's mtr [18:14:45] well guess I get to take the day off! =P [18:14:45] SMalyshev: done [18:15:18] !log deactivating BGP session to telia in ulsfo [18:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:24] (03PS3) 10ArielGlenn: fixups for the dumps generator rsync service [puppet] - 10https://gerrit.wikimedia.org/r/389506 [18:17:36] https://www.irccloud.com/pastebin/iZzv4cGA/ [18:17:37] (03CR) 10ArielGlenn: [C: 032] fixups for the dumps generator rsync service [puppet] - 10https://gerrit.wikimedia.org/r/389506 (owner: 10ArielGlenn) [18:17:40] bblack: ^ [18:21:27] mdholloway: thank you [18:21:38] XioNoX: no problem! [18:21:56] gehel: ok, great [18:22:02] mdholloway: is it okay to share your mtr with our provider, I'm opening a ticket with them [18:22:04] ? [18:22:17] XioNoX: that's fine [18:22:18] RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:23:27] mdholloway: I asked some other people, it seems us East Coast folks are all affected [18:24:49] musikanimal: i suspected as much :[ [18:25:20] maybe the Virginia server farm? [18:25:30] I wonder if Bing is hosted there [18:30:38] like half the internet resides in Virginia, but it seems more a comcast-specific issue so far... [18:32:58] maybe more than that heh [18:33:25] in an article from 2016: Today, up to 70 percent of Internet traffic worldwide travels through this region [meaning Tyson's Corner / Northern Virginia], as the Loudon county economic-development board cheerfully notes in its marketing materials. [18:34:06] bblack: yes one other person I talked to had Comcast [18:34:18] I have RCN though, I don't think they're the same [18:34:35] and one other person said they have Bell (which I didn't realize existed anymore!) 
[18:35:21] yeah I don't think any of the RBOCs even still exist in name, do they? [18:35:45] not sure, he just told me he had Bell [18:35:51] he's in Montreal [18:36:05] everyone I've talked to is on the East Coast, BUT I have one friend here in NYC who isn't having any problems [18:36:10] https://en.wikipedia.org/wiki/Bell_System#Today [18:36:24] his ISP is reported as the hospital he works in, which I don't really believe [18:36:30] I wish I could load that page! [18:36:41] lol [18:36:48] oh wait, it's working now! [18:37:28] jouncebot: refresh [18:37:32] I refreshed my knowledge about deployments. [18:37:32] jouncebot: reload [18:38:51] anyways, wikipedia seems to think there's no true independent original RBOCs left with Bell in the name, but also there's probably still lots of "Bell"-branded services out there in various forms (AT&T may still market stuff as Bell Foo in various regions through subsidiaries, and there's also an independent Cincinnati Bell, but they weren't an original RBOC) [18:39:35] mdholloway: all is fine on the East Coast now! [18:40:01] half an hour without Wikipedia. I felt so empty inside! [18:41:20] musikanimal: really? i'm still getting about 50% packet loss pinging enwiki [18:42:09] gerrit's awfully slow to load as well [18:42:43] then i guess ann arbor isn't exactly on the east coast, so... 
:) [18:48:21] http://downdetector.com/status/comcast-xfinity [18:48:48] Comcast outage in progress [18:51:10] yeah works fine for me, but I have RCN [18:51:55] (03CR) 10Gehel: Report 429s to logstash too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388696 (https://phabricator.wikimedia.org/T178533) (owner: 10Smalyshev) [18:53:00] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3738262 (10bd808) [18:53:10] RCN: https://twitter.com/RCNconnects?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor [18:53:29] err [18:53:32] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856#3738265 (10Quiddity) [18:53:33] better link heh: https://twitter.com/RCNconnects/status/927608950425636865 [18:53:54] (03PS2) 10Gehel: Report 429s to logstash too [puppet] - 10https://gerrit.wikimedia.org/r/388696 (https://phabricator.wikimedia.org/T178533) (owner: 10Smalyshev) [18:53:57] I wonder why RCN is calling it widespread east-coast disruption, though [18:54:14] (maybe it is more than comcast?) [18:55:01] (03CR) 10Gehel: [C: 032] Report 429s to logstash too [puppet] - 10https://gerrit.wikimedia.org/r/388696 (https://phabricator.wikimedia.org/T178533) (owner: 10Smalyshev) [18:55:27] oh wow [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T1900). [19:00:04] kaldari: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:17] https://twitter.com/hashtag/outage is interesting too, lots of speculation there [19:00:26] but implicating more-widespread issues than just Comcast or Telia [19:00:28] XioNoX: ^ [19:01:43] 10Operations, 10Ops-Access-Requests: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3738316 (10Krinkle) [19:01:45] 10Operations, 10Performance-Team: Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3738315 (10Krinkle) [19:01:49] "We have figured out how to get internet in the Space Station but @GetSpectrum can't figure out how to get internet in my living room" [19:02:17] 10Operations, 10Performance-Team: Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734285 (10Krinkle) [19:02:19] 10Operations, 10Ops-Access-Requests: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3734308 (10Krinkle) [19:04:03] Is it me, or is WMF sites really slow to load. [19:04:08] Who's doing SWAT today? [19:04:09] Even hangs indefinitely. [19:04:14] Cyberpower678: there's a comcast outage apparently [19:04:14] Cyberpower678: Comcast east coast outage [19:04:24] That'll do it [19:04:30] * Cyberpower678 is on comcast [19:05:03] (03PS5) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:05:07] Odd other sites are loading fine. [19:05:10] kaldari: I'm around if no one is [19:05:13] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:05:22] also I can be there to monitor things [19:05:27] cool [19:05:36] Woot, Pandora finally loaded. [19:05:42] After 10 minutes of waiting. 
[19:05:52] serious stuff Cyberpower678 [19:06:05] Krenair: indeed. [19:06:28] But I'm sitting here unable to debug my code because my connections to Wikipedia keeps timing out. [19:07:05] (03CR) 10Kaldari: [C: 032] Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:07:17] Cyberpower678, debugging in prod? [19:07:31] (03PS6) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:07:38] kaldari: Are you deploying? [19:07:39] (03PS7) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:07:48] Krenair: testing new IABot code. It's kind of environment specific. [19:08:06] * Cyberpower678 is developing IABot v2.0 [19:08:08] relying on templates and stuff I guess? [19:08:17] (03PS8) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:08:18] Cyberpower678: try using mosh, and complain to comcast and try again later? :( [19:08:28] legoktm: what is mosh? [19:08:42] * Cyberpower678 wonders if a VPN will work. [19:08:45] https://mosh.org/ [19:09:11] Cyberpower678: should, it just depends what your route to the VPN looks like. Not all comcast routes are having issues, just some of them [19:09:20] i gave up on comcast for the morning and am on 4g ... [19:09:37] ebernhardson: I can VPN to anywhere in the world. [19:09:47] Amir1: sure [19:09:48] Cyberpower678, depends on the nature of the problem on Comcast's end... if you have a good connection to your VPN you should be good from there. 
but it might be just like connecting to wikipedia [19:10:03] Except I can't seem to open a VPN connection. :/ [19:10:12] well then it won't work :p [19:11:03] VPN connection established. [19:11:07] Krenair: I had my VPN client look for the fastest available server in the US. [19:11:12] (03PS2) 10Kaldari: Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:11:18] * Cyberpower678 tests Wikipedia now. [19:11:38] Running smoothly now. [19:11:47] :S [19:11:49] *:D [19:11:50] Amir1: I'll deploy if you'll help keep an eye on things :) [19:12:02] definitely [19:12:27] kaldari: legoktm: Krenair: ebernhardson: Thanks for the info [19:12:42] I didn't do much [19:12:43] np etc. [19:13:17] (03PS9) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:13:33] Amir1: merge conflict, had to rebase [19:15:53] kaldari: when the sync is finished, let me know to run the maintenance script and check the database and jobs and etc. [19:16:05] will do. thanks! 
[19:17:11] (03CR) 10Kaldari: Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:17:15] (03CR) 10Kaldari: [C: 032] Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:18:28] (03Merged) 10jenkins-bot: Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:18:38] (03CR) 10jenkins-bot: Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) (owner: 10Ladsgroup) [19:18:56] finally merged... [19:21:22] !log kaldari@tin Synchronized wmf-config/InitialiseSettings.php: Updating InitialiseSettings for draftquality data (duration: 00m 48s) [19:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:42] Amir1: Sync is done. Feel free to run script now. [19:21:59] !log ladsgroup@terbium:/srv/mediawiki-staging/php-1.31.0-wmf.4$ mwscript extensions/ORES/maintenance/CheckModelVersions.php --wiki=enwiki (T179596) [19:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:05] T179596: Enable draftquality model in ORES extension for enwiki - https://phabricator.wikimedia.org/T179596 [19:24:30] kaldari: I confirm that it's going smoothly and gets the data we wanted [19:25:21] but it also includes redirects, I don't know how to exclude that (I knew about this and I think it's okay if they are there but we should just exclude them when querying) [19:25:24] yay! So can you remind me exactly what it's currently collecting? Is it just draftquality for new pages in main and draft namespace? 
[19:26:39] Amir1: There's an page_is_redirect column in the page table [19:27:26] kaldari: the first revision of every new page creation in main and draft namespace, I think it also excludes bots but not super sure [19:27:39] cool [19:36:43] 10Operations, 10Ops-Access-Requests: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3738429 (10Krinkle) >>! In T179729#3736609, @Dzahn wrote: > Access is (should be) generally based on puppet role names, not host names. Yeah, we use ro... [19:42:29] Is SWAT done? I'm gonna make a start on a wiki creation if so... [19:47:36] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.5 [keeping static files] (duration: 01m 38s) [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:44] (03PS10) 10Reedy: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:47:52] (03CR) 10Reedy: [C: 032] Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:49:05] (03Merged) 10jenkins-bot: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:49:14] (03CR) 10jenkins-bot: Initial configuration for electcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374528 (https://phabricator.wikimedia.org/T174370) (owner: 10Urbanecm) [19:52:23] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.2 (duration: 02m 49s) [19:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:08] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [19:54:18] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 6.674 second 
response time
[19:55:23] !log reedy@tin Synchronized dblists/: electcomwiki T174370 (duration: 00m 44s)
[19:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:30] T174370: Create elections committee private wiki - https://phabricator.wikimedia.org/T174370
[19:55:39] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:56:00] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: electcomwiki T174370
[19:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:11] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: electcomwiki T174370 (duration: 00m 45s)
[19:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] Reedy: It is that lovely time of the day again! You are hereby commanded to deploy Wiki Creation Window!. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T2000).
[20:00:04] No GERRIT patches in the queue for this window AFAICS.
[20:00:15] (PS1) Reedy: Update IW map [mediawiki-config] - https://gerrit.wikimedia.org/r/389548
[20:00:16] lol
[20:00:45] (CR) Reedy: [C: 2] Update IW map [mediawiki-config] - https://gerrit.wikimedia.org/r/389548 (owner: Reedy)
[20:02:03] (Merged) jenkins-bot: Update IW map [mediawiki-config] - https://gerrit.wikimedia.org/r/389548 (owner: Reedy)
[20:02:05] (PS1) EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - https://gerrit.wikimedia.org/r/389550
[20:02:16] (CR) jenkins-bot: Update IW map [mediawiki-config] - https://gerrit.wikimedia.org/r/389548 (owner: Reedy)
[20:04:22] !log reedy@tin Synchronized wmf-config/interwiki.php: Update IW map (duration: 00m 45s)
[20:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:40] (CR) EBernhardson: "puppet compiler output: http://puppet-compiler.wmflabs.org/8653/" [puppet] - https://gerrit.wikimedia.org/r/389550 (owner: EBernhardson)
[20:08:18] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 873.25 seconds
[20:08:29] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 885.39 seconds
[20:12:28] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
[20:13:20] (PS2) EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - https://gerrit.wikimedia.org/r/389550
[20:17:20] Operations, DBA, Support-and-Safety, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3738541 (Reedy) Open>Resolved Wiki is created, but have not created any accounts for anyone as requested. I had created one f...
[20:19:19] (PS4) Reedy: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:19:29] (CR) jerkins-bot: [V: -1] Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:21:47] (PS5) Reedy: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:23:20] (CR) Reedy: [C: 2] Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:24:33] (Merged) jenkins-bot: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:25:39] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[20:26:34] [c97ad8d44ce946d66c760e26] [no req] MediaWiki\Services\NoSuchServiceException from line 364 of /srv/mediawiki/php-1.31.0-wmf.6/includes/services/ServiceContainer.php: No such service: CognateStore
[20:26:38] fucks sake :P
[20:26:50] (PS2) Herron: puppet: add conditional for puppetmaster rack path [puppet] - https://gerrit.wikimedia.org/r/389490 (https://phabricator.wikimedia.org/T179720)
[20:27:02] (CR) jenkins-bot: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[20:27:47] (CR) Herron: [C: 2] puppet: add conditional for puppetmaster rack path [puppet] - https://gerrit.wikimedia.org/r/389490 (https://phabricator.wikimedia.org/T179720) (owner: Herron)
[20:31:32] Operations, Puppet, Patch-For-Review:
Granular puppet version selection - https://phabricator.wikimedia.org/T178825#3738596 (herron)
[20:31:34] Operations, Puppet, Patch-For-Review, User-Joe, cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3738597 (herron)
[20:31:36] Operations, Puppet, Patch-For-Review, User-Joe: puppet4: puppet master passenger apache backend config changes - https://phabricator.wikimedia.org/T179720#3738594 (herron) Open>Resolved a:herron
[20:33:31] !log reedy@tin Synchronized dblists/: hifwiktionary T173643 (duration: 00m 47s)
[20:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:37] T173643: Create Wiktionary Fiji Hindi - https://phabricator.wikimedia.org/T173643
[20:34:14] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: hifwiktionary T173643
[20:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:20] <_joe_> hey, what did we release a few mnutes ago?
[20:34:32] <_joe_> Reedy: a lot of jobs are failing right now on the jobqueue
[20:34:48] orly?
[20:34:51] <_joe_> uhm it was just a blip
[20:34:54] Hmm
[20:34:57] <_joe_> nevermind, it went away
[20:35:10] Poke me again if it comes back and seems related :)
[20:36:19] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: hifwiktionary T173643 (duration: 00m 45s)
[20:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:37] hoo: About?
[20:40:38] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 871.83 seconds
[20:41:32] Reedy: yes
[20:41:44] hoo: cognate fucked up when running addwiki for a wiktionary
[20:41:49] Not sure if/what to do
[20:42:41] Operations, DBA, Support-and-Safety, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3738636 (Marostegui) Just to confirm: the filtering works fine, the wiki isn't on labs. ``` mysql:root@localhost [(none)]> select @@h...
[20:43:27] (PS1) Reedy: Add hifwiktioanry too labsdb.yaml [puppet] - https://gerrit.wikimedia.org/r/389555 (https://phabricator.wikimedia.org/T173643)
[20:43:38] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.75 seconds
[20:43:50] (PS2) Reedy: Add hifwiktionary too labsdb.yaml [puppet] - https://gerrit.wikimedia.org/r/389555 (https://phabricator.wikimedia.org/T173643)
[20:45:30] Reedy: Ticket?
[20:45:52] hoo: I dumped the stack trace on https://phabricator.wikimedia.org/T179863
[20:49:52] (PS1) Reedy: Update IW map again... [mediawiki-config] - https://gerrit.wikimedia.org/r/389557
[20:50:06] (CR) Reedy: [C: 2] Update IW map again... [mediawiki-config] - https://gerrit.wikimedia.org/r/389557 (owner: Reedy)
[20:50:35] Reedy: What was the exact command line when calling addWiki?
[20:50:51] $cognateSitesPopulation = $this->runChild(
[20:50:51] 'Cognate\PopulateCognateSites',
[20:50:51] "$IP/extensions/Cognate/maintenance/populateCognateSites.php"
[20:50:51] );
[20:50:51] $cognateSitesPopulation->mOptions[ 'site-group' ] = $siteGroup;
[20:50:52] $cognateSitesPopulation->execute();
[20:50:53] !log Deploy alter table on s2 codfw master (db2017) with replication, this will generate lag in codfw - T174569
[20:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:04] So it runs the script from addWiki.php
[20:51:04] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[20:51:06] Reedy: I know… which --wiki was set?
[20:51:12] hifwiktionary
[20:51:15] well... no
[20:51:17] (Merged) jenkins-bot: Update IW map again... [mediawiki-config] - https://gerrit.wikimedia.org/r/389557 (owner: Reedy)
[20:51:17] aawiki
[20:51:20] as all wikis are created under
[20:51:30] (CR) jenkins-bot: Update IW map again... [mediawiki-config] - https://gerrit.wikimedia.org/r/389557 (owner: Reedy)
[20:52:31] In that case it's quite obvious that Cognate is not there
[20:52:39] it's just not loaded
[20:52:42] Well, sure
[20:52:56] We could (inline) call wfLoadExt there
[20:53:01] would probably(?) work
[20:53:17] umm...
[20:53:38] Does that immediately load it?
[20:53:58] It won't load any other config set in commonsettings though
[20:54:00] Operations, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3738701 (Krinkle)
[20:54:08] Operations, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734285 (Krinkle) p:Triage>High
[20:54:33] Ugh, it doesn't
[20:54:35] Operations, Ops-Access-Requests, Performance-Team: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3738705 (Krinkle)
[20:54:39] Right :/
[20:55:00] Operations, Ops-Access-Requests, Performance-Team: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3734308 (Krinkle) p:Triage>High
[20:55:02] Run it as $someWiktionary?
[20:56:13] That would be the obvious workaround
[20:56:23] Which is kinda crappy just for one set of projects
[20:56:30] But the fact we run it as aawiki is crappy anyway :P
[20:56:34] !log reedy@tin Synchronized wmf-config/interwiki.php: Update IW Map again (duration: 00m 46s)
[20:56:39] True… I guess this should be removed from the script and done per hand afterwards
[20:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:18] Like we do with similar stuff
[20:57:36] There's very little that needs running seperately afterwards currently ;)
[20:58:03] Mostly the Wikibase stuff, I guess?! Services? Search?
[20:58:13] Search has been part of it for ages
[20:58:21] So has the wikibase stuff... Though, there's still some outstanding questions there
[20:59:14] We still run it per hand on all wikibase clients (whenever a new Wikibase client is added)
[20:59:45] Which...
[20:59:53] Makes me wonder why we have the code we do in the script now
[20:59:56] Plus why https://gerrit.wikimedia.org/r/#/c/339144/ was CR -1'd
[21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T2100).
[21:00:04] No GERRIT patches in the queue for this window AFAICS.
[21:00:14] https://wikitech.wikimedia.org/wiki/Add_a_wiki#Wikidata
[21:00:15] So confusing
[21:00:23] Operations, Ops-Access-Requests, Performance-Team: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3738741 (Gilles) If what @hoo needs is only a subset of what we have access to, why not create a new group for that?
[21:00:35] Operations, Ops-Access-Requests, Performance-Team (Radar): Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3738743 (Gilles)
[21:01:59] I'm deploying ORES
[21:02:14] fingers crossed, prayers and thoughts and etc.
[21:02:40] Amir1: prayers only help with guns
[21:03:29] Reedy: Also with floods http://neukolln.net/img/memes/the-first-truckload-of-thoughts-ans-prayers-just-arrived-in-texas.jpg
[21:03:43] lol
[21:03:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[21:04:23] Reedy: Yeah, it's a mess
[21:05:49] !log ladsgroup@tin Started deploy [ores/deploy@97a1d80]: Deploying early November (T179837)
[21:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:56] T179837: Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837
[21:06:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[21:09:11] (PS3) Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724)
[21:09:41] (CR) jerkins-bot: [V: -1] puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: Herron)
[21:11:10] (PS4) Herron: puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724)
[21:11:33] (CR) jerkins-bot: [V: -1] puppet: conditionally pin packages to appropriate repo for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/389478 (https://phabricator.wikimedia.org/T179724) (owner: Herron)
[21:12:48] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:12:59] PROBLEM - ores on scb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time
[21:13:22] This is canary
[21:13:39] Operations, Parsoid, Traffic, VisualEditor, HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3738796 (PlanetKrypton) @Arlolra ``` kryptonit3@ubuntu-2gb-nyc3-01:/etc/mediawiki/parsoid$ cat config.yaml # This is a sample configuration file...
[21:14:35] Strange, icinga says canary is dead but it gets 25% traffic and there is not much error showing up
[21:15:36] logging in into scb1002 to do curls
[21:16:09] Oh yeah, gives internal server error
[21:16:12] rolling back
[21:16:22] cc awight
[21:16:36] !log ladsgroup@tin Finished deploy [ores/deploy@97a1d80]: Deploying early November (T179837) (duration: 10m 47s)
[21:16:39] ty, I’ll look at the logs
[21:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:42] T179837: Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837
[21:17:02] ladsgroup@scb1002:~$ curl 0.0.0.0:8081/scores
[21:17:02] Internal Server Errorladsgroup@scb1002:~$
[21:17:27] Amir1: ImportError: No module named 'revscoring.scorer_models'
[21:18:09] 1log re-activating BGP session to telia in ulsfo
[21:18:27] awight: mismatch between versions?
[21:19:26] Amir1: Maybe. revscoring is present and importable.
The next relevant stack frame up is “ Class = yamlconf.import_module(class_path)”
[21:19:52] awight: let's deploy to beta and break it
[21:20:02] !log re-activating BGP session to telia in ulsfo
[21:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:23] Amir1: so, “scorer_models” is a deprecated module…
[21:20:49] awight: I'm guessing we need to rebuild some models
[21:21:52] Operations, Scap, Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3738820 (greg)
[21:21:54] This change happened a few versions ago, though.
[21:22:11] like revscoring 1->2 IIRC
[21:22:43] Also deployment to beta failed, I'm guessing because of space issues
[21:22:47] https://www.irccloud.com/pastebin/ajGPH1Jy/
[21:23:55] Amir1: looks like it, you have to sudo rm -rf some of the older dirs under deploy-cache...revs
[21:25:55] Deleted two
[21:26:44] I really don’t get it. Revscoring 2 was already in production. I looked around at 97a1d80d20a0a4b9794bfdda28e96a96aeb94b1d and it seems correct.
[21:26:52] Where would a revscoring 1 model have come from?
[21:29:21] Amir1: This is really funky, but it’s possible that scap deployed a different revision than it reported. I can reproduce that. It’s creepy as hell.
[21:29:39] Amir1: Maybe try the deployment with “-r HEAD” to ensure that it’s deploying what you think it is?
[21:30:01] awight: I'm pretty sure I did submodule updaye
[21:30:12] it’s not that, something deeper and more disturbing
[21:30:19] also ores beta is completely down now
[21:30:25] deployed there ^_^
[21:30:27] ooh nice
[21:30:32] k ty
[21:31:08] [2017-11-06T21:28:55] ImportError: No module named 'ores'
[21:31:17] WTF
[21:31:55] the ores submodule didn’t clone
[21:32:26] maybe um “scap deploy -r HEAD -f -v ‘dammit’"
[21:32:48] awight, Amir1: any reason for me to wait until you all are done debugging before I deploy Parsoid?
[21:33:16] arlolra: You’re good to go, thank you for asking!
[21:33:21] k
[21:33:29] !log arlolra@tin Started deploy [parsoid/deploy@d1df3c3]: (no justification provided)
[21:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:13] awight: deployed in beta and still the same
[21:35:28] !log Begin stress testing ores* (non-production)
[21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:26] Amir1: I think it’s still a deployment error. Will check in a minute
[21:37:49] Operations, ORES, Scoring-platform-team, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3738870 (BBlack) Open>stalled p:High>Normal The timeout changes above will offer some insulation, and as time...
[21:38:57] Amir1: <3 you actually did it: -v ‘dammit’
[21:39:08] of course :D
[21:42:15] !log arlolra@tin Finished deploy [parsoid/deploy@d1df3c3]: (no justification provided) (duration: 08m 46s)
[21:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:47] !log Updated Parsoid to 6cb77104 (T179579, T175792)
[21:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:55] T179579: Cannot read property 'substring' of null - https://phabricator.wikimedia.org/T179579
[21:48:55] T175792: HTTP 500 on af:Sjabloon:Omreken/Duaal/KafAafVhEaf - https://phabricator.wikimedia.org/T175792
[22:00:05] dapatrick, bawolff, and Reedy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171106T2200).
[22:00:05] No GERRIT patches in the queue for this window AFAICS.
[22:33:15] Operations, DBA, Support-and-Safety, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3739045 (jrbs) >>! In T174370#3738541, @Reedy wrote: > Wiki is created, but have not created any accounts for anyone as requested. I...
[22:34:35] !log Deploy patch for T178451
[22:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:02] Operations, DBA, Support-and-Safety, Patch-For-Review, Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3739050 (KTC) Thanks a lot everyone that have helped get this sorted.
[22:38:37] Operations: Central Auth users who's home wiki is a test wiki should be deleted occassionally - https://phabricator.wikimedia.org/T179877#3739079 (dbarratt)
[22:42:49] (PS1) Greg Grossmeier: betacluster: add -videoscaler to scap dsh group [puppet] - https://gerrit.wikimedia.org/r/389642 (https://phabricator.wikimedia.org/T179688)
[23:03:50] elukey: how did the job queue go? the block will expire soon
[23:05:30] Operations: Central Auth users who's home wiki is a test wiki should be deleted occassionally - https://phabricator.wikimedia.org/T179877#3739079 (Bawolff) I disagree. I think it would be unexpected to delete such users.
[23:07:55] Operations: Central Auth users who's home wiki is a test wiki should be deleted occassionally - https://phabricator.wikimedia.org/T179877#3739171 (dbarratt) >>! In T179877#3739169, @Bawolff wrote: > I disagree. I think it would be unexpected to delete such users. What's more unexpected, that test users are...
[23:10:50] Operations: Central Auth users who's home wiki is a test wiki should be deleted occassionally - https://phabricator.wikimedia.org/T179877#3739173 (Bawolff) Arguably test(2)wiki, despite its name, is a real production wiki.
The surprising part would not be that //"test users are real production users"// but t...
[23:19:03] (CR) Chad: [C: 1] betacluster: add -videoscaler to scap dsh group [puppet] - https://gerrit.wikimedia.org/r/389642 (https://phabricator.wikimedia.org/T179688) (owner: Greg Grossmeier)
[23:22:33] (CR) Chad: [C: 1] "Let's do it tomorrow!" [puppet] - https://gerrit.wikimedia.org/r/366910 (owner: Paladox)
[23:23:52] (CR) Thcipriani: [C: 1] "cherry-picked on beta puppetmaster, working as intended." [puppet] - https://gerrit.wikimedia.org/r/389642 (https://phabricator.wikimedia.org/T179688) (owner: Greg Grossmeier)