[00:01:45] Operations, Scap, Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3721989 (thcipriani) Adding @MoritzMuehlenhoff explicitly since IIRC he did the work to add git 2.11 to jessie-backports.
[00:02:44] Operations, Continuous-Integration-Infrastructure (shipyard): wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3721999 (Legoktm)
[00:25:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[01:04:15] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509411848 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3887789 keys, up 4 minutes 5 seconds - replication_delay is 1509411848
[01:04:46] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509411884 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3889047 keys, up 4 minutes 41 seconds - replication_delay is 1509411884
[01:04:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509411884 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3888869 keys, up 4 minutes 41 seconds - replication_delay is 1509411884
[01:05:55] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3882307 keys, up 5 minutes 43 seconds - replication_delay is 0
[01:06:16] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3881181 keys, up 6 minutes 11 seconds - replication_delay is 0
[01:06:55] RECOVERY - Check health of redis instance on 6480 on rdb2005 is
OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3883545 keys, up 6 minutes 43 seconds - replication_delay is 0
[01:14:45] PROBLEM - Host cp4010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:14:46] PROBLEM - Host cp4009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:31:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[02:34:20] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 10m 33s)
[02:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:30] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.5) (duration: 10m 05s)
[03:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 31 03:08:48 UTC 2017 (duration 7m 18s)
[03:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:24:56] (CR) Andrew Bogott: [C: 2] labs: Rebranding for Puppet failure emails [puppet] - https://gerrit.wikimedia.org/r/387492 (https://phabricator.wikimedia.org/T168480) (owner: BryanDavis)
[03:27:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 699.48 seconds
[03:55:05] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Active, AS1257/IPv4: Active
[04:11:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 166.82 seconds
[04:15:15] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 9, down: 0, shutdown: 2
[04:17:15] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0
[04:23:16] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused:
0
[04:25:16] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active, AS1257/IPv6: Active
[04:44:26] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[04:59:07] (PS5) BBlack: [WIP] backend transaction_timeout [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156)
[04:59:23] (CR) jerkins-bot: [V: -1] [WIP] backend transaction_timeout [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156) (owner: BBlack)
[06:12:31] (PS1) Marostegui: Revert "db-codfw.php: Depool db2039" [mediawiki-config] - https://gerrit.wikimedia.org/r/387517
[06:12:38] (PS2) Marostegui: Revert "db-codfw.php: Depool db2039" [mediawiki-config] - https://gerrit.wikimedia.org/r/387517
[06:14:25] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2039" [mediawiki-config] - https://gerrit.wikimedia.org/r/387517 (owner: Marostegui)
[06:15:39] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2039" [mediawiki-config] - https://gerrit.wikimedia.org/r/387517 (owner: Marostegui)
[06:16:23] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2039" [mediawiki-config] - https://gerrit.wikimedia.org/r/387517 (owner: Marostegui)
[06:17:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2039 (duration: 01m 12s)
[06:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:35] (CR) Dereckson: "A little adventurous to enable both extensions in one change, as imagine there is an issue with one but not the other, you can't easily re" [mediawiki-config] - https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: Jayprakash12345)
[06:24:25] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387522 (https://phabricator.wikimedia.org/T161088)
[06:25:25]
(CR) Dereckson: [C: -1] "Per task description, "@jhsoby-WMNO will ask the (very small) community for consensus soon."." [mediawiki-config] - https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: Zoranzoki21)
[06:26:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387522 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[06:27:24] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387522 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[06:27:33] (CR) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387522 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[06:28:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 - T161088 (duration: 00m 49s)
[06:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:00] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[06:37:16] !log Stop MySQL on db1097 and db1084 - T161088
[06:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:22] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[06:43:33] (PS1) Marostegui: install_server: Reimage db2089 with stretch [puppet] - https://gerrit.wikimedia.org/r/387524 (https://phabricator.wikimedia.org/T170662)
[06:44:25] (CR) Marostegui: [C: 2] install_server: Reimage db2089 with stretch [puppet] - https://gerrit.wikimedia.org/r/387524 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[06:50:40] (PS1) Marostegui: mariadb: Add db2089 to s6 and s5(s8) [puppet] - https://gerrit.wikimedia.org/r/387525 (https://phabricator.wikimedia.org/T178359)
[06:52:15] (PS2) Marostegui: mariadb: Add db2089 to s6 and s5(s8) [puppet]
- https://gerrit.wikimedia.org/r/387525 (https://phabricator.wikimedia.org/T178359)
[06:53:11] (PS1) Marostegui: s5,s6.hosts: Add db2089 [software] - https://gerrit.wikimedia.org/r/387526 (https://phabricator.wikimedia.org/T178359)
[06:55:06] (CR) Marostegui: [C: 2] mariadb: Add db2089 to s6 and s5(s8) [puppet] - https://gerrit.wikimedia.org/r/387525 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:56:33] (CR) Marostegui: [C: 2] s5,s6.hosts: Add db2089 [software] - https://gerrit.wikimedia.org/r/387526 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:57:20] (Merged) jenkins-bot: s5,s6.hosts: Add db2089 [software] - https://gerrit.wikimedia.org/r/387526 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:57:35] (PS3) Marostegui: db-codfw.php: Pool db2084 as multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/386810 (https://phabricator.wikimedia.org/T178553)
[07:09:32] !log Stop MySQL on db2087.s6 to copy its content to db2089 - T178359
[07:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:40] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:41:37] (CR) Giuseppe Lavagetto: [C: 1] puppet: change discovery-statefile template to parse under puppet 4 [puppet] - https://gerrit.wikimedia.org/r/387237 (https://phabricator.wikimedia.org/T179084) (owner: Herron)
[07:42:37] (CR) Giuseppe Lavagetto: [C: -1] "See inline comment."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/387307 (https://phabricator.wikimedia.org/T179290) (owner: Herron)
[08:15:00] !log Stop MySQL on db2086 to copy s5 to db2089 - T178359
[08:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:08] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:18:54] (PS1) Marostegui: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387531 (https://phabricator.wikimedia.org/T161088)
[08:20:39] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387531 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:21:49] (Merged) jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387531 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:21:58] (CR) jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387531 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:22:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 with low weight - T161088 (duration: 00m 50s)
[08:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:00] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[08:35:07] (CR) Filippo Giunchedi: diamond: nfsiostat update collector to read from arbitrary NFS mount points (1 comment) [puppet] - https://gerrit.wikimedia.org/r/387243 (https://phabricator.wikimedia.org/T179024) (owner: Arturo Borrero Gonzalez)
[08:39:03] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review, User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3722441 (Marostegui) db1047's data has been migrated and
imported to db1108
[08:41:00] (PS1) Marostegui: db-eqiad.php: Increase db1084 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387533 (https://phabricator.wikimedia.org/T161088)
[08:43:47] (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1084 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387533 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:44:57] (Merged) jenkins-bot: db-eqiad.php: Increase db1084 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387533 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:46:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1084 weight - T161088 (duration: 00m 50s)
[08:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:11] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[08:46:23] (CR) jenkins-bot: db-eqiad.php: Increase db1084 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387533 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[08:48:01] Operations, media-storage, User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3722447 (fgiunchedi) @Jcb mind trying one more time? sorry for all the back and forth! I've purged t...
[08:52:01] (CR) Gehel: Cumin: fix wmf-style violations (2 comments) [puppet] - https://gerrit.wikimedia.org/r/386799 (owner: Volans)
[08:52:47] Operations, Analytics, DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3722449 (Marostegui) db1047's data has been migrated to db1108. It is working fine now. We are going to leave db1108 working for a few days, make sure the event logging syn...
[08:53:38] Operations, Continuous-Integration-Infrastructure (shipyard): wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3722450 (hashar) package_builder set up some copy on write images for the distribution we support. When...
[09:09:29] (CR) Marostegui: [C: 2] db-codfw.php: Pool db2084 as multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/386810 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:10:39] (Merged) jenkins-bot: db-codfw.php: Pool db2084 as multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/386810 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:10:48] (CR) jenkins-bot: db-codfw.php: Pool db2084 as multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/386810 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:12:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2084 as multi-instance core host T178553 T178359 (duration: 00m 49s)
[09:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:33] T178553: Support multi-instance hosts on mediawiki-config - https://phabricator.wikimedia.org/T178553
[09:12:33] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:14:31] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3722466 (elukey) >>! In T173710#3720358, @EBernhardson wrote: > All jobs have a `requestId` parameter, which is passed down through the execution cha...
[09:21:17] (CR) Volans: "thanks for the review, replies inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/386799 (owner: Volans)
[09:28:30] (PS1) Marostegui: db2084.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/387536 (https://phabricator.wikimedia.org/T178359)
[09:29:48] (CR) Marostegui: [C: 2] db2084.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/387536 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:31:36] (PS1) Marostegui: db-eqiad.php: Increase weight db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387537 (https://phabricator.wikimedia.org/T161088)
[09:40:33] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:24] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 75633 bytes in 0.392 second response time
[09:43:27] (CR) Filippo Giunchedi: [C: 2] redis: add stretch support [puppet] - https://gerrit.wikimedia.org/r/386870 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi)
[09:44:33] Operations, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#3722477 (fgiunchedi)
[09:49:30] (PS7) Filippo Giunchedi: redis: add stretch support [puppet] - https://gerrit.wikimedia.org/r/386870 (https://phabricator.wikimedia.org/T148637)
[09:49:30] (PS8) Filippo Giunchedi: prometheus: add redis_exporter class and profile [puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637)
[09:50:22] (CR) jerkins-bot: [V: -1] prometheus: add redis_exporter class and profile [puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi)
[09:50:33] (PS8) Filippo Giunchedi: redis: add stretch support [puppet] - https://gerrit.wikimedia.org/r/386870
(https://phabricator.wikimedia.org/T148637)
[09:52:43] (CR) Filippo Giunchedi: [V: 2 C: 2] prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: Filippo Giunchedi)
[09:52:48] (PS4) Filippo Giunchedi: prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922)
[09:55:49] (CR) jerkins-bot: [V: -1] prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: Filippo Giunchedi)
[09:56:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:01:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 12 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:02:23] (PS2) Giuseppe Lavagetto: contint: move blubber from docker to pipeline profile [puppet] - https://gerrit.wikimedia.org/r/385208 (owner: Hashar)
[10:02:32] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387537 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:03:55] (Merged) jenkins-bot: db-eqiad.php: Increase weight db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387537 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:04:30] (PS9) Filippo Giunchedi: prometheus: add redis_exporter class and profile [puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637)
[10:04:34] (CR) Giuseppe Lavagetto: [C: 2] contint: move blubber from docker to pipeline profile [puppet] - https://gerrit.wikimedia.org/r/385208 (owner: Hashar)
[10:05:03] (CR) jerkins-bot: [V: -1] prometheus: add redis_exporter class and profile
[puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi)
[10:05:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1084 weight - T161088 (duration: 00m 50s)
[10:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:12] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[10:06:51] (PS10) Filippo Giunchedi: prometheus: add redis_exporter class and profile [puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637)
[10:08:48] (PS1) Giuseppe Lavagetto: profile::docker::builder: switch to using docker-pkg [puppet] - https://gerrit.wikimedia.org/r/387540
[10:11:49] (PS2) Giuseppe Lavagetto: profile::docker::builder: switch to using docker-pkg [puppet] - https://gerrit.wikimedia.org/r/387540
[10:14:49] (CR) Gehel: Cumin: fix wmf-style violations (2 comments) [puppet] - https://gerrit.wikimedia.org/r/386799 (owner: Volans)
[10:14:53] (CR) Filippo Giunchedi: [V: 2 C: 2] prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: Filippo Giunchedi)
[10:15:28] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:15:35] (PS5) Filippo Giunchedi: prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922)
[10:16:07] (CR) jerkins-bot: [V: -1] prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: Filippo Giunchedi)
[10:16:59] (CR) Filippo Giunchedi: [V: 2 C: 2] prometheus: add analytics instance [puppet] - https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: Filippo Giunchedi)
[10:18:04] _joe_: merging your change too
[10:18:28] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[10:18:37] <_joe_> godog: oh thanks, sorry
[10:19:00] (PS3) Giuseppe Lavagetto: profile::docker::builder: switch to using docker-pkg [puppet] - https://gerrit.wikimedia.org/r/387540
[10:19:44] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8565/" [puppet] - https://gerrit.wikimedia.org/r/387540 (owner: Giuseppe Lavagetto)
[10:21:19] (PS1) Volans: Add Python 3 support [software/conftool] - https://gerrit.wikimedia.org/r/387544
[10:22:29] (PS1) Marostegui: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387545 (https://phabricator.wikimedia.org/T161088)
[10:22:34] (CR) jerkins-bot: [V: -1] Add Python 3 support [software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[10:22:47] (CR) jenkins-bot: db-eqiad.php: Increase weight db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/387537 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:23:53] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1084 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387545 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:24:04] (PS2) Volans: Add Python 3
support [software/conftool] - https://gerrit.wikimedia.org/r/387544
[10:25:10] (Merged) jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387545 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:26:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1084 original weight - T161088 (duration: 00m 48s)
[10:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:23] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088
[10:26:30] (CR) Volans: "Quick attempt to add Python3 compatibility to conftool without using any additional dependency. Tests are passing in 2.7, 3.4, 3.5, 3.6, b" [software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[10:27:58] (PS2) Volans: Cumin: fix wmf-style violations [puppet] - https://gerrit.wikimedia.org/r/386799
[10:28:43] (CR) Volans: "Addressed comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/386799 (owner: Volans)
[10:41:11] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/386799 (owner: Volans)
[10:43:30] (PS1) Filippo Giunchedi: prometheus: fix port for prometheus/analytics [puppet] - https://gerrit.wikimedia.org/r/387546
[10:44:34] (CR) Filippo Giunchedi: [C: 2] prometheus: fix port for prometheus/analytics [puppet] - https://gerrit.wikimedia.org/r/387546 (owner: Filippo Giunchedi)
[10:46:06] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8564/" [puppet] - https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi)
[10:46:47] (CR) jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/387545 (https://phabricator.wikimedia.org/T161088) (owner: Marostegui)
[10:50:36] (PS1) Giuseppe Lavagetto: Do not add
twice the registry name to images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/387547
[10:51:09] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Do not add twice the registry name to images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/387547 (owner: Giuseppe Lavagetto)
[10:59:59] (PS1) Giuseppe Lavagetto: New release; fixes to the build makefile [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/387550
[11:00:51] (PS1) Filippo Giunchedi: prometheus: add k8s instance [puppet] - https://gerrit.wikimedia.org/r/387551 (https://phabricator.wikimedia.org/T177395)
[11:01:00] (CR) Giuseppe Lavagetto: [V: 2 C: 2] New release; fixes to the build makefile [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/387550 (owner: Giuseppe Lavagetto)
[11:01:24] (CR) jerkins-bot: [V: -1] prometheus: add k8s instance [puppet] - https://gerrit.wikimedia.org/r/387551 (https://phabricator.wikimedia.org/T177395) (owner: Filippo Giunchedi)
[11:02:29] !log oblivian@tin Started deploy [docker-pkg/deploy@576c80f]: Fix image_tag filter
[11:02:31] !log oblivian@tin Finished deploy [docker-pkg/deploy@576c80f]: Fix image_tag filter (duration: 00m 02s)
[11:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:18] !log oblivian@tin Started deploy [docker-pkg/deploy@20908e8]: Actually deploy the fix
[11:05:22] !log oblivian@tin Finished deploy [docker-pkg/deploy@20908e8]: Actually deploy the fix (duration: 00m 04s)
[11:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:16] Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3722645 (fgiunchedi)
[11:21:48]
(Abandoned) Marostegui: Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[11:27:41] (PS1) Marostegui: tendril.sql.erb: Tracking tendril users somewhere [puppet] - https://gerrit.wikimedia.org/r/387557 (https://phabricator.wikimedia.org/T148955)
[11:28:11] (Abandoned) Marostegui: mariadb: Start puppetizing tendril mysql users [puppet] - https://gerrit.wikimedia.org/r/348930 (https://phabricator.wikimedia.org/T148955) (owner: Marostegui)
[11:30:51] (CR) Marostegui: [C: 2] "As expected, this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/8566/" [puppet] - https://gerrit.wikimedia.org/r/387557 (https://phabricator.wikimedia.org/T148955) (owner: Marostegui)
[11:41:46] (PS4) Arturo Borrero Gonzalez: diamond: nfsiostat update collector to read from arbitrary NFS mount points [puppet] - https://gerrit.wikimedia.org/r/387243 (https://phabricator.wikimedia.org/T179024)
[11:42:34] godog: ^^
[11:44:55] neat arturo !
[11:44:55] (CR) Filippo Giunchedi: [C: 1] diamond: nfsiostat update collector to read from arbitrary NFS mount points [puppet] - https://gerrit.wikimedia.org/r/387243 (https://phabricator.wikimedia.org/T179024) (owner: Arturo Borrero Gonzalez)
[11:59:06] (CR) BBlack: [C: 1] "If it works, I don't see any reason not to move forward for now. I'm not generally pleased with the underlying mechanism of listing all t" [puppet] - https://gerrit.wikimedia.org/r/366812 (owner: Muehlenhoff)
[12:00:11] Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3722762 (fgiunchedi) AFAIU this is the procedure to commission new redis instances: * add 05 and 06 to `redis::shards` in https://wikitech.wikimedia...
[12:10:11] !log core DC backup lvses (1004-6,2004-6): disable LRO,pause on all ethernet interfaces
[12:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:30] !log esams+ulsfo backup lvses (3003-4,4003-4): disable LRO,pause on eth0
[12:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:37] !log ulsfo primary lvses (4001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
[12:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:27] !log esams primary lvses (3001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
[12:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:44] !log cp4023: restart varnish-be for mailbox lag
[12:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:52] !log cp4025: restart varnish-be for mailbox lag
[12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:34] Operations, ops-ulsfo, Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3722819 (BBlack) Ping @RobH - this hardware needs replacing. I guess diagnostics aren't perfect, and neither is the SEL, but clearly the node crashes out even during a fresh install.
[12:31:57] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0
[12:36:12] !log codfw primary lvses (2003-6): disable LRO,pause on all ethernet interfaces (under pybal stopped briefly)
[12:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:32] !log eqiad primary lvses (1001-3): disable pause [LRO not possible] on all ethernet interfaces (under pybal stopped briefly)
[12:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:20] (PS8) BBlack: LVS: Disable LRO [puppet] - https://gerrit.wikimedia.org/r/379800
[12:43:12] (CR) BBlack: [C: 2] LVS: Disable LRO [puppet] - https://gerrit.wikimedia.org/r/379800 (owner: BBlack)
[12:45:44] !log mobrovac@tin Started deploy [restbase/deploy@c0f9dc4]: Cassandra 3 Parsoid stash tables
[12:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:50] (PS5) Rush: diamond: nfsiostat update collector to read from arbitrary NFS mount points [puppet] - https://gerrit.wikimedia.org/r/387243 (https://phabricator.wikimedia.org/T179024) (owner: Arturo Borrero Gonzalez)
[12:51:56] (CR) Arturo Borrero Gonzalez: [C: 2] diamond: nfsiostat update collector to read from arbitrary NFS mount points (1 comment) [puppet] - https://gerrit.wikimedia.org/r/387243 (https://phabricator.wikimedia.org/T179024) (owner: Arturo Borrero Gonzalez)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T1300).
[13:00:04] Pchelolo: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:27] I can SWAT today
[13:00:33] I'm here
[13:01:03] !log mobrovac@tin Finished deploy [restbase/deploy@c0f9dc4]: Cassandra 3 Parsoid stash tables (duration: 15m 19s)
[13:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:09] Pchelolo: I'm confused, https://gerrit.wikimedia.org/r/#/c/387230/ is already merged
[13:01:10] thanks zeljkof
[13:01:58] and it's in master branch
[13:02:06] oh, I forgot to create one against the current deployed branch.. sorry, one second
[13:02:17] Pchelolo: ok, that makes sense :)
[13:02:33] !log mobrovac@tin Started deploy [restbase/deploy@c0f9dc4]: Cassandra 3 Parsoid stash tables
[13:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:59] mobrovac: are you deploying? EU SWAT has just started. Should I wait?
[13:05:30] !log mobrovac@tin Finished deploy [restbase/deploy@c0f9dc4]: Cassandra 3 Parsoid stash tables (duration: 02m 58s)
[13:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:27] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:06:41] euh ^ ?
[13:06:50] zeljkof: here's the correct gerrit: https://gerrit.wikimedia.org/r/387562
[13:06:53] sorry for the delay
[13:07:04] Pchelolo: no problem, it's the only patch :)
[13:07:15] please update the deployment calendar
[13:07:28] doing
[13:07:29] mobrovac: that's probably a false alarm, checking
[13:07:56] thnx elukey (i don't have privileges to look at puppet logs there)
[13:08:03] Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node kafka2001.codfw.wmnet
[13:08:56] hm so why is only that node failing?
[13:09:23] lol now it succeeded [13:09:33] (just re-ran it manually) [13:09:50] weird [13:09:54] thnx El [13:09:57] thnx elukey [13:10:05] I am pretty sure it was a temporary glitch, nothing serious [13:10:18] puppet brain freeze [13:10:30] Operations, Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3722972 (BBlack) Auditing production tlsproxy users for the switch to `light` in https://gerrit.wikimedia.org/r/#/c/386424/ shows no excess modules used on any of them, except for the expected lua+ndk on the cache hos... [13:11:27] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:11:48] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3722974 (chasemp) >>! In T171473#3710500, @Cmjohnson wrote: > Dell declined the new system board. We are getting another CPU to since that is the part that seems to be brok... [13:12:13] Operations, Ops-Access-Requests: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3722975 (herron) @Krinkle @Gilles since you are current members of `perf-roots` perhaps you could vouch for this request? Because this request involves root level privileges it will need to be reviewed at... [13:12:23] Pchelolo: 387562 is at mwdebug1002, can you test there?
[13:13:11] zeljkof: I can't test the patch exactly, lemme just check it doesn't break anything [13:13:27] Pchelolo: sure, let me know when I can deploy [13:14:20] !log caches@ulsfo - upgrade nginx to 1.13.6-2+wmf1~jessie1 [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:28] zeljkof: you can deploy, it doesn't break anything [13:15:53] Pchelolo: deploying [13:17:13] !log zfilipin@tin Synchronized php-1.31.0-wmf.4/extensions/EventBus/EventBus.php: SWAT: [[gerrit:387562|Dont log full request to service in case of delivery error (T179280)]] (duration: 00m 51s) [13:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:19] T179280: PHP out of memory error trying to log big events - https://phabricator.wikimedia.org/T179280 [13:17:44] Pchelolo: deployed, please check and thanks for deploying with #releng ;) [13:18:09] !log EU SWAT finished [13:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:37] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:18:37] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:18:37] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:18:38] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:18:43] all looks good, thank you zeljkof [13:18:58] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:19:07] PROBLEM - SSH on stat1005 is CRITICAL: Server answer [13:19:18] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:19:33] HMmm [13:19:38] 10Operations, 10Puppet, 10User-Joe: Puppet4: Catalog failed: Catalog has broken references: service[yhsm-daemon](/etc/puppet/modules/yubiauth/manifests/yhsm_daemon.pp:16 - https://phabricator.wikimedia.org/T179382#3723000 (10herron) [13:19:53] 10Operations, 10Puppet, 
10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3700540 (10herron) [13:20:17] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:22:27] !log caches@codfw - upgrade nginx to 1.13.6-2+wmf1~jessie1 [13:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:07] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [13:26:31] 10Operations, 10Puppet, 10User-Joe: Puppet4: Catalog failed: Catalog has broken references: service[yhsm-daemon](/etc/puppet/modules/yubiauth/manifests/yhsm_daemon.pp:16 - https://phabricator.wikimedia.org/T179382#3723037 (10herron) [13:27:49] (03PS1) 10Herron: puppet: fix yubiauth yhsm_daemon file require [puppet] - 10https://gerrit.wikimedia.org/r/387563 (https://phabricator.wikimedia.org/T179382) [13:33:07] RECOVERY - DPKG on stat1005 is OK: All packages OK [13:33:27] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [13:33:37] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:33:38] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [13:35:17] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:38:10] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8567/" [puppet] - 10https://gerrit.wikimedia.org/r/387563 (https://phabricator.wikimedia.org/T179382) (owner: 10Herron) [13:38:15] (03PS1) 10Giuseppe Lavagetto: Remove nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/387564 [13:38:17] (03PS1) 10Giuseppe Lavagetto: Add python-build images used to build python apps [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/387565 [13:38:57] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:39:31] (03PS1) 10Ottomata: Remove temporary rsync access for frack to statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/387567 (https://phabricator.wikimedia.org/T178509) [13:40:21] (03CR) 10Ottomata: [C: 032] Remove temporary rsync access for frack to statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/387567 (https://phabricator.wikimedia.org/T178509) (owner: 10Ottomata) [13:42:15] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: fix yubiauth yhsm_daemon file require [puppet] - 10https://gerrit.wikimedia.org/r/387563 (https://phabricator.wikimedia.org/T179382) (owner: 10Herron) [13:47:26] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3723150 (10Ottomata) Ok, I've saved halfak's data. This should be ok to wipe. @cmjohnson, stat1003 is still powered up. I can shut it down (`sudo poweroff`?), or let you do it. (I'm... [13:47:32] (03PS2) 10Herron: puppet: change mediawiki updatequerypages::cronjob call to full name [puppet] - 10https://gerrit.wikimedia.org/r/387307 (https://phabricator.wikimedia.org/T179290) [13:47:48] (03PS1) 10Filippo Giunchedi: labs: use new redis servers for locks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387570 (https://phabricator.wikimedia.org/T179371) [13:49:09] (03CR) 10Herron: [C: 032] puppet: fix yubiauth yhsm_daemon file require [puppet] - 10https://gerrit.wikimedia.org/r/387563 (https://phabricator.wikimedia.org/T179382) (owner: 10Herron) [13:49:15] (03CR) 10Elukey: [C: 031] "ip looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/387570 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [13:49:18] (03PS2) 10Herron: puppet: fix yubiauth yhsm_daemon file require [puppet] - 10https://gerrit.wikimedia.org/r/387563 (https://phabricator.wikimedia.org/T179382) [13:50:41] (03CR) 10Herron: [C: 032] puppet: change discovery-statefile template to parse under puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/387237 (https://phabricator.wikimedia.org/T179084) (owner: 10Herron) [13:50:46] (03PS2) 10Herron: puppet: change discovery-statefile template to parse under puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/387237 (https://phabricator.wikimedia.org/T179084) [13:51:58] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 [13:52:12] (03CR) 10Hashar: "check experimental" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 (owner: 10Hashar) [13:52:18] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:53:01] <_joe_> that's ^^ not tue [13:53:04] <_joe_> *true [13:53:15] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 [13:53:27] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar) [13:53:39] (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar) [13:53:49] (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar) [13:53:51] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar) [13:54:10] !log caches@esams - upgrade nginx to 1.13.6-2+wmf1~jessie1 [13:54:15] (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/387572 (owner: 10Hashar) [13:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:19] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/387571 (owner: 10Hashar) [13:55:47] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [13:55:52] we should put some code in that icinga check for: if ($hostname ~ /puppetmaster/) { $msg = "Someone accidentally ran the puppet agent instead of puppet-merge and then hit ctrl+c"; } [13:57:07] !log caches@eqiad - upgrade nginx to 1.13.6-2+wmf1~jessie1 [13:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] RECOVERY - 
puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:57:20] (PS1) Hashar: Fix flake8 issue [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387573 [13:58:16] (CR) Hashar: "check experimental" [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387573 (owner: Hashar) [13:59:25] (CR) Hashar: [C: 1] Fix flake8 issue [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387573 (owner: Hashar) [13:59:31] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387572 (owner: Hashar) [13:59:35] elukey: FYI analytics1029 ^^^ [14:01:27] PROBLEM - HHVM rendering on mw2200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:17] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 75603 bytes in 0.335 second response time [14:03:57] Operations, Puppet, Patch-For-Review, User-Joe: Puppet4: Catalog failed: Catalog has broken references: service[yhsm-daemon](/etc/puppet/modules/yubiauth/manifests/yhsm_daemon.pp:16 - https://phabricator.wikimedia.org/T179382#3723227 (herron) [14:05:42] (PS1) Herron: puppet: fix yubiauth yhsm_yubikey_ksm file require [puppet] - https://gerrit.wikimedia.org/r/387575 (https://phabricator.wikimedia.org/T179382) [14:07:54] volans: yeah there is a task, I think that downtime expired [14:08:09] (CR) Giuseppe Lavagetto: [C: 2] Update my (ori) keys [puppet] - https://gerrit.wikimedia.org/r/387301 (owner: Ori.livneh) [14:08:09] ah ok, sorry didn't check before pinging ;) [14:08:16] (PS2) Giuseppe Lavagetto: Update my (ori) keys [puppet] - https://gerrit.wikimedia.org/r/387301 (owner: Ori.livneh) [14:08:17] nono thanks for the ping :) [14:08:57] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:09:53] (CR) Ottomata: [C: 2] Fix flake8 issue [puppet/varnishkafka]
- 10https://gerrit.wikimedia.org/r/387573 (owner: 10Hashar) [14:10:05] ottomata: merci ! [14:10:09] :) [14:11:09] !log cpNNNN: upgrade base packages (curl, git, libc-dev, etc) [14:11:10] (03PS7) 10Filippo Giunchedi: hieradata: add redis stretch deployment-prep instances [puppet] - 10https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) [14:11:12] (03PS1) 10Filippo Giunchedi: hieradata: use deployment-redis05 for labs jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/387579 (https://phabricator.wikimedia.org/T179371) [14:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:22] (03PS1) 10Marostegui: install_server: Reimage db2085 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/387580 (https://phabricator.wikimedia.org/T178359) [14:12:30] (03CR) 10Elukey: "Hello everybody, after a refactoring this file moved to another location in puppet, would you mind to amend this change? I'll merge asap, " [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [14:13:02] (03CR) 10Marostegui: [C: 032] install_server: Reimage db2085 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/387580 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:13:48] !log lvsNNNN: upgrade base packages (curl, git, libc-dev, etc) [14:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:54] !log authdns (baham, radon, eeden): upgrade base packages (curl, git, libc-dev, etc) [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:42] 10Operations, 10Puppet, 10User-Joe: Puppet4: Catalog failed: Catalog has broken references: varnish::wikimedia_vcl[/usr/share/varnish/tests/wikimedia-common_upload-backend.inc.vcl](/etc/puppet/modules/varnish/manifests/instance.pp:98 - https://phabricator.wikimedia.org/T179382#3723302 (10herron) [14:28:24] (03CR) 10Pmiazga: [C: 031] "Elukey - 
give me couple minutes" [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [14:28:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3723308 (10herron) [14:29:39] (03PS1) 10Ottomata: Set main-eqiad -> jumbo->eqiad mirror maker heap to 512M [puppet] - 10https://gerrit.wikimedia.org/r/387581 (https://phabricator.wikimedia.org/T177216) [14:30:49] (03PS2) 10Ottomata: Set main-eqiad -> jumbo->eqiad mirror maker heap to 512M [puppet] - 10https://gerrit.wikimedia.org/r/387581 (https://phabricator.wikimedia.org/T177216) [14:32:07] (03CR) 10Ottomata: [C: 032] Set main-eqiad -> jumbo->eqiad mirror maker heap to 512M [puppet] - 10https://gerrit.wikimedia.org/r/387581 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [14:33:33] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3723313 (10RobH) Yep, I'll email our Dell reps today explaining our issues and asking for next steps. (Not sure if they'll want to try swapping the mainboard out or just the entire system.) 
[14:33:36] (03PS4) 10Pmiazga: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [14:36:35] (03PS5) 10Pmiazga: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [14:37:45] 10Operations, 10monitoring: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#3723318 (10fgiunchedi) [14:40:42] (03PS6) 10Pmiazga: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [14:41:33] (03PS1) 10Herron: puppet: fix trailing slash on file resource /usr/share/varnish/tests [puppet] - 10https://gerrit.wikimedia.org/r/387584 (https://phabricator.wikimedia.org/T179382) [14:43:00] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/387564 (owner: 10Giuseppe Lavagetto) [14:43:07] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Remove nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/387564 (owner: 10Giuseppe Lavagetto) [14:43:20] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add python-build images used to build python apps [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/387565 (owner: 10Giuseppe Lavagetto) [14:44:03] 10Operations, 10Puppet, 10User-Joe: Puppet4:Catalog failed: Catalog has broken references: service[yhsm-daemon](/etc/puppet/modules/yubiauth/manifests/yhsm_daemon.pp:16 - https://phabricator.wikimedia.org/T179382#3723357 (10herron) [14:45:41] 10Operations, 10Puppet, 10User-Joe: Catalog failed: Catalog has broken references: varnish::wikimedia_vcl[/usr/share/varnish/tests/wikimedia-common_upload-backend.inc.vcl](/etc/puppet/modules/varnish/manifests/instance.pp:98 - 
https://phabricator.wikimedia.org/T179396#3723363 (10herron) [14:45:59] (03PS2) 10Herron: puppet: fix trailing slash on file resource /usr/share/varnish/tests [puppet] - 10https://gerrit.wikimedia.org/r/387584 (https://phabricator.wikimedia.org/T179396) [14:46:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3703415 (10herron) [14:47:27] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:49:35] 10Operations, 10Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3723386 (10BBlack) And the other side of this audit. Before we try to (carefully) switch to nginx-light, we need them all upgraded to the latest version so the dpkg-level replacement works sanely: ``` bblack@neodymium:... [14:50:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Catalog failed: Catalog has broken references: varnish::wikimedia_vcl[/usr/share/varnish/tests/wikimedia-common_upload-backend.inc.vcl](/etc/puppet/modules/varnish/manifests/instance.pp:98 - https://phabricator.wikimedia.org/T179396#3723401 (10herron) [14:52:21] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8568/" [puppet] - 10https://gerrit.wikimedia.org/r/387584 (https://phabricator.wikimedia.org/T179396) (owner: 10Herron) [14:52:59] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppet4: Catalog failed: Catalog has broken references: varnish::wikimedia_vcl[/usr/share/varnish/tests/wikimedia-common_upload-backend.inc.vcl](/etc/puppet/modules/varnish/manifests/instance.pp:98 - https://phabricator.wikimedia.org/T179396#3723435 (1... 
[14:55:14] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, Failed to parse template role/prometheus/node_site.yaml.erb - https://phabricator.wikimedia.org/T179400#3723443 (10herron) [14:55:40] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3723462 (10herron) [14:55:42] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, Failed to parse template role/prometheus/node_site.yaml.erb - https://phabricator.wikimedia.org/T179400#3723461 (10herron) [14:56:07] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3707073 (10herron) [14:56:57] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:19] (03PS1) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [14:59:51] (03CR) 10jerkins-bot: [V: 04-1] Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [15:00:49] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3723497 (10fgiunchedi) cc #cloud-services-team for input on some of these Diamonds we have in use, namely: * nfsiostat.py we could replace it with `... [15:01:27] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received [15:02:21] <_joe_> uhm [15:02:26] <_joe_> just scb1002? 
[15:02:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:06:21] (03CR) 10Ottomata: "Q: Do we use this r_lang::cran to install packages from remote R Mirrors in production already? I did some searching around, as far as I " [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [15:10:11] (03PS1) 10Giuseppe Lavagetto: Use python-build images for build [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/387587 [15:10:52] <_joe_> ottomata: tomorrow is a bank holiday here [15:11:01] 10Operations, 10ChangeProp, 10RESTBase, 10Wikimedia-Logstash, and 2 others: RB and CP logs disappeared from Logstash - https://phabricator.wikimedia.org/T179058#3723609 (10Gehel) 05Open>03Resolved a:03Gehel This has been solved by reverting to the previous template (that previous template was applied... [15:11:25] oh! [15:11:26] ok moving it [15:11:35] _joe_: wanna do today same time? in 20 mins? [15:11:56] <_joe_> ottomata: thursday is better maybe [15:11:59] ok [15:12:22] moved [15:15:07] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:15:18] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:15:27] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:15:38] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:15:58] (03CR) 10GoranSMilovanovic: "@Ottomata I am not sure whether r_lang::cran is used in production or not. @Bearloga will know. 
The WDCM system that @Addshore is trying t" [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [15:17:18] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:17:28] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:20:34] (03PS1) 10Herron: puppet: change wmflib get_clusters to use call_function hiera [puppet] - 10https://gerrit.wikimedia.org/r/387588 (https://phabricator.wikimedia.org/T179400) [15:26:57] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:15] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: change wmflib get_clusters to use call_function hiera [puppet] - 10https://gerrit.wikimedia.org/r/387588 (https://phabricator.wikimedia.org/T179400) (owner: 10Herron) [15:28:27] (03CR) 10Herron: [C: 032] puppet: change wmflib get_clusters to use call_function hiera [puppet] - 10https://gerrit.wikimedia.org/r/387588 (https://phabricator.wikimedia.org/T179400) (owner: 10Herron) [15:30:46] (03PS2) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [15:32:10] (03CR) 10Chad: [C: 032] All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [15:32:19] (03CR) 10jerkins-bot: [V: 04-1] All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [15:33:08] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:33:47] RECOVERY - DPKG on stat1005 is OK: All packages OK [15:34:07] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [15:34:18] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [15:34:27] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 
0, Spare: 0 [15:35:18] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3723713 (10herron) [15:35:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3723714 (10herron) [15:35:23] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppet4: Error while evaluating a Function Call, Failed to parse template role/prometheus/node_site.yaml.erb - https://phabricator.wikimedia.org/T179400#3723711 (10herron) 05Open>03Resolved a:03herron [15:35:35] (03PS3) 10Chad: All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 [15:37:18] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:37:27] (03CR) 10Chad: [C: 032] All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [15:38:39] (03Merged) 10jenkins-bot: All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [15:38:49] (03CR) 10jenkins-bot: All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 (owner: 10Chad) [15:39:48] (03PS1) 10ArielGlenn: drop check of existence for config file from all utils [dumps] - 10https://gerrit.wikimedia.org/r/387593 [15:40:53] 10Operations, 10media-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3723724 (10Jcb) I could delete both files now. Thanks! 
[15:41:36] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.1 (duration: 03m 15s) [15:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:41] !log demon@tin Synchronized php: Symlink bump (duration: 00m 48s) [15:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: all but wikidata.org -> wmf.5 [15:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:31] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'instance' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator/site_instances.pp:4:5 - https://phabricator.wikimedia.org/T179408#3723735 (10herron) [15:49:37] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'instance' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator/site_instances.pp:4:5 - https://phabricator.wikimedia.org/T179408#3723749 (10herron) [15:49:50] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3710655 (10herron) [15:50:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3710672 (10herron) [15:50:51] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Failed to parse template authdns/discovery-statefile.tpl.erb - https://phabricator.wikimedia.org/T179084#3723754 (10herron) 05Open>03Resolved a:03herron [15:51:08] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3710710 (10herron) [15:53:47] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a 
Resource Statement, Unknown resource type: 'instance' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator/site_instances.pp:4:5 - https://phabricator.wikimedia.org/T179408#3723760 (10herron) [15:53:49] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'instance' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator/site_instances.pp:4:5 - https://phabricator.wikimedia.org/T179291#3723762 (10herron) [15:59:32] (03PS1) 10Chad: group0 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387598 [16:00:04] godog, moritzm, and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:19] (03CR) 10Chad: [C: 04-2] group0 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387598 (owner: 10Chad) [16:02:48] !log demon@tin Started scap: bootstrap wmf.6 [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:03:23] Why was Wikidata reverted? [16:03:27] what's up? [16:03:33] It's been left on wmf.4 like it had been [16:03:36] We never moved it to wmf.5 [16:03:50] I moved it back on Monday morning [16:03:55] jfc nobody tells me anything.... 
[16:04:39] (CR) Giuseppe Lavagetto: [C: 1] "I think the code is correct and does what's supposed to do; I'm not sure, though, how convenient it is to distinguish between internal and" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans) [16:05:11] It was on the relevant ticket [16:05:31] (PS1) Herron: puppet: change ganglia site_instances to call instance by full name [puppet] - https://gerrit.wikimedia.org/r/387601 (https://phabricator.wikimedia.org/T179408) [16:05:37] The ticket has gotten rather long, I perhaps missed it ;-) [16:06:10] that's why I put a checklist in the description [16:06:16] but yes, the ticket got hard to comprehend [16:06:24] _joe_: Got a sec to look at a few quick puppet patches since it's your swat window? [16:06:50] hoo: Ok ok. I'll stop being snarky, it's not your fault. After wmf.6 is done bootstrapping, we'll put wikidata back on wmf.5 [16:07:01] Works for me, tbh. I hate weeks where we run 3 branches @ once [16:07:18] +1 [16:07:55] (CR) ArielGlenn: [C: 2] drop check of existence for config file from all utils [dumps] - https://gerrit.wikimedia.org/r/387593 (owner: ArielGlenn) [16:08:06] Operations, Continuous-Integration-Infrastructure (shipyard): wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3723800 (Legoktm) The real reason I was asking is that in our hhvm puppet class we run `apt-get build-d...
[16:08:40] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/8570/" [puppet] - https://gerrit.wikimedia.org/r/387601 (https://phabricator.wikimedia.org/T179408) (owner: Herron) [16:09:26] !log ariel@tin Started deploy [dumps/dumps@fa0583b]: adapt dumpadmin script to use override section in config [16:09:28] !log ariel@tin Finished deploy [dumps/dumps@fa0583b]: adapt dumpadmin script to use override section in config (duration: 00m 02s) [16:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:07] (PS2) Herron: puppet: change ganglia site_instances to call instance by full name [puppet] - https://gerrit.wikimedia.org/r/387601 (https://phabricator.wikimedia.org/T179291) [16:13:21] (CR) Volans: "thanks for the review! Replies inline" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans) [16:15:05] hoo: are the unchecked things at the top still in need of un-revert? remex + ores->wikidata?
[16:15:13] (03PS2) 10Chad: Fix a ton of errors that were making flake8 freak out [software/conftool] - 10https://gerrit.wikimedia.org/r/387279 [16:15:26] bblack: ores->wikidata is now back online [16:15:33] but I think remex need some stuff [16:15:49] Remex is still off [16:15:51] (03CR) 10jerkins-bot: [V: 04-1] Fix a ton of errors that were making flake8 freak out [software/conftool] - 10https://gerrit.wikimedia.org/r/387279 (owner: 10Chad) [16:16:11] and the flow parsoid limit is still at 50s AFAIK [16:16:48] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8571/auth1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/387575 (https://phabricator.wikimedia.org/T179382) (owner: 10Herron) [16:18:07] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3723835 (10hoo) [16:18:34] mostly I just want to close the loop on ensuring that the VCL do_stream=false revert alone was enough to stop the carnage. [16:18:51] (03CR) 10Herron: [C: 032] puppet: fix yubiauth yhsm_yubikey_ksm file require [puppet] - 10https://gerrit.wikimedia.org/r/387575 (https://phabricator.wikimedia.org/T179382) (owner: 10Herron) [16:18:58] (03PS2) 10Herron: puppet: fix yubiauth yhsm_yubikey_ksm file require [puppet] - 10https://gerrit.wikimedia.org/r/387575 (https://phabricator.wikimedia.org/T179382) [16:19:07] (it sure seems like that's it, but we've been surprised multiple times during all of this...) 
[16:19:45] If the fatals from remex are otherwise acceptable (that's a product decision, I guess), then yes [16:19:52] right, ok [16:21:40] If wanted, we can undo the $wgFlowParsoidTimeout right now [16:22:09] I find 100s for a synchronous call within an API scary (50s even), though [16:24:24] (03PS2) 10ArielGlenn: Remove mediawiki::users::mwdeploy_pub_key, unused [puppet] - 10https://gerrit.wikimedia.org/r/386065 (https://phabricator.wikimedia.org/T145495) (owner: 10Chad) [16:24:50] well hhvm is configured with max_execution_time = 60s anyways, shouldn't that kick in before this 100s limit? [16:25:41] (03CR) 10ArielGlenn: [C: 032] Remove mediawiki::users::mwdeploy_pub_key, unused [puppet] - 10https://gerrit.wikimedia.org/r/386065 (https://phabricator.wikimedia.org/T145495) (owner: 10Chad) [16:25:45] that's a very good question… probably [16:25:56] <_joe_> it depends [16:26:18] <_joe_> in our case, I think we have the right ini setting [16:26:47] but does it abort immediately or wait out curl first?
[16:26:57] <_joe_> hhvm.timeouts_use_wall_time = true [16:27:09] <_joe_> it should abort immediately AFAIR [16:27:17] <_joe_> but I can do some tests just to be sure [16:27:47] <_joe_> actually, you can too, just write a php script doing curl to a port where you open an nc listener on localhost [16:28:52] yeah, that would work… I'll try it on beta, I guess [16:29:02] <_joe_> also, I'd check the apache timeouts [16:29:30] <_joe_> but I think we set a very high value for ProxyTimeout [16:32:03] (03CR) 10DCausse: Setup CirrusSearch AB test on dbn group sizing (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [16:32:38] !log ppchelko@tin Started deploy [restbase/deploy@7631ea7]: Enable new parsoid storage on test wikipedias [16:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:47] (03PS1) 10Chad: Revert "Revert "Gerrit: Also set minimum heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/387605 [16:34:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Gerrit: Also set minimum heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [16:34:30] (03CR) 10Chad: [C: 031] "Let's go ahead and land this one" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [16:34:49] (03PS10) 10ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - 10https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893) [16:35:04] no_justification for this https://gerrit.wikimedia.org/r/#/c/387605/ you say gerrit won't start on gerrit2001. [16:35:07] (03CR) 10EBernhardson: "Man i made a mess out of this patch, and yet its so small! thanks for the review, fixes incoming." (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [16:35:14] isn't that because of the firewall on the db servers? [16:35:29] Yeah.
Which was unrelated to that actual change ^^ [16:35:32] (03PS3) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [16:35:33] We should go ahead and set it [16:35:35] yep [16:35:57] (03CR) 10Paladox: [C: 031] "Needs rebase then LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [16:36:14] I love that gerrit let me create a revert commit that had zero chance of merging :) [16:36:18] Nice bait and switch there :p [16:36:40] heh [16:37:17] heh, i think that's due to the Submit Type we have set on the repo [16:37:58] (03PS11) 10ArielGlenn: generate one config file for xml/sql dumps for wikis [puppet] - 10https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893) [16:38:38] (03CR) 10ArielGlenn: [C: 032] generate one config file for xml/sql dumps for wikis [puppet] - 10https://gerrit.wikimedia.org/r/386388 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [16:39:34] paladox: Heh, also $java_options doesn't do anything anymore [16:39:42] Most of the stuff in gerrit's container.* section is useless [16:39:46] Ah [16:39:49] hmm [16:40:30] aha [16:40:37] it's due to the systemd change we did [16:41:17] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [16:41:22] no_justification i don't think $java_options as an array will work in the systemd template [16:41:23] Yep [16:41:28] possibly want to make it a string [16:41:35] Well, it could.
[16:41:38] oh [16:41:41] But not important rn [16:41:44] !log ppchelko@tin Finished deploy [restbase/deploy@7631ea7]: Enable new parsoid storage on test wikipedias (duration: 09m 07s) [16:41:44] ok [16:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:07] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:43:58] (03CR) 10DCausse: Setup CirrusSearch AB test on dbn group sizing (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [16:44:00] I think the mobileapps alert was due to a RB deploy (just for the duration of the deploy) [16:44:42] (03PS2) 10Chad: Revert "Revert "Gerrit: Also set minimum heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/387605 [16:45:30] (03CR) 10Paladox: [C: 031] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [16:45:54] (03CR) 10Hashar: "check experimental" [debs/pybal] - 10https://gerrit.wikimedia.org/r/384483 (https://phabricator.wikimedia.org/T178149) (owner: 10Ema) [16:46:04] (03CR) 10Gehel: [C: 031] "It seems that in recent JVM (>1.4) setting -Xms isn't a gain anymore, so +1 to that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [16:46:30] (03CR) 10jenkins-bot: 1.14.2: do not crash on empty runcommand.arguments [debs/pybal] - 10https://gerrit.wikimedia.org/r/384483 (https://phabricator.wikimedia.org/T178149) (owner: 10Ema) [16:47:03] (03PS1) 10ArielGlenn: deplay Nov xml/sql dump run by two days [puppet] - 10https://gerrit.wikimedia.org/r/387607 [16:48:00] (03CR) 10ArielGlenn: [C: 032] deplay Nov xml/sql dump run by two days [puppet] - 10https://gerrit.wikimedia.org/r/387607 (owner: 10ArielGlenn) [16:50:28] !log demon@tin Finished scap: bootstrap wmf.6 (duration: 47m 39s) [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:33] (03PS4) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [16:52:57] * apergos is watching and noting those gc config changes, thanks to our awesome teammate! [16:53:14] 10Operations, 10Puppet, 10User-Joe: Puppet4:Catalog failed: Catalog has broken references: service[yhsm-daemon](/etc/puppet/modules/yubiauth/manifests/yhsm_daemon.pp:16 - https://phabricator.wikimedia.org/T179382#3723981 (10herron) 05Open>03Resolved a:03herron [16:53:16] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3723983 (10herron) [16:53:32] (03PS5) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [16:54:55] (03CR) 10DCausse: [C: 031] Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [16:58:06] (03PS1) 10Chad: wikidata.org back to wmf.5, mea culpa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387609 [16:58:25] hoo: ^^ [16:59:21] (03CR) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing (035 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:15] no parsoid deploy today [17:00:16] (03CR) 10Herron: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8572/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/387307 (https://phabricator.wikimedia.org/T179290) (owner: 10Herron) [17:00:25] maybe ORES today. awight what are you thinking? [17:00:30] (03PS3) 10Herron: puppet: change mediawiki updatequerypages::cronjob call to full name [puppet] - 10https://gerrit.wikimedia.org/r/387307 (https://phabricator.wikimedia.org/T179290) [17:01:15] halfak: I’m very much thinking about that, but still burning-in the revscoring 1<->2 upgrade and rollback on beta at the moment... [17:01:49] It looks like we’re ready to go, other than double-checking the virtualenv/pip glitch I hit earlier. [17:02:11] ooh—I realized why it breaks. [17:02:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppet4 Error while evaluating a Resource Statement, Unknown resource type: 'updatequerypages::cronjob' at /etc/puppet/modules/mediawiki/manifests/maintenance/updatequerypages.pp:12:5 - https://phabricator.wikimedia.org/T179290#3724060 (10herron) 05O... 
[17:05:02] (03CR) 10Herron: [C: 032] puppet: change ganglia site_instances to call instance by full name [puppet] - 10https://gerrit.wikimedia.org/r/387601 (https://phabricator.wikimedia.org/T179291) (owner: 10Herron) [17:05:12] (03PS3) 10Herron: puppet: change ganglia site_instances to call instance by full name [puppet] - 10https://gerrit.wikimedia.org/r/387601 (https://phabricator.wikimedia.org/T179291) [17:06:54] awight, sounds like we should skip this window and do tomorrow's [17:07:16] halfak: +1, I won’t feel confident for at least 30 min no matter what. [17:07:59] (03CR) 10Hoo man: [C: 031] wikidata.org back to wmf.5, mea culpa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387609 (owner: 10Chad) [17:08:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'instance' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator/site_instances.pp:4:5 - https://phabricator.wikimedia.org/T179291#3724088 (10herron) 05Open>03Resolv... [17:08:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3724090 (10herron) [17:09:07] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:10:22] (03PS2) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [17:10:37] (03CR) 10jerkins-bot: [V: 04-1] use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) (owner: 10ArielGlenn) [17:13:39] (03PS3) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [17:19:56] (03CR) 10Chad: [C: 032] wikidata.org back to wmf.5, mea culpa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387609 (owner: 10Chad) [17:21:11] (03Merged) 10jenkins-bot: wikidata.org back to wmf.5, mea culpa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387609 (owner: 10Chad) [17:21:21] (03CR) 10jenkins-bot: wikidata.org back to wmf.5, mea culpa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387609 (owner: 10Chad) [17:23:16] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidata back to wmf.5 [17:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:07] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:38:18] (03PS2) 10Chad: Get rid of squid-file-labs in favor of new reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) [17:40:31] halfak, awight: Warning: Invalid argument supplied for foreach() in /srv/mediawiki/php-1.31.0-wmf.5/extensions/ORES/includes/Cache.php on line 56 [17:40:35] Spotted a bit ago [17:40:51] no_justification: ty for the report! [17:40:52] Amir1: ^ [17:41:01] (03CR) 10Jforrester: "Is the idea to do non-labs as a follow-up if this doesn't break things?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/384193 (https://phabricator.wikimedia.org/T104148) (owner: 10Chad) [17:41:33] Amir1: Looks like a v1 vs v3 incompatibility. [17:42:01] Actually, maybe not, cos the $scores[$wikiId]['scores'] line would have failed in that case. [17:45:45] (03CR) 10Bearloga: "> @Ottomata I am not sure whether r_lang::cran is used in production or not. @Bearloga will know. The WDCM system that @Addshore is trying" [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [17:48:30] (03Draft2) 10Jayprakash12345: Enable WikiLove Extension on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387618 [17:49:59] (03PS3) 10Jayprakash12345: Enable WikiLove Extension on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387618 (https://phabricator.wikimedia.org/T178919) [17:50:20] (03PS4) 10Jayprakash12345: Enable ShortUrl on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) [17:54:06] (03CR) 10Zoranzoki21: [C: 031] Enable WikiLove Extension on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387618 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [17:55:16] (03CR) 10Zoranzoki21: [C: 031] "> A little adventurous to enable both extensions in one change, as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [18:02:31] (03PS1) 10Rush: openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [18:03:32] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:07:34] (03PS2) 10Rush: openstack: move main hiera deployment config to common [puppet] - 
10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [18:08:03] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:11:33] no_justification: hi! Just merging a CentralNotice update into that extension's deploy branch, in the hopes of train-riding :) [18:12:00] https://gerrit.wikimedia.org/r/#/c/387627/ (Just waiting on CI before +2'ing) [18:12:10] apologies for always last-minuting like this!!!! [18:16:49] AndyRussG: Train was already branched [18:17:11] And initial sync done. We'll need to do a manual sync [18:20:27] (03CR) 10Thcipriani: [C: 031] keyholder: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [18:21:34] no_justification: oh ooops [18:22:01] no_justification: is it best then if we just book a separate deploy slot for the CN update? [18:22:14] No worries. I started super early this morning because I was playing catch up [18:22:30] We can do it now, prior to swapping the rest of group0 to wmf.6 [18:22:42] Actually, testwiki is already there so that'll make a good canary for you. [18:23:05] So yeah, get it updated in mw-core on the wmf.6 branch and I'll sync it [18:23:05] no_justification: wooohoooo thanks much and apologies for throwing a wrench into your good time organization.... [18:23:20] This is mostly no-op stuff, just a bunch of code cleanup that has made its way into our master branch but that we hadn't gotten around to sending to the main cluster [18:23:35] It's a pretty small wrench.
You basically interrupted my coffee between the earlier deploys and version swap @ noon [18:23:55] heheh K I owe u a coffee then :) [18:24:13] Just merged https://gerrit.wikimedia.org/r/#/c/387627/ [18:24:38] So the submodule pointer for CN should now be at b1e320ca1eaca861772914522a9c1dae49400f1b [18:28:07] no_justification: ^ [18:28:28] (just in case u didn't see it..... thanks much!!!!!!!) [18:29:51] Was that merged to mw/core to pick up the dependency? [18:30:26] Oh dur [18:30:32] it tracks a non-wmf branch [18:30:34] doing [18:32:20] !log demon@tin Synchronized php-1.31.0-wmf.6/extensions/CentralNotice: deployyyyy (duration: 00m 52s) [18:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:09] !log awight@tin Started deploy [ores/deploy@437fcf5]: revscoring 2 -> ores1002 (non-production) [18:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:32] !log awight@tin Finished deploy [ores/deploy@437fcf5]: revscoring 2 -> ores1002 (non-production) (duration: 02m 24s) [18:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:45] (03PS4) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [18:42:59] !log awight@tin Started deploy [ores/deploy@437fcf5]: revscoring 2 -> ores* (non-production) [18:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:17] (03CR) 10jerkins-bot: [V: 04-1] use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) (owner: 10ArielGlenn) [18:46:09] (03PS5) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [18:46:53] no_justification: thx!!!!! 
:) [18:49:18] PROBLEM - HHVM rendering on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:17] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 75532 bytes in 0.310 second response time [18:50:30] awight: Notice: Undefined index: enwiki in /srv/mediawiki/php-1.31.0-wmf.5/extensions/ORES/includes/Cache.php on line 52 too [18:50:54] no_justification: aha that actually makes more sense. En-tasking. [18:53:28] !log awight@tin Finished deploy [ores/deploy@437fcf5]: revscoring 2 -> ores* (non-production) (duration: 10m 29s) [18:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:29] no_justification: T179430 FYI [18:54:29] T179430: ORES extension failing to parse scoring response - https://phabricator.wikimedia.org/T179430 [19:00:04] no_justification: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:07:58] (03CR) 10Chad: [C: 032] group0 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387598 (owner: 10Chad) [19:13:28] (03PS2) 10Chad: group0 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387598 [19:17:47] (03CR) 10jenkins-bot: group0 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387598 (owner: 10Chad) [19:18:26] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.6 [19:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:54] (03PS1) 10BBlack: strongswan: turn on fragmentation of IKE [puppet] - 10https://gerrit.wikimedia.org/r/387648 [19:52:17] (03CR) 10Ayounsi: [C: 031] strongswan: turn on fragmentation of IKE [puppet] - 10https://gerrit.wikimedia.org/r/387648 (owner: 10BBlack) [20:13:08] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8578/" [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:18:07] (03PS1) 10Nuria: Blacklisting TestSearchSatisfaction2 schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 [20:18:21] (03CR) 10jerkins-bot: [V: 04-1] Blacklisting TestSearchSatisfaction2 schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 (owner: 10Nuria) [20:18:30] (03PS3) 10Rush: openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [20:19:22] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:24:28] (03PS2) 10Nuria: Blacklisting SearchSatisfaction schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 [20:24:30] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8579/" [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:24:42] 
(03CR) 10jerkins-bot: [V: 04-1] Blacklisting SearchSatisfaction schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 (owner: 10Nuria) [20:26:03] (03CR) 10EBernhardson: Blacklisting SearchSatisfaction schema on MySQL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/387654 (owner: 10Nuria) [20:32:56] (03PS3) 10Nuria: Blacklisting SearchSatisfaction schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 [20:36:46] (03CR) 10Ottomata: [C: 032] Blacklisting SearchSatisfaction schema on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/387654 (owner: 10Nuria) [20:49:09] (03PS1) 10Ottomata: Initial debian release (2.1.2-bin-hadoop2.6-1) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/387663 [20:49:26] (03PS2) 10Ottomata: Initial debian release (2.1.2-bin-hadoop2.6-1) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/387663 (https://phabricator.wikimedia.org/T158334) [20:56:38] (03CR) 10Ottomata: "Its working! :D" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/387663 (https://phabricator.wikimedia.org/T158334) (owner: 10Ottomata) [21:00:04] MaxSem: Time to snap out of that daydream and deploy CommTech stuff. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. 
[21:00:22] !log removing old AMS-IX IPv6 - T167840 [21:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:28] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [21:01:15] (03PS2) 10MaxSem: Enable Unicode section links on Russian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386553 (https://phabricator.wikimedia.org/T175725) [21:01:24] (03CR) 10MaxSem: [C: 032] Enable Unicode section links on Russian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386553 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:01:31] (03PS2) 10MaxSem: Enable Unicode section links on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) [21:01:40] (03CR) 10MaxSem: [C: 032] Enable Unicode section links on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:02:21] (03PS1) 10Ottomata: Install Spark 2 in Hadoop Cluster [puppet] - 10https://gerrit.wikimedia.org/r/387680 (https://phabricator.wikimedia.org/T158334) [21:02:45] (03CR) 10Ottomata: [V: 032 C: 032] Initial debian release (2.1.2-bin-hadoop2.6-1) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/387663 (https://phabricator.wikimedia.org/T158334) (owner: 10Ottomata) [21:03:59] (03Merged) 10jenkins-bot: Enable Unicode section links on Russian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386553 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:04:32] (03Merged) 10jenkins-bot: Enable Unicode section links on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386710 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:05:17] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:07:02] (03CR) 10jenkins-bot: Enable Unicode section links on Russian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386553 (https://phabricator.wikimedia.org/T175725) (owner: 10MaxSem) [21:09:46] (03PS2) 10Ottomata: Install Spark 2 for Hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/387680 (https://phabricator.wikimedia.org/T158334) [21:16:03] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/386553 https://gerrit.wikimedia.org/r/386710 (duration: 00m 52s) [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:25] (03CR) 10Ottomata: [C: 032] Install Spark 2 for Hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/387680 (https://phabricator.wikimedia.org/T158334) (owner: 10Ottomata) [21:25:08] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:26:27] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Package[spark2] [21:31:27] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:35:17] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:37:56] !log awight@tin Started deploy [ores/deploy@9f361d2]: revscoring 2 -> ores* (non-production) [21:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:47] !log awight@tin Started deploy [ores/deploy@9f361d2]: revscoring 2 -> ores* (non-production) [21:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:08] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:18:42] (03PS3) 10Ladsgroup: Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) [22:20:42] (03CR) 10Ladsgroup: "@Marco: Good catch, thank you. Please check if it works just fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: 10Ladsgroup) [22:22:07] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2042410 [22:27:59] (03PS1) 10Ladsgroup: Enable NewUserMessage on fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387741 (https://phabricator.wikimedia.org/T179442) [22:33:00] (03CR) 10Huji: [C: 031] Enable blocking feature of abuse filter in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: 10Ladsgroup) [22:38:17] (03PS1) 10Thcipriani: Scap prep: check reference directory exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387743 [22:38:34] (03CR) 10MarcoAurelio: [C: 031] "I think it is good now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: 10Ladsgroup) [22:39:47] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:42:07] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171031T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:24] \o/ [23:00:30] just me, i can ship [23:01:06] (03CR) 10EBernhardson: [C: 032] Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [23:03:00] (03PS6) 10EBernhardson: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 [23:03:05] (03CR) 10EBernhardson: [C: 032] Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [23:04:16] (03Merged) 10jenkins-bot: Setup CirrusSearch AB test on dbn group sizing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387586 (owner: 10EBernhardson) [23:04:47] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:11:57] PROBLEM - puppet last run on db2080 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:19:52] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Setup CirrusSearch AB test on dbn group sizing (duration: 00m 53s) [23:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:02] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Setup CirrusSearch AB test on dbn group sizing (duration: 00m 50s) [23:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:28] (03CR) 10Chad: [C: 032] Scap prep: check reference directory exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387743 (owner: 10Thcipriani) [23:36:43] (03Merged) 10jenkins-bot: Scap prep: check reference directory exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387743 (owner: 10Thcipriani) [23:36:58] RECOVERY - puppet last run on db2080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:42:12] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Switch Toolforge project hosts to the future parser - https://phabricator.wikimedia.org/T177298#3725127 (10bd808) @Andrew This was called out by @Joe in a recent techops meeting as being an important and useful milestone to hit while were are still runni... 
[23:45:30] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3725149 (10bd808) [23:45:58] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3725149 (10bd808) [23:48:31] !log demon@tin Synchronized scap/plugins/prep.py: no-op (duration: 00m 50s) [23:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:23] (03PS1) 10EBernhardson: cirrus interleave config should not be wg prefixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387752 [23:50:24] (03PS2) 10EBernhardson: cirrus interleave config should not be wg prefixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387752 [23:50:30] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch Cloud VPS puppet default to future parser - https://phabricator.wikimedia.org/T179451#3725179 (10bd808) The feature flag for this is `profile::base::environment: "future"` in hiera. [23:51:03] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387752 (owner: 10EBernhardson) [23:52:09] (03Merged) 10jenkins-bot: cirrus interleave config should not be wg prefixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387752 (owner: 10EBernhardson) [23:53:59] 10Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T179452#3725194 (10Mehrdadbot) [23:56:38] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Adjust cirrussearch interleave configuration for AB test (duration: 00m 50s) [23:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log