[00:05:16] (PS6) Dzahn: openstack: unified role wikitech/horizon/striker,apache -> httpd [puppet] - https://gerrit.wikimedia.org/r/406954
[00:08:36] (CR) Dzahn: "this kind of thing would get rid of the "include ::apache" issues that jenkins-bot whines about https://gerrit.wikimedia.org/r/#/c/406954/" [puppet] - https://gerrit.wikimedia.org/r/405373 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[00:11:35] Operations, Analytics, ChangeProp, EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3933022 (Pchelolo)
[00:12:50] Operations, ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3933033 (Dzahn)
[00:12:52] Operations, ops-esams, Patch-For-Review: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#3933031 (Dzahn) Open>stalled blocked on switch config (subtask) which is blocked on the info which port it was connected to
[00:14:29] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/9822/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/406061 (owner: Dzahn)
[00:17:45] Operations, Gerrit, Release-Engineering-Team (Someday): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3933076 (demon) p:Low>Lowest
[00:20:16] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933087 (Krinkle)
[00:20:47] Operations, Gerrit, Release-Engineering-Team (Someday): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3933088 (Dzahn) I wouldn't call this Lowest. Stretch is stable and we should have both gerrit servers on same distro version. I wonder what our blockers were.
[00:21:17] Operations, Gerrit, Release-Engineering-Team (Someday): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3933090 (demon) No blockers, other than the time I could ever possibly find to do it.
[00:29:15] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933117 (Krinkle)
[00:31:21] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933120 (Krinkle)
[00:33:45] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933134 (Krinkle)
[00:36:23] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#1810236 (Krinkle)
[00:37:22] (PS1) Dzahn: piwik: duplicate fake secrets to location for new role name [labs/private] - https://gerrit.wikimedia.org/r/406967
[00:37:56] Operations, TechCom-RFC, Traffic, Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933165 (Krinkle)
[00:38:32] (PS2) Dzahn: piwik: duplicate fake secrets to location for new role name [labs/private] - https://gerrit.wikimedia.org/r/406967
[00:38:40] (CR) Dzahn: [V: 2 C: 2] piwik: duplicate fake secrets to location for new role name [labs/private] - https://gerrit.wikimedia.org/r/406967 (owner: Dzahn)
[00:42:37] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/9823/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/406061 (owner: Dzahn)
[00:45:17] (PS2) Krinkle: Remove unused PhpAutoPrepend.php file for now. [mediawiki-config] - https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183)
[00:45:21] (PS3) Krinkle: Initial profiler for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183)
[00:47:26] (PS3) Krinkle: Remove unused PhpAutoPrepend.php file for now. [mediawiki-config] - https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183)
[00:47:31] (PS4) Krinkle: Initial profiler for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183)
[00:47:54] (CR) Dzahn: [C: 1] "only thing i'm not sure about is if this role is used on a staging instance or something" [puppet] - https://gerrit.wikimedia.org/r/406061 (owner: Dzahn)
[00:48:52] (CR) Krinkle: [C: 2] "No-op, depends-on from beta-only change." [mediawiki-config] - https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:49:04] (CR) Krinkle: [C: 2] "Testing in Beta." [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:50:25] (Merged) jenkins-bot: Remove unused PhpAutoPrepend.php file for now. [mediawiki-config] - https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:50:37] (CR) jenkins-bot: Remove unused PhpAutoPrepend.php file for now. [mediawiki-config] - https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:51:11] (Merged) jenkins-bot: Initial profiler for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:51:16] Operations, RESTBase, RESTBase-API, Traffic, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933246 (mobrovac)
[00:51:57] !log krinkle@tin Synchronized wmf-config/profiler.php: no-op (comment-only) (duration: 00m 58s)
[00:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:31] Operations, Cloud-VPS, User-bd808, cloud-services-team (Kanban): End self-service new Trusty instance creation in Cloud VPS; standardize on Debian base images - https://phabricator.wikimedia.org/T161899#3933253 (bd808)
[00:53:55] (CR) jenkins-bot: Initial profiler for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[00:55:30] !log krinkle@tin Synchronized wmf-config: no-op, adding files for beta cluster (duration: 00m 59s)
[00:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:56] (PS29) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:01:48] (CR) Elukey: [C: 1] "looks good to me, https://puppet-compiler.wmflabs.org/compiler02/9824/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/406061 (owner: Dzahn)
[01:10:26] (PS1) Dzahn: grafana/racktables/iegreview/misc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/406970
[01:10:55] (CR) jerkins-bot: [V: -1] grafana/racktables/iegreview/misc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/406970 (owner: Dzahn)
[01:12:24] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:58] (PS2) Dzahn: grafana/racktables/iegreview/misc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/406970
[01:13:15] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 76226 bytes in 0.304 second response time
[01:18:00] (CR) Elukey: "Faidon: I had a chat with Arzhel about what timestamp to use to aggregate data in hdfs (hourly partitions) and Druid (hourly indexes for e" [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: Elukey)
[01:31:53] (PS30) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:35:09] (PS4) Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036)
[01:35:25] (CR) Elukey: "Correction: we thought to use stamp_inserted since it better represents flows, but everything is open to discussion :)" [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: Elukey)
[01:40:47] (PS31) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:51:25] !log catchpoint: recycled gwicke's user and turned it into a user for volans, upgraded him to admin (T162857)
[01:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:39] T162857: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857
[01:57:16] (PS32) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[02:05:10] for the next hour or so I will be unavailable except for emergencies
[02:23:29] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.17) (duration: 05m 46s)
[02:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:55] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1517365673 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2849592 keys, up 4 minutes - replication_delay is 1517365673
[02:29:04] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2831642 keys, up 5 minutes 4 seconds - replication_delay is 0
[02:35:04] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[02:35:44] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[02:36:05] Operations, RESTBase, Services, Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken / RESTBase not discoverable - https://phabricator.wikimedia.org/T138848#3933363 (Krinkle)
[02:36:09] Operations, RESTBase, RESTBase-API, Traffic, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933362 (Krinkle)
[02:48:44] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[02:49:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[03:12:19] Operations, Traffic, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3933395 (Krinkle)
[03:12:29] Operations, Performance-Team: HTTP responses from app servers sometimes stall for >1s - https://phabricator.wikimedia.org/T164248#3933397 (Krinkle) Closing in favour of T181315.
[03:13:07] Operations, Performance-Team: HTTP responses from app servers sometimes stall for >1s - https://phabricator.wikimedia.org/T164248#3933400 (Krinkle)
[03:13:10] Operations, Traffic, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3786217 (Krinkle)
[03:21:01] Operations, MediaWiki-Configuration, Availability (Multiple-active-datacenters), MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3933407 (tstarling) Per testing on deploy...
[03:25:04] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 753.21 seconds
[03:54:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 268.06 seconds
[03:58:37] (CR) Ayounsi: "I don't know enough about Kafka to +/-1. I left 1 comment though." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: Elukey)
[04:43:55] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:44:24] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:56:31] Operations, Analytics, ChangeProp, EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3933478 (EBernhardson)
[06:19:44] !log restart varnish backend on cp4024 - failed fetches / 503s
[06:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:18] 503s seem to be going down in grafana, will wait for the icinga recovery
[06:24:04] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
[06:31:35] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:32:14] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[07:04:54] (PS2) Marostegui: db-eqiad,db-codfw.php: Remove db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/406613 (https://phabricator.wikimedia.org/T184397)
[07:07:00] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/406613 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:08:10] !log Force BBU relearn on db1051 - T186049
[07:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:23] T186049: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049
[07:08:28] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/406613 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:08:40] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/406613 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:10:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1030, will be decommissioned - T184397 (duration: 00m 57s)
[07:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:31] T184397: Decommission db1030 - https://phabricator.wikimedia.org/T184397
[07:13:06] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1030, will be decommissioned - T184397 (duration: 00m 56s)
[07:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:17] (PS1) Marostegui: mariadb: Remove db1030 [puppet] - https://gerrit.wikimedia.org/r/406981 (https://phabricator.wikimedia.org/T184397)
[07:36:55] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9826/" [puppet] - https://gerrit.wikimedia.org/r/406981 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:38:05] (CR) Marostegui: [C: 2] mariadb: Remove db1030 [puppet] - https://gerrit.wikimedia.org/r/406981 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:41:26] (CR) Marostegui: [C: 2] s6.hosts: Remove db1030 [software] - https://gerrit.wikimedia.org/r/406843 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:42:07] (Merged) jenkins-bot: s6.hosts: Remove db1030 [software] - https://gerrit.wikimedia.org/r/406843 (https://phabricator.wikimedia.org/T184397) (owner: Marostegui)
[07:47:34] !log Remove db1030 from tendril - T184397
[07:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:48] T184397: Decommission db1030 - https://phabricator.wikimedia.org/T184397
[07:48:38] !log Stop MySQL on db1030 - T184397
[07:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:06] !log installing libxml security updates
[07:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:32] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1030 - https://phabricator.wikimedia.org/T184397#3933578 (Marostegui) a:Marostegui>Cmjohnson db1030 is now ready to be fully decommissioned by @Cmjohnson
[08:21:37] !log installing clamav security update on fermium
[08:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:40] Operations, Icinga, monitoring: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3933605 (Peachey88)
[08:57:10] Operations, Icinga, monitoring, Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3932784 (Peachey88)
[08:57:35] (PS1) Muehlenhoff: Remove access for akrausetud [puppet] - https://gerrit.wikimedia.org/r/406984
[08:58:44] (CR) Muehlenhoff: [C: 2] Remove access for akrausetud [puppet] - https://gerrit.wikimedia.org/r/406984 (owner: Muehlenhoff)
[09:10:44] PROBLEM - puppet last run on db1101 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:11:04] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:11:14] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:04] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:05] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:44] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:34] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:15:00] ^ puppetdb OOMed on nitrogen and got auto-restarted by systemd
[09:16:14] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:29:03] ACKNOWLEDGEMENT - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Jcrespo https://phabricator.wikimedia.org/T186049
[09:32:41] !log rolling restart of thumbor/nginx to pick up libxml security update
[09:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:34] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:40:44] RECOVERY - puppet last run on db1101 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:41:05] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:42:05] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:42:05] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:42:44] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:31:28] via cp3043 frontend, Varnish XID 747420377
[10:31:29] Upstream caches: cp3043 int
[10:31:29] Error: 500, Internal Server Error at Wed, 31 Jan 2018 10:31:13 GMT
[10:31:45] site's down
[10:33:09] jynus / marostegui ^^
[10:34:01] weird, it seems to only happen with https://meta.wikimedia.org/wiki/Special:EditMassMessageList/Meta:Administrators/Mass-message_list
[10:34:07] :S
[10:37:28] (CR) Filippo Giunchedi: Lower Thumbor subprocess timeout to 59 seconds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/405698 (https://phabricator.wikimedia.org/T185479) (owner: Gilles)
[10:39:08] (CR) Filippo Giunchedi: "LGTM, it would be nice to have PCC output too" [puppet] - https://gerrit.wikimedia.org/r/406794 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[10:48:47] (CR) Filippo Giunchedi: "LGTM overall, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: Gehel)
[10:49:10] (CR) Filippo Giunchedi: [C: 1] wdqs: remove cleanup code after migrating to prometheus jmx exporter [puppet] - https://gerrit.wikimedia.org/r/405888 (https://phabricator.wikimedia.org/T182773) (owner: Gehel)
[10:57:53] (CR) Filippo Giunchedi: [C: 1] Ensure all packages are updated when d-i installs security updates [puppet] - https://gerrit.wikimedia.org/r/405026 (owner: Muehlenhoff)
[10:58:24] PROBLEM - HHVM rendering on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:58:56] (CR) Filippo Giunchedi: [C: 1] Metrics are exposed by Blazegraph directly [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/405878 (https://phabricator.wikimedia.org/T182857) (owner: Gehel)
[10:59:15] RECOVERY - HHVM rendering on mw2253 is OK: HTTP OK: HTTP/1.1 200 OK - 75975 bytes in 0.367 second response time
[11:06:11] (CR) Lokal Profil: Drop the medlem user group and editallpages user right (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: Lokal Profil)
[11:22:39] (CR) MarcoAurelio: Drop the medlem user group and editallpages user right (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: Lokal Profil)
[11:24:24] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:25:14] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 75975 bytes in 0.305 second response time
[11:31:52] <_joe_> Hauskatze: that seems like a MediaWiki bug to me
[11:32:12] <_joe_> worth reporting on a task I think
[11:32:24] _joe_: done already :)
[11:32:30] <_joe_> heh ok :P
[11:32:31] it seems it's with MassMessage
[11:33:03] unfortunately I cannot see the error logs to know exactly what's causing the failure
[11:33:20] <_joe_> let me take a look
[11:33:27] for reference T186098 _joe_
[11:33:28] T186098: Special:EditMassMessageList serves Error 500 - https://phabricator.wikimedia.org/T186098
[11:33:28] <_joe_> what's the task?
[11:33:32] <_joe_> lol
[11:33:33] ^^
[11:33:36] heh
[11:36:18] <_joe_> Hauskatze: I've reported the error, but I won't touch the code myself right now, I'm too jetlagged to not screw it up
[11:36:34] <_joe_> (I should also not be here, but well :P)
[11:37:09] _joe_: you've been a big help here :)
[11:37:14] thanks
[11:37:23] <_joe_> yw :)
[11:37:23] and take a nap for the jetlag
[11:37:38] <_joe_> actually, I'm trying *not* to fall asleep :P
[11:38:05] Nespresso :P
[11:39:01] && MWNamespace::getRestrictionLevels( $this->title->getNamespace() ) !== [ '' ]
[11:39:05] hmm
[11:39:49] hah
[11:39:52] revi: fixed
[11:40:03] I unprotected the page and it works now
[11:40:16] but I'm a sysop so I should be able to edit it nonetheless
[11:40:22] -or- display a nice warning
[11:40:30] hmmmmmmm
[11:41:10] kinda huh stuff
[11:41:15] anyway good to know
[11:43:58] MWNamespace
[11:44:04] is that deprecated, new?
[12:14:15] (PS1) Matthias Mullie: [WIP] Add 3d2png scap targets [puppet] - https://gerrit.wikimedia.org/r/406997
[12:15:10] Operations, DBA, hardware-requests, Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (Marostegui) This task is now pending the last steps of the decommissioning process for labsdb1001 and labsdb1003 (T184832). I have tried to update the spreadsheet on...
[12:25:29] !log Fix replication on labsdb1004
[12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:59] (PS2) Gilles: Lower Thumbor subprocess timeout to 59 seconds [puppet] - https://gerrit.wikimedia.org/r/405698 (https://phabricator.wikimedia.org/T185479)
[12:27:04] (CR) Gilles: Lower Thumbor subprocess timeout to 59 seconds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/405698 (https://phabricator.wikimedia.org/T185479) (owner: Gilles)
[12:43:24] PROBLEM - HHVM rendering on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:44:15] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 75943 bytes in 0.313 second response time
[13:12:54] PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:13:44] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 75867 bytes in 1.052 second response time
[14:29:29] (PS4) Filippo Giunchedi: prometheus: bump global retention to 15 months [puppet] - https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677)
[14:32:05] (PS5) Filippo Giunchedi: prometheus: bump global retention to 15 months [puppet] - https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677)
[14:32:47] (CR) Filippo Giunchedi: [C: 2] prometheus: bump global retention to 15 months [puppet] - https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677) (owner: Filippo Giunchedi)
[14:37:13] !log bump prometheus global instance retention to 15 months - T160677
[14:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:27] T160677: Effects on adjusting Prometheus retention - https://phabricator.wikimedia.org/T160677
[14:38:02] Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3934050 (MoritzMuehlenhoff)
[14:38:04] Operations, HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3934046 (MoritzMuehlenhoff) stalled>Resolved a:MoritzMuehlenhoff HHVM is available for stretch-wikimedia for quite a while now (used by the video scalers).
[14:39:08] Operations, ops-eqiad, hardware-requests, HHVM, and 2 others: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3934051 (MoritzMuehlenhoff) a:Cmjohnson
[14:39:33] Operations, ops-eqiad, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902917 (MoritzMuehlenhoff)
[14:43:49] (PS1) Filippo Giunchedi: prometheus: default to storage encoding version 2 [puppet] - https://gerrit.wikimedia.org/r/407011 (https://phabricator.wikimedia.org/T160677)
[14:52:12] Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3934066 (MoritzMuehlenhoff) During the SRE offsite/onsite we came up with the following plan: * We need to remove PHP 5 usage by May 2018 (branch date f...
[15:07:51] (PS1) Zoranzoki21: Enable ArticlePlaceholder for Estonian Wikipedia (etwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/407017 (https://phabricator.wikimedia.org/T186107)
[15:08:39] (PS2) Zoranzoki21: Enable ArticlePlaceholder for Estonian Wikipedia (etwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/407017 (https://phabricator.wikimedia.org/T186107)
[15:13:41] (CR) Filippo Giunchedi: [C: 2] "PCC https://puppet-compiler.wmflabs.org/compiler03/9828/prometheus1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/407011 (https://phabricator.wikimedia.org/T160677) (owner: Filippo Giunchedi)
[15:13:46] (PS2) Filippo Giunchedi: prometheus: default to storage encoding version 2 [puppet] - https://gerrit.wikimedia.org/r/407011 (https://phabricator.wikimedia.org/T160677)
[15:23:12] Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3934147 (Imarlier) @MoritzMuehlenhoff Is someone actively working on dumps? I haven't seen movement on https://phabricator.wikimedia.org/T117534 Wikite...
[15:34:55] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current): Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3934181 (akosiaris)
[15:34:59] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused
[15:35:59] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[15:38:15] (PS1) Alexandros Kosiaris: Set ores1* as spare::system [puppet] - https://gerrit.wikimedia.org/r/407018 (https://phabricator.wikimedia.org/T171851)
[15:40:53] (CR) Alexandros Kosiaris: [C: 2] Set ores1* as spare::system [puppet] - https://gerrit.wikimedia.org/r/407018 (https://phabricator.wikimedia.org/T171851) (owner: Alexandros Kosiaris)
[15:44:19] !log reimage ores100{1..9} T171851
[15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:32] T171851: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851
[15:54:00] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[15:54:09] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[15:54:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[15:59:01] <_joe_> uh oh
[16:02:28] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /geoline?getgeojson=1&ids={ids} (Moscow) is CRITICAL: Test Moscow returned the unexpected status 404 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@
[16:02:28] Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile
[16:02:28] m-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting: 200)
[16:04:08] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.024 second response time
[16:04:15] (PS1) RobH: depooling ulsfo to swap switches [dns] - https://gerrit.wikimedia.org/r/407022 (https://phabricator.wikimedia.org/T185228)
[16:10:15] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[16:12:48] (CR) Andrew Bogott: "Thanks for this! I'm surprised that Jenkins doesn't still hate it though, because we now include roles from other roles?" [puppet] - https://gerrit.wikimedia.org/r/406954 (owner: Dzahn)
[16:14:20] Operations, ops-codfw, DBA: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3934359 (RobH) p:Triage>Normal
[16:15:25] (CR) BBlack: [C: 1] depooling ulsfo to swap switches [dns] - https://gerrit.wikimedia.org/r/407022 (https://phabricator.wikimedia.org/T185228) (owner: RobH)
[16:15:38] (CR) RobH: [C: 2] depooling ulsfo to swap switches [dns] - https://gerrit.wikimedia.org/r/407022 (https://phabricator.wikimedia.org/T185228) (owner: RobH)
[16:15:58] !log depooling ulsfo for https://phabricator.wikimedia.org/T185228
[16:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:19] !log Optimize wbc_entity_usage on s6 on db1102
[16:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[16:26:57] Operations, ops-codfw, DBA: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3934420 (Papaul) Rack location = D5
[16:36:32] maps-test alert is an expired downtime, I'm re-disabling alerts on maps-test2004
[16:42:22] (CR) Elukey: profile::analytics::refinery::job::camus: add netflow hourly job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: Elukey)
[16:44:33] (PS5) Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036)
[16:47:27] (PS6) Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036)
[16:51:19] Operations, Cloud-VPS, cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3934473 (notconfusing) @madhuvishy Yes indeed, the service is still active and used by many community members. I added `rm -rf /home/maximilianklein/W...
[17:01:45] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:02:46] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 75279 bytes in 0.589 second response time
[17:08:34] (PS1) Aaron Schulz: Move "file:" prefix into CONFIG_FILE variable [debs/mcrouter] - https://gerrit.wikimedia.org/r/407030
[17:10:15] _joe_: is it fine to just use the enable: option on the puppet Service resource for mcrouter?
[17:10:51] <_joe_> AaronSchulz: for now, yes, but if we plan to take it to production, I want to revisit the package
[17:11:08] sure
[17:21:38] (PS33) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[17:22:13] (CR) jerkins-bot: [V: -1] [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[17:33:29] Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3934580 (MoritzMuehlenhoff) >>! In T176370#3934147, @Imarlier wrote: > @MoritzMuehlenhoff Is someone actively working on dumps? I haven't seen movement...
[17:43:45] Operations, Cloud-VPS, cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3934616 (madhuvishy) @notconfusing Great, thank you!
[17:43:58] 10Operations, 10Traffic, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3934617 (10BBlack) TL;DR: The network itself doesn't seem to be at fault. Whatever this is, it probably affects esams more than othe... [17:55:44] (03PS34) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [17:56:03] (03PS3) 10Dzahn: rename piwik::server to just piwik [puppet] - 10https://gerrit.wikimedia.org/r/406061 [17:56:36] (03CR) 10Dzahn: [C: 032] rename piwik::server to just piwik [puppet] - 10https://gerrit.wikimedia.org/r/406061 (owner: 10Dzahn) [17:57:41] _joe_: can I get a quick CR on https://gerrit.wikimedia.org/r/#/c/407030/1? I can't get the service to start without that. [18:02:21] (03PS1) 10Dzahn: remove hieradata/piwik/server.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/407032 [18:03:07] (03PS2) 10Dzahn: remove hieradata/piwik/server.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/407032 [18:03:16] (03CR) 10Dzahn: [V: 032 C: 032] remove hieradata/piwik/server.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/407032 (owner: 10Dzahn) [18:06:04] (03PS7) 10Andrew Bogott: openstack: unified role wikitech/horizon/striker,apache -> httpd [puppet] - 10https://gerrit.wikimedia.org/r/406954 (owner: 10Dzahn) [18:06:50] (03CR) 10Andrew Bogott: [C: 032] openstack: unified role wikitech/horizon/striker,apache -> httpd [puppet] - 10https://gerrit.wikimedia.org/r/406954 (owner: 10Dzahn) [18:08:22] (03CR) 10Andrew Bogott: "applied and everything looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/406954 (owner: 10Dzahn) [18:08:43] <_joe_> AaronSchulz: uhm that's pretty strange indeed [18:08:49] (03CR) 10Giuseppe Lavagetto: [C: 032] Move "file:" prefix into CONFIG_FILE variable [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/407030 (owner: 10Aaron Schulz) [18:09:27] <_joe_> AaronSchulz: I'll build a new package tomorrow though [18:10:10] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3934714 (10fgiunchedi) @Cmjohnson the machine can be shut at will since it isn't in production. Looks like the intel ssd failed to respond at some point, and/or the controller didn't like i... [18:10:21] !log deactivating bgp session from ulsfo to office [18:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:05] mutante: after merging https://gerrit.wikimedia.org/r/#/c/406954/7/modules/openstack/manifests/horizon/service.pp can I still use notify => Service['apache2'] elsewhere, or is the service name different? [18:16:23] (I'm pretty sure the answer is yes but want to doublecheck) [18:16:30] andrewbogott: the service name stays the same [18:16:37] ok, great [18:16:54] it's mostly a copy of the previous module just organized differently [18:17:18] parsercache misses seem to have incremented a lot lately: [18:17:30] andrewbogott: wow, you already merged that. thanks, nice! [18:17:44] https://grafana.wikimedia.org/dashboard/db/edit-stash?panelId=9&fullscreen&orgId=1&refresh=5m&from=now-30d&to=now [18:17:49] yeah, wanted to get my WIP patch on top if it :) [18:18:46] andrewbogott: yes, it still hates "include role from other role" but it agrees because more other violations are fixed in same change. 
it checks the "delta" [18:18:47] (03PS2) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:19:02] heh, ok :) [18:19:23] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:20:34] eh, yea, that's the 2 remaining role includes.. uhmm [18:20:55] but the number is greatly reduced,heh [18:22:35] andrewbogott: i understand if you want to override this one, we can fix the memcached thing later probably [18:24:26] andrewbogott: wait, i think we can fix it:) [18:26:17] the ldap bit is easy enough to fix [18:26:39] (03CR) 10Dzahn: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:26:51] andrewbogott: if we use class { '::memcached': like in modules/openstack/manifests/wikitech/openstack_manager.pp [18:27:01] to replace the "include ::memcached" that should do it [18:29:38] (03PS3) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:30:16] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:34:35] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:36:06] i'll look at silver [18:37:28] (03PS1) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [18:37:58] (03CR) 10jerkins-bot: [V: 04-1] ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 (owner: 10Andrew Bogott) [18:38:19] * Hauskatze wonders if we have 'gold' server [18:39:16] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:40:00] awww.. and i just said it should be fine [18:40:27] !log putting all ulsfo servers into maint mode [18:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:41] (03PS2) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [18:42:13] (03CR) 10jerkins-bot: [V: 04-1] ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 (owner: 10Andrew Bogott) [18:43:25] (03PS1) 10Framawiki: Rename Project NS on Wikimedia Canada Chapter's wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407040 (https://phabricator.wikimedia.org/T185661) [18:45:24] andrewbogott: i'll create a follow-up for silver/labtestweb .. 
the service name stays the same but "Service[apache2] doesn't seem to be in the catalog" [18:45:24] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4021.ulsfo.wmnet [18:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:41] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4022.ulsfo.wmnet [18:45:48] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4023.ulsfo.wmnet [18:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:55] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet [18:46:03] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4025.ulsfo.wmnet [18:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:07] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4026.ulsfo.wmnet [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:15] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4027.ulsfo.wmnet [18:46:19] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4028.ulsfo.wmnet [18:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:25] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4029.ulsfo.wmnet [18:46:30] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4030.ulsfo.wmnet [18:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:37] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4031.ulsfo.wmnet [18:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:43] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4032.ulsfo.wmnet [18:46:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:40] (03PS3) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [18:47:45] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3934835 (10RobH) [18:47:49] mutante: weird, thanks [18:48:31] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: connect to address 198.35.26.112 and port 443: Connection refused [18:48:51] PROBLEM - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:863:ed1a::2:b and port 80: Connection refused [18:48:57] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: connect to address 198.35.26.112 and port 80: Connection refused [18:49:04] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:863:ed1a::2:b and port 443: Connection refused [18:49:11] <_joe_> robh: wtt? [18:49:20] uhhhhh, i depooled all of the cp systems [18:49:20] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:863:ed1a::1 and port 443: Connection refused [18:49:30] <_joe_> jesus [18:49:30] so this is an alert we should be able to ignore [18:49:33] ulsfo is passive, right? [18:49:34] since its depooled by dns [18:49:39] i just didnt get all the right alerts [18:49:40] <_joe_> ohh ok [18:49:41] correct! 
[18:49:45] <_joe_> I was just freaked out [18:50:00] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: connect to address 198.35.26.96 and port 443: Connection refused [18:50:13] me too [18:50:15] andrewbogott: it's because wikitech::openstack_manager class doesn't contain the "include ::apache" anymore. for californium it's fine because the new unified role replaces it, but silver/labtestweb are using it too.. [18:50:20] PROBLEM - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:863:ed1a::1 and port 80: Connection refused [18:50:24] let me ack them in icinga [18:50:28] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: connect to address 198.35.26.96 and port 80: Connection refused [18:50:30] and thanks for confirming those alerts :) [18:50:30] so they stop freaking folks out [18:50:55] <_joe_> robh: not sure depooling all servers is a good idea tbh [18:51:04] <_joe_> but I have no idea what you're doing :) [18:51:08] they're going to lose all networking right now [18:51:10] for two hours [18:51:17] we're replacing the network switch stack [18:51:18] yeah ulsfo is depooled in DNS [18:51:30] so my understanding is my shitty pager storm is annoying but not a critical issue [18:51:32] so this ultimately shouldn't be a problem, but FTR, don't depool all the ulsfo hosts individually :/ [18:51:47] bblack: ok, should i put them back so they arent manually depooled? 
[18:51:48] that doesn't work, and there are edge cases that break when the total pooled host count is zero [18:51:52] yes [18:51:56] ok, will do now [18:52:08] ok, known, thanks for the ack [18:52:16] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4021.ulsfo.wmnet [18:52:20] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4022.ulsfo.wmnet [18:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:31] robh: assuming none were intentionally already depooled before you started, that is [18:52:33] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 379 bytes in 0.157 second response time [18:52:34] bblack: so do i need to do anything other than the dns and icinga acks before unplugging [18:52:39] RECOVERY - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 392 bytes in 0.157 second response time [18:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:43] none were intentionally depooled, i can see the history of each changing [18:52:45] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 889 bytes in 0.442 second response time [18:53:17] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4023.ulsfo.wmnet [18:53:20] from a depooling perspective, we just depool in DNS in this scenario, not per-host. For icinga, you'll have to downtime all the hosts and then the LVS checks as well. 
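The edge case mentioned above (things breaking when the total pooled host count reaches zero) could be caught before any per-host depool is issued. The following is a minimal Python sketch under stated assumptions: `plan_depool`, the `pooled` mapping and the `min_pooled` parameter are hypothetical names chosen for illustration, and nothing here is part of conftool or any other tooling named in this log.

```python
from typing import Dict, Iterable, List


def plan_depool(pooled: Dict[str, bool], targets: Iterable[str],
                min_pooled: int = 1) -> List[str]:
    """Return the targets that are safe to depool.

    Hypothetical helper: `pooled` maps host name -> current pooled state.
    Raises ValueError if depooling all targets would leave fewer than
    `min_pooled` hosts pooled.
    """
    # Keep only targets that are currently pooled, dropping duplicates
    # while preserving the caller's order.
    to_depool = [h for h in dict.fromkeys(targets) if pooled.get(h, False)]
    # True counts as 1, so this is the number of hosts pooled right now.
    remaining = sum(pooled.values()) - len(to_depool)
    if remaining < min_pooled:
        raise ValueError(
            f"refusing to depool {len(to_depool)} hosts: "
            f"only {remaining} would stay pooled"
        )
    return to_depool


if __name__ == "__main__":
    # Assumed state: the twelve ulsfo cache hosts from the log, all pooled.
    state = {f"cp40{n}.ulsfo.wmnet": True for n in range(21, 33)}
    print(plan_depool(state, ["cp4021.ulsfo.wmnet"]))
    try:
        plan_depool(state, list(state))
    except ValueError as exc:
        print(exc)
```

With a guard like this, depooling a single host passes, while an attempt to depool every host is rejected, which leaves the site-level DNS depool as the way to drain a whole site.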
[18:53:22] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4024.ulsfo.wmnet [18:53:25] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4025.ulsfo.wmnet [18:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:31] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4026.ulsfo.wmnet [18:53:31] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 891 bytes in 0.319 second response time [18:53:35] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4028.ulsfo.wmnet [18:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:41] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4027.ulsfo.wmnet [18:53:46] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4029.ulsfo.wmnet [18:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:51] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 523 bytes in 0.157 second response time [18:53:53] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4030.ulsfo.wmnet [18:53:57] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 510 bytes in 0.157 second response time [18:53:58] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4031.ulsfo.wmnet [18:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:02] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet [18:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:13] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17066 bytes in 0.522 second response time [18:54:18] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:49] what's going on? I'm only a little bit here unfortunately [18:54:53] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17062 bytes in 0.534 second response time [18:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:56] apergos: ignore it [18:54:59] its me in ulsfo [18:55:04] ok [18:55:06] gtk [18:55:09] ;] [19:00:50] mutante: ok, now regarding https://gerrit.wikimedia.org/r/#/c/407039/... [19:01:03] there are lots of classes that include ::ldap::role::config::labs [19:01:07] (03PS1) 10Dzahn: wmcs/wikitech: follow-up for apache->httpd conversion [puppet] - 10https://gerrit.wikimedia.org/r/407042 [19:01:29] so, should that class be ::ldap::config::labs or profile::ldap::config::labs? [19:01:40] It seems like both will get flagged, since including a profile from a non-profile class is bad [19:01:48] but also including cross-modules is bad, right? [19:04:21] _joe_: ^ do you have advice on that one maybe? [19:04:45] I assume there must be a pattern for 'tiny utility modules that will definitely be included by other modules all the time' [19:04:54] maybe I'm mistaken thinking that that's currently frowned on [19:07:00] andrewbogott: let me fix silver/labtestweb with https://gerrit.wikimedia.org/r/#/c/407042/ [19:07:23] i'm not sure about the ldap one yet either.. hmm [19:11:38] PROBLEM - Host lvs4005 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:48] PROBLEM - Host lvs4006 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:50] yeah thats me. 
[19:13:17] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0 [19:13:32] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9830/" [puppet] - 10https://gerrit.wikimedia.org/r/407042 (owner: 10Dzahn) [19:14:17] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [19:15:19] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/407042/" [puppet] - 10https://gerrit.wikimedia.org/r/406954 (owner: 10Dzahn) [19:15:24] (03PS4) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [19:15:27] bblack: mind maint moding the bast system? [19:16:02] andrewbogott: silver/labtestweb2001 fixed now [19:16:07] robh: you already did [19:19:37] mutante: thanks! [19:20:07] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:20:13] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:23:14] mutante: I'm also really confused by things like "profile 'profile::hue' includes non-profile class ldap::config::labs" [19:23:20] isn't the point of a profile to include other classes? [19:23:29] Or is there some reason why we can define classes but not include them? [19:24:36] (03PS5) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [19:26:47] andrewbogott: we are supposed to turn everything into profiles and then we can include the profiles in the role class [19:27:19] mutante: ok, and then profiles will include… what? 
[19:27:34] mutante: that warning is about me including something in a profile [19:27:46] As though profiles can only include other profiles [19:30:12] andrewbogott: they are supposed to be declaring a class instead of using include. i was not expecting that it _also_ says "declares class from another class" though. but that is because it's inside a module [19:30:23] " No resource should be added to a profile using the include class method, but with explicit class instantiations. Only very specific exceptions are allowed, like global classes like the network::constants class." [19:30:34] ok, I guess that rule is new to me [19:30:37] " If a profile needs another one as a precondition, it must be listed with a require ::profile::foo at the start of the class, but profile cross-dependencies should be mostly avoided." [19:30:47] ^ ah.. "require" [19:30:50] and… I don't understand what the difference is, but ok [19:31:13] I mean, 'include' vs class {classname: } [19:33:19] andrewbogott: i think to make sure each resource exists only once and there can't be "overlapping roles" as in "lets you create overlapping “role” classes where a given node can have more than one role." https://puppet.com/docs/puppet/5.3/lang_classes.html#include-like-vs-resource-like [19:33:50] eh,"each class exists only once" but don't quote me on it, heh [19:34:12] (03PS4) 10BBlack: eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) [19:34:56] (03CR) 10jerkins-bot: [V: 04-1] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [19:36:51] openstack::horizon::source_deploy would have to become a profile .. then it would be ok that it declares the memcached class. the issue is always the chicken-egg thing that you have to refactor everything and where to start.. 
i feel you [19:37:26] (03PS5) 10BBlack: eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) [19:37:59] (03CR) 10jerkins-bot: [V: 04-1] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [19:38:13] i haven't seen "contain" before. like " contain '::openstack::horizon::source_deploy' [19:39:02] (03PS6) 10Andrew Bogott: ldap: move things from ldap::role to profile::ldap [puppet] - 10https://gerrit.wikimedia.org/r/407039 [19:39:31] (03CR) 10BBlack: [C: 031] "Remaining style violations in jenkins' -1 are copypasta violations present at other existing datacenters' identical declarations (in other" [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [19:39:41] 10Operations, 10ops-codfw, 10DBA: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3934984 (10Papaul) [19:45:05] (03PS4) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [19:46:14] (03PS5) 10Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) [19:46:37] (03CR) 10Paladox: [C: 04-1] "We can do this after the gerrit 2.14 upgrade. As this is not required to do the actual upgrade." 
[puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [19:46:47] hmm [19:46:52] i never -1 it [19:47:11] ah [19:47:12] i did [19:51:16] (03PS3) 10Paladox: Gerrit: Set gitiles configuation to be used as the repo viewer [puppet] - 10https://gerrit.wikimedia.org/r/401799 (https://phabricator.wikimedia.org/T184116) [19:51:31] (03CR) 10Paladox: "(rebased (merge conflict))" [puppet] - 10https://gerrit.wikimedia.org/r/401799 (https://phabricator.wikimedia.org/T184116) (owner: 10Paladox) [19:55:13] (03CR) 10Paladox: "(requires manual cleanup to remove the lib from the libs folder)" [puppet] - 10https://gerrit.wikimedia.org/r/385105 (owner: 10Paladox) [19:55:15] (03CR) 10Andrew Bogott: "this seems to not break labs, although it will require close watching when merged." [puppet] - 10https://gerrit.wikimedia.org/r/407039 (owner: 10Andrew Bogott) [19:56:10] (03PS14) 10Paladox: ircecho: Support ssl when connecting to irc [puppet] - 10https://gerrit.wikimedia.org/r/405591 [19:58:49] (03PS6) 10BBlack: eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) [19:59:28] (03CR) 10jerkins-bot: [V: 04-1] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [20:00:10] (03CR) 10BBlack: [C: 031] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [20:00:41] (03CR) 10jerkins-bot: [V: 04-1] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [20:00:52] whatever jenkins :P [20:05:18] (03CR) 10Framawiki: New throttle rule, clean obsolete rules (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406820 (https://phabricator.wikimedia.org/T186002) (owner: 
10Urbanecm) [20:08:59] yea, unfortunately " declares interface::add_ip6_mapped" needs to be ignored [20:09:14] we should fix that globally though [20:11:22] hey people. Are there any areas of the foundation running php7 in production? [20:11:56] (03PS35) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [20:13:48] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3935082 (10hashar) 05Open>03stalled a:05hashar>03None Pending https://gerrit.wikimedia.or... [20:15:30] jgleeson: yea, for example Continuous integration and Phabricator if on stretch [20:15:34] but not that many yet [20:15:58] (03PS1) 10Chad: Kill WikipediaMobileFirefoxOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407051 (https://phabricator.wikimedia.org/T171255) [20:16:02] there will be more "misc" services over time. switching the actual Mediawiki appservers is a separate and big project [20:17:01] jgleeson: did you get my reply. the answer is "yes, but not many" [20:17:18] no I dropped off, thanks! 
[20:18:02] (03PS1) 10Jforrester: Drop archaïc WikipediaMobileFirefoxOS code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407052 [20:19:02] (03Abandoned) 10Jforrester: Drop archaïc WikipediaMobileFirefoxOS code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407052 (owner: 10Jforrester) [20:19:08] (03CR) 10Jforrester: [C: 031] Kill WikipediaMobileFirefoxOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407051 (https://phabricator.wikimedia.org/T171255) (owner: 10Chad) [20:19:16] mutante, I'm working through some patches to make fundraising tech projects work on php7 due to the recent switch on mediawiki vagrant and it dawned on me that we're potentially updating discrete projects with fixes unnecessary in production and the possibility of that being unwanted until we move to php7 [20:20:12] jgleeson: moving mw appservers to PHP7 is a separate project, but it doesn't block moving other "misc" services to php7 [20:21:05] does it run as an apache module or a daemon like fpm etc? [20:21:06] (03CR) 10Chad: [C: 032] Kill WikipediaMobileFirefoxOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407051 (https://phabricator.wikimedia.org/T171255) (owner: 10Chad) [20:22:47] (03Merged) 10jenkins-bot: Kill WikipediaMobileFirefoxOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407051 (https://phabricator.wikimedia.org/T171255) (owner: 10Chad) [20:22:56] (03CR) 10jenkins-bot: Kill WikipediaMobileFirefoxOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407051 (https://phabricator.wikimedia.org/T171255) (owner: 10Chad) [20:24:39] LOL [20:24:40] Yes, it will break the app for those users who have FirefoxOS, but [20:24:41] let's be completely honest: if they're still running FirefoxOS they're [20:24:41] using an unsupported platform that few people /ever/ used to begin [20:24:41] with. They can deal with it. [20:25:07] cwd: it could be different for each module. 
phabricator is using the module [20:25:15] !log demon@tin Synchronized docroot/wikimedia.org/: bye bye firefox os. you will (not) be missed (duration: 00m 58s) [20:25:27] eh, sorry, the first "module" = puppet module, the second "module" = apache module in that sentence [20:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:05] mutante: cool, we still use modphp but have been talking about trying out something faster [20:26:28] RECOVERY - Host lvs4006 is UP: PING WARNING - Packet loss = 93%, RTA = 78.44 ms [20:26:37] RECOVERY - Host lvs4005 is UP: PING OK - Packet loss = 0%, RTA = 78.51 ms [20:26:57] cwd: there has been the same suggestion to replace it on phabricator https://phabricator.wikimedia.org/T185644#3930087 [20:27:29] but that hasn't happened yet [20:28:45] ah cool, thanks! [20:28:55] man there are a lot of ways to run php these days [20:29:22] !log demon@tin Synchronized .gitmodules: consistency (duration: 00m 54s) [20:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:40] 10Operations, 10Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968#3929788 (10BBlack) In both cases the child was killed with signal 9 by the kernel oom-killer. It may be the case that our memory cache sizing is very tight in general, and that overheads have increased... [20:31:27] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [20:31:27] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [20:38:20] (03PS1) 10Ayounsi: add asw2-ulsfo to dns [dns] - 10https://gerrit.wikimedia.org/r/407058 [20:39:19] (03CR) 10Ayounsi: [C: 032] add asw2-ulsfo to dns [dns] - 10https://gerrit.wikimedia.org/r/407058 (owner: 10Ayounsi) [20:52:08] PROBLEM - puppet last run on mw1336 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:52:30] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:52:46] (PS1) Ayounsi: Renaming asw-ulsfo to asw2-ulsfo in Icinga, Smokeping and Rancid [puppet] - https://gerrit.wikimedia.org/r/407059 (https://phabricator.wikimedia.org/T185228) [20:56:12] (CR) Ayounsi: [C: 2] Renaming asw-ulsfo to asw2-ulsfo in Icinga, Smokeping and Rancid [puppet] - https://gerrit.wikimedia.org/r/407059 (https://phabricator.wikimedia.org/T185228) (owner: Ayounsi) [21:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180131T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:04:57] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [21:06:21] Operations, Traffic, Wikimedia-Site-requests: oudated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3935268 (bd808) [21:08:38] Nothing for ORES [21:17:37] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:22:07] RECOVERY - puppet last run on mw1336 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:22:27] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:23:45] Operations, ops-ulsfo, Traffic, netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3935345 (RobH) [21:25:08] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:26:18] PROBLEM - Long running screen/tmux on mwlog1001 is CRITICAL: CRIT: Long running tmux process. (PID: 9884, 1736110s 1728000s). [21:27:37] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:32:43] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:34:53] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [21:35:10] !log fixed icinga config for cp4024 parents [21:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:24] the issues on cp4* are not unexpected but the mw1346 is [21:37:43] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:38:33] ^ false alarm [21:42:42] ok [21:48:42] what bot echoes phab? [21:48:44] it's not running...
[21:50:02] robh wikibugs [21:51:22] !log mholloway-shell@tin Started deploy [mobileapps/deploy@18d263a]: Update mobileapps to 3d717fa [21:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:53] (PS1) Krinkle: Improve wmf-config file documentation headers [mediawiki-config] - https://gerrit.wikimedia.org/r/407152 [21:57:33] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@18d263a]: Update mobileapps to 3d717fa (duration: 06m 11s) [21:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:04] (PS2) Krinkle: Improve wmf-config file documentation headers [mediawiki-config] - https://gerrit.wikimedia.org/r/407152 [22:10:13] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:16:03] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [22:21:03] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [22:41:09] Operations, Phabricator, Patch-For-Review: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644#3935575 (mmodell) p:High>Low [22:44:26] Operations, Phabricator, Release-Engineering-Team (Kanban): Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3935586 (mmodell) a:mmodell As discussed with @dzahn at all hands, we should probably ju... [22:45:39] Niharika: If you find the t-shirt box, i deserve one too, sadly, from the past :) [22:46:15] matanya: Sadly? :P What did you do to deserve one?
:) [22:46:27] I broke the mailing lists [22:47:31] And maybe more stuff, can't remember exactly how low in puppet was my breakage [22:47:51] I am sure though the mailing lists were down for a good few hours [22:48:42] matanya: Ha, wow. Nice. [22:48:53] I'm gonna save you one if I ever find that box. [22:49:04] thanks :D [22:52:37] Niharika: for fun: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140714-Lists [22:53:18] "As of now, 2600 outgoing emails are still in the queue, which is slowly getting to zero." Haha wow! That's a lot! [22:53:20] (PS1) BBlack: Revert "depooling ulsfo to swap switches" [dns] - https://gerrit.wikimedia.org/r/407169 [22:53:31] (PS2) BBlack: Revert "depooling ulsfo to swap switches" [dns] - https://gerrit.wikimedia.org/r/407169 [22:53:48] 16 hours, phew, i did a good breakage [22:54:01] :D [22:55:25] !log un-downtiming various ulsfo things [22:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:48] Niharika: ya'll couldn't find the box? :( I didn't have time to look around when I was there. [22:59:29] greg-g: I hadn't broken the wikis then! :) I didn't go looking for it yet though. Will ask Robert when I see him. Thank you! [22:59:39] Operations, Patch-For-Review, cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3935662 (chasemp) [23:00:20] Niharika: I just asked Lani, she knows where it is (she's up here in Sonoma with us) [23:00:24] !log restarting ulsfo varnish-fe processes [23:00:34] Niharika: I told her you get one :) [23:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:58] greg-g: Haha, thanks!
:D [23:04:12] (CR) BBlack: [C: 2] Revert "depooling ulsfo to swap switches" [dns] - https://gerrit.wikimedia.org/r/407169 (owner: BBlack) [23:04:33] !log re-pooling ulsfo in DNS [23:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:07] !log re-pooling ulsfo in DNS - T185228 [23:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:18] T185228: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228 [23:09:52] (PS1) Papaul: DNS: Add mgmt dns entries for tendril2001 [dns] - https://gerrit.wikimedia.org/r/407171 [23:11:42] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3935714 (Papaul) [23:15:46] Operations, ops-codfw, DBA, netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (Papaul) p:Triage>Normal [23:28:13] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:34:44] (PS1) Papaul: Decom: Remove mgmt DNS entries for db201[6-9],db2023 and db202[8-9] [dns] - https://gerrit.wikimedia.org/r/407173 [23:39:23] (PS2) Papaul: Decom: Remove mgmt DNS entries for db201[6-9],db2023 and db202[8-9] [dns] - https://gerrit.wikimedia.org/r/407173 [23:47:21] Operations, MediaWiki-Platform-Team, HHVM, NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3935800 (bd808) >>! In T176370#3934580, @MoritzMuehlenhoff wrote: >>>! In T176370#3934147, @Imarlier wrote: >> Wikitech isn't represented here (currently...
[23:48:03] (PS1) Chad: Expose a simple Swagger spec for checks [mediawiki-config] - https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) [23:51:12] Operations, Patch-For-Review, cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3935805 (Dzahn) racktables not worth it anymore? almost replaced by netbox. Netbox access should automatically come with the LDAP groups. (https://netbox.wikimedia.org/login... [23:53:55] Operations, Phabricator, Patch-For-Review: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644#3935820 (Dzahn) @Joe and others: also see T182832 now [23:56:17] !log restarting apache on phabricator server, same pattern as described in T182832 [23:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:29] T182832: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 [23:58:13] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:59:35] Operations, ops-codfw, DBA, netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (ayounsi) Interface description added, port up and in the private vlan. No MAC seen on the switch side so far.