[00:00:05] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T0000). Please do the needful.
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:00:14] (PS1) RobH: cp4032 install params [puppet] - https://gerrit.wikimedia.org/r/385111 (https://phabricator.wikimedia.org/T178423)
[00:00:46] (CR) RobH: [C: 2] cp4032 install params [puppet] - https://gerrit.wikimedia.org/r/385111 (https://phabricator.wikimedia.org/T178423) (owner: RobH)
[00:01:26] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/385106/2 (duration: 00m 50s)
[00:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:58] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:02:05] (PS1) Chad: Swap wikimediafoundation.org over to using standard-docroot [puppet] - https://gerrit.wikimedia.org/r/385112
[00:02:10] no its not stupid icinga....
[00:02:19] (PS1) Chad: Remove last vestigates of weird wmfwiki-specific docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/385113
[00:02:21] itll clear that alert was a misconfigured idra
[00:04:18] cp4031 and cp4032 had the same ip in their config, and i just got lucky when i installed that cp4032 happened not to snag cp4031s ip on the network when i went to install cp4031 =P
[00:06:09] (PS9) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[00:06:35] no_justification: preparing the move to Wordpress ?:P
[00:06:41] (CR) jerkins-bot: [V: -1] bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[00:06:50] Nope, just old due diligence I've been slowly finishing off
[00:07:01] ah :)
[00:11:17] Operations, ops-ulsfo, Traffic, Patch-For-Review: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3695566 (RobH)
[00:12:33] (PS10) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[00:13:35] Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template prometheus/cluster_config.erb:
[00:13:38] hrmm
[00:14:03] Detail: Connection refused - connect(2) for "compiler02.puppet3-diffs.eqiad.wmflabs" port 8081
[00:17:51] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.09 ms
[00:42:59] Operations, ops-ulsfo, Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3695572 (RobH)
[01:14:29] (PS11) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[01:15:03] (CR) jerkins-bot: [V: -1] bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[01:25:22] (PS12) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[01:31:48] (CR) Dzahn: "01:24:43 wmf-style: total violations delta -3 now" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[01:42:26] (PS1) Dzahn: cyberbot::db: set datadir, bind_dir, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/385115
[01:43:32] (CR) jerkins-bot: [V: -1] cyberbot::db: set datadir, bind_dir, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/385115 (owner: Dzahn)
[01:44:16] (PS2) Dzahn: cyberbot::db: set datadir, bind_dir, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/385115
[01:46:04] (PS3) Dzahn: cyberbot::db: set datadir, bind_addr, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/385115
[01:46:39] (CR) Dzahn: [C: 2] cyberbot::db: set datadir, bind_addr, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/385115 (owner: Dzahn)
[01:58:03] (PS1) Chad: Adding deleteproject @ stable-2.13 [software/gerrit] - https://gerrit.wikimedia.org/r/385117
[01:58:03] (CR) Chad: [C: -2] "Need to upload to archiva first" [software/gerrit] - https://gerrit.wikimedia.org/r/385117 (owner: Chad)
[02:02:34] (CR) Chad: [C: 1] "Uploaded, but not going to deploy tonight" [software/gerrit] - https://gerrit.wikimedia.org/r/385117 (owner: Chad)
[02:24:41] (PS1) Dzahn: mysql: don't install precise sources.list unless on mariadb-server-5 [puppet] - https://gerrit.wikimedia.org/r/385119
[02:29:39] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.3) (duration: 09m 40s)
[02:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:18] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.145 second response time
[02:52:35] (PS2) Dzahn: mysql: don't install precise sources.list if on stretch [puppet] - https://gerrit.wikimedia.org/r/385119
[02:53:03] (CR) jerkins-bot: [V: -1] mysql: don't install precise sources.list if on stretch [puppet] - https://gerrit.wikimedia.org/r/385119 (owner: Dzahn)
[02:54:44] (PS3) Dzahn: mysql: don't install precise sources.list if on stretch [puppet] - https://gerrit.wikimedia.org/r/385119
[02:54:52] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 09m 28s)
[02:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:27] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.282 second response time
[03:01:55] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 19 03:01:54 UTC 2017 (duration 7m 2s)
[03:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:39] (PS1) Dzahn: cyberbot::db: don't use broken mysql::server class [puppet] - https://gerrit.wikimedia.org/r/385125
[03:14:07] (CR) jerkins-bot: [V: -1] cyberbot::db: don't use broken mysql::server class [puppet] - https://gerrit.wikimedia.org/r/385125 (owner: Dzahn)
[03:15:10] (PS2) Dzahn: cyberbot::db: don't use broken mysql::server class [puppet] - https://gerrit.wikimedia.org/r/385125
[03:16:36] (CR) Dzahn: [C: 2] cyberbot::db: don't use broken mysql::server class [puppet] - https://gerrit.wikimedia.org/r/385125 (owner: Dzahn)
[03:26:38] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 884.49 seconds
[03:38:06] !log cp3030 - upgraded vhtcpd to 0.1.0 for testing
[03:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:40] !log cp3034 - upgraded vhtcpd to 0.1.0 for testing
[03:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:41] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3695608 (awight) @Halfak Potentially good news: using your Celery 4 patch, I was able to start...
[03:49:08] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[04:06:48] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 48.75 seconds
[04:31:47] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[05:08:03] !log awight@tin Started deploy [ores/deploy@19b1851]: Push ORES w/ Celery 4 support, T178441
[05:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:13] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441
[05:08:21] !log awight@tin Finished deploy [ores/deploy@19b1851]: Push ORES w/ Celery 4 support, T178441 (duration: 00m 18s)
[05:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:16] FWIW, I’m stress testing the ores100*.eqiad.wmnet cluster for the next hour or so.
[05:13:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
[05:13:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0
[05:24:34] (CR) Zoranzoki21: [C: 1] Remove last vestigates of weird wmfwiki-specific docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/385113 (owner: Chad)
[05:27:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/385132
[05:27:31] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/385132
[05:29:11] !log awight@tin Started deploy [ores/deploy@19b1851]: Push ORES w/ Celery 4 support (take 2), T178441
[05:29:13] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/385132 (owner: Marostegui)
[05:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:20] T178441: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441
[05:30:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/385132 (owner: Marostegui)
[05:30:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1056" [mediawiki-config] - https://gerrit.wikimedia.org/r/385132 (owner: Marostegui)
[05:30:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/385133
[05:31:11] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/385133
[05:31:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 - T174509 (duration: 00m 50s)
[05:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:31] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:32:41] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/385133 (owner: Marostegui)
[05:33:45] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/385133 (owner: Marostegui)
[05:34:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 - T174509 (duration: 00m 50s)
[05:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:08] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/385133 (owner: Marostegui)
[05:36:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0
[05:39:38] (PS1) Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/385134 (https://phabricator.wikimedia.org/T174509)
[05:41:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/385134 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:41:54] !log Optimize pagelinks and templatelinks on db1051 - T174509
[05:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:04] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:42:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/385134 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:42:52] (CR) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/385134 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:44:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T174509 (duration: 00m 50s)
[05:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:12] !log Optimize recentchanges, pagelinks and templatelinks on db1064 - T174509 T177772
[05:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:22] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772
[05:46:43] !log Optimize recentchanges, pagelinks and templatelinks on db1102 for s6 - T174509 T177772
[05:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
[05:53:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0
[06:02:03] !log Compress revision table across db1044 databases (s3)
[06:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0
[06:16:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:20:12] (PS1) Marostegui: mariadb: Add db2092 to s1 and s3 as rc [puppet] - https://gerrit.wikimedia.org/r/385135 (https://phabricator.wikimedia.org/T178359)
[06:22:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:24:52] (CR) Marostegui: [C: 2] "This looks good: https://puppet-compiler.wmflabs.org/compiler02/8371/" [puppet] - https://gerrit.wikimedia.org/r/385135 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:27:48] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:48] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.011 second response time
[06:29:18] (PS1) Marostegui: db-codfw.php: Depool db2034 and db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/385136 (https://phabricator.wikimedia.org/T178359)
[06:30:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:32:49] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:33:15] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2034 and db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/385136 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:33:59] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:34:59] (Merged) jenkins-bot: db-codfw.php: Depool db2034 and db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/385136 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:36:13] (CR) jenkins-bot: db-codfw.php: Depool db2034 and db2036 [mediawiki-config] - https://gerrit.wikimedia.org/r/385136 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:36:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2034 and db2036 - T178359 (duration: 01m 08s)
[06:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:41] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:37:08] (CR) Paladox: [C: 1] "This would clean disk space up + have performance improvements :)" [software/gerrit] - https://gerrit.wikimedia.org/r/385117 (owner: Chad)
[06:37:26] (PS1) Marostegui: s1,s3.hosts: Add db2092 to s1 and s3 [software] - https://gerrit.wikimedia.org/r/385137 (https://phabricator.wikimedia.org/T178359)
[06:39:29] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server]
[06:40:33] ^ I will fix that
[06:41:19] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3695674 (Joe) >>! In T177276#3689890, @Addshore wrote: > 1 more thing to throw into the mix. > > Right now we h...
[06:41:33] <_joe_> is dhcp down?
[06:41:41] <_joe_> that can be a serious issue
[06:41:43] (PS1) Marostegui: install_server: Fix typo [puppet] - https://gerrit.wikimedia.org/r/385138
[06:41:45] _joe_: ^
[06:42:09] (CR) Giuseppe Lavagetto: [C: 2] install_server: Fix typo [puppet] - https://gerrit.wikimedia.org/r/385138 (owner: Marostegui)
[06:42:22] <_joe_> uhm why is CI so slow
[06:42:32] <_joe_> ok arrived
[06:43:19] fixed
[06:43:25] puppet run finely on install1002
[06:43:53] <_joe_> cool
[06:43:57] <_joe_> is dhcp up?
[06:43:58] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational
[06:44:17] yep
[06:44:28] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:48:42] (PS2) Marostegui: s1,s3.hosts: Add db2092 to s1 and s3 [software] - https://gerrit.wikimedia.org/r/385137 (https://phabricator.wikimedia.org/T178359)
[07:02:18] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational
[07:14:27] (PS1) Giuseppe Lavagetto: Switch to use docker-pkg for builds [docker-images/production-images] - https://gerrit.wikimedia.org/r/385143
[07:19:29] (CR) Marostegui: [C: 2] s1,s3.hosts: Add db2092 to s1 and s3 [software] - https://gerrit.wikimedia.org/r/385137 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:20:23] (Merged) jenkins-bot: s1,s3.hosts: Add db2092 to s1 and s3 [software] - https://gerrit.wikimedia.org/r/385137 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:30:18] (CR) Filippo Giunchedi: [C: 1] Synchronise jenkins package to thirdparty/ci [puppet] - https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) (owner: Muehlenhoff)
[07:31:54] !log Stop MySQL on db2034 and db2036 to clone db2092
[07:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:49] (CR) Filippo Giunchedi: "(Belated) +1" [puppet] - https://gerrit.wikimedia.org/r/384608 (owner: Ottomata)
[07:49:11] Operations, DBA, Gerrit, Release-Engineering-Team, Security: Gerrit: Convert gerrit's db caractor encoding from utf8 to utf8mb4 to prevent truncation of astral characters - https://phabricator.wikimedia.org/T153899#3695742 (Dzahn)
[07:49:16] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3695740 (Dzahn) Open>declined >>! In T145885#3663481, @Marostegui wrote: > Can this be closed then? I...
[07:52:32] (PS3) Elukey: netboot: prevent db110[78] to be reimaged [puppet] - https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405)
[07:52:35] (CR) Gehel: wdqs: LVS check should reach blazegraph and do a simple query (1 comment) [puppet] - https://gerrit.wikimedia.org/r/384938 (owner: Gehel)
[07:59:37] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3695772 (Paladox) But it will be fixed when we migrate to notedb.
[08:04:31] (PS2) Filippo Giunchedi: hieradata: enable syslog over tls for esams [puppet] - https://gerrit.wikimedia.org/r/383352 (https://phabricator.wikimedia.org/T136312)
[08:05:09] (PS3) Filippo Giunchedi: prometheus: add conntrack/entropy/edac collectors [puppet] - https://gerrit.wikimedia.org/r/382695 (https://phabricator.wikimedia.org/T177196)
[08:06:07] (CR) Filippo Giunchedi: [C: 2] prometheus: add conntrack/entropy/edac collectors [puppet] - https://gerrit.wikimedia.org/r/382695 (https://phabricator.wikimedia.org/T177196) (owner: Filippo Giunchedi)
[08:07:34] (PS1) Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/385149 (https://phabricator.wikimedia.org/T164488)
[08:09:07] (PS2) Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/385149 (https://phabricator.wikimedia.org/T164488)
[08:10:47] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/385149 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[08:11:54] (Merged) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/385149 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[08:12:09] (CR) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/385149 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[08:13:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T164488 (duration: 00m 50s)
[08:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:10] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[08:13:27] (CR) Giuseppe Lavagetto: [C: 1] wdqs: LVS check should reach blazegraph and do a simple query [puppet] - https://gerrit.wikimedia.org/r/384938 (owner: Gehel)
[08:13:52] !log Stop replication in sync on db1103 and db1072 - T164488
[08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:04] Operations, Traffic, Wikimedia-Logstash, Patch-For-Review, Services (watching): RESTBase logs disappeared from logstash - https://phabricator.wikimedia.org/T178078#3695783 (fgiunchedi) Open>Resolved a:fgiunchedi No problem @Pchelolo ! The problem has been fixed on the lvs side too...
[08:15:36] (CR) Giuseppe Lavagetto: [C: 1] Collect APC info [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/382728 (https://phabricator.wikimedia.org/T177196) (owner: Filippo Giunchedi)
[08:45:03] (PS5) Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457)
[08:45:50] (PS1) Marostegui: Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553)
[08:47:04] (CR) jerkins-bot: [V: -1] Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[08:47:36] Operations, media-storage, User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3525950 (fgiunchedi)
[08:48:18] (CR) Marostegui: [C: -2] "Do not merge - this is just a draft and an example" [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[08:48:39] (PS1) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923)
[08:49:11] (PS3) Filippo Giunchedi: hieradata: enable syslog over tls for esams [puppet] - https://gerrit.wikimedia.org/r/383352 (https://phabricator.wikimedia.org/T136312)
[08:50:14] (CR) Filippo Giunchedi: [C: 2] "After this only eqiad is left, which we'll rollout gradually" [puppet] - https://gerrit.wikimedia.org/r/383352 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[08:52:18] Operations, Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3696012 (faidon) We now have an APNIC account, and we were assigned today this IP space: - 103.102.166.0/24 - 2001:df2:e500::/48 There is an on-going thread with APNIC about some WHOIS oddi...
[08:53:25] Operations, Traffic: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3696016 (faidon)
[08:54:56] (CR) Jcrespo: [C: -1] "the 'hostname' is just a string alias, it needs to be defined down there on the "do not remove" section." [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[08:56:22] (CR) Marostegui: [C: -2] "> the 'hostname' is just a string alias, it needs to be defined down" [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[08:58:07] (CR) Filippo Giunchedi: [C: 2] Collect APC info [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/382728 (https://phabricator.wikimedia.org/T177196) (owner: Filippo Giunchedi)
[09:02:57] (CR) Jcrespo: [C: -1] "> > the 'hostname' is just a string alias, it needs to be defined" [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:04:50] (CR) Marostegui: [C: -2] "> > > the 'hostname' is just a string alias, it needs to be defined" [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:07:48] (PS1) Filippo Giunchedi: Release 0.4 [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/385155
[09:08:14] (PS2) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923)
[09:08:16] (CR) Filippo Giunchedi: [C: 2] Release 0.4 [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/385155 (owner: Filippo Giunchedi)
[09:08:57] (PS2) Jcrespo: Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:09:50] (CR) Jcrespo: [C: -1] "Something like this, but better done (it will not pass linter)." [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:10:02] (CR) jerkins-bot: [V: -1] Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:10:33] (CR) Jcrespo: [C: -1] Draft: Setting a multi-instance host (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:11:55] (CR) Marostegui: [C: -2] ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:15:48] (PS3) Marostegui: Draft: Setting a multi-instance host [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553)
[09:19:30] (CR) Filippo Giunchedi: prometheus::jmx_exporter_instance: use hostname rather than title (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[09:24:41] (PS6) Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457)
[09:24:43] (CR) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[09:26:07] (PS3) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923)
[09:27:59] (PS1) Muehlenhoff: Add Cumin alias for relforge* [puppet] - https://gerrit.wikimedia.org/r/385160
[09:29:49] (PS2) Muehlenhoff: Add Cumin alias for relforge* [puppet] - https://gerrit.wikimedia.org/r/385160
[09:30:22] (CR) Muehlenhoff: [C: 2] Add Cumin alias for relforge* [puppet] - https://gerrit.wikimedia.org/r/385160 (owner: Muehlenhoff)
[09:30:57] (CR) Elukey: "PCC: https://puppet-compiler.wmflabs.org/compiler02/8373/" [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[09:32:00] (PS1) Muehlenhoff: Update cumin aliases for restbase [puppet] - https://gerrit.wikimedia.org/r/385161
[09:32:35] (CR) Volans: Update cumin aliases for restbase (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385161 (owner: Muehlenhoff)
[09:33:51] (CR) Muehlenhoff: Update cumin aliases for restbase (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385161 (owner: Muehlenhoff)
[09:33:54] (CR) Muehlenhoff: [C: 2] Update cumin aliases for restbase [puppet] - https://gerrit.wikimedia.org/r/385161 (owner: Muehlenhoff)
[09:35:38] (CR) Jcrespo: "So this _should_ work, my concern is that there may be some code on mediawiki or outside (e.g. monitoring, TLS handling, etc.) that assume" [mediawiki-config] - https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: Marostegui)
[09:47:01] !log ppchelko@tin Started deploy [restbase/deploy@2001a66]: Enable summary on Cass 3 for all but wikipedia
[09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:19] (CR) ArielGlenn: [C: 2] recompress bz2 files in batches not restricted to subjobs [dumps] - https://gerrit.wikimedia.org/r/385104 (owner: ArielGlenn)
[09:48:36] (CR) Volans: [C: 1] "LGTM as a first version. Given that there are tests, please enable CI to run tox soon ;)" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: Giuseppe Lavagetto)
[09:48:43] !log installing pyjwt security updates
[09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:43] !log ariel@tin Started deploy [dumps/dumps@2b9326b]: improved job batches for rev content 7z recompression
[09:50:45] !log ariel@tin Finished deploy [dumps/dumps@2b9326b]: improved job batches for rev content 7z recompression (duration: 00m 02s)
[09:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:33] !log ppchelko@tin Finished deploy [restbase/deploy@2001a66]: Enable summary on Cass 3 for all but wikipedia (duration: 07m 32s)
[09:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:10] (CR) Filippo Giunchedi: [C: 1] prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[09:56:25] (PS3) Muehlenhoff: Synchronise jenkins package to thirdparty/ci [puppet] - https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583)
[09:57:53] (CR) Muehlenhoff: [C: 2] Synchronise jenkins package to thirdparty/ci [puppet] - https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) (owner: Muehlenhoff)
[10:01:40] (PS4) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923)
[10:02:49] (CR) Elukey: [C: 2] prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385153 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[10:02:52] (CR) Hashar: "recheck" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: Giuseppe Lavagetto)
[10:03:27] (CR) jerkins-bot: [V: -1] Port docker builder [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: Giuseppe Lavagetto)
[10:04:09] mobrovac: o/ - merging a change ---^ about prometheus jmx exporter that should fix kafka and be a no-op for cassandra, but please let me know if you see anything weird
[10:04:13] (later on)
[10:05:41] kk elukey, thnx for the heads-up
[10:05:47] (PS1) SimmeD: New logo for se.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/385165
[10:06:26] Operations, Goal, Patch-For-Review, User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3696222 (fgiunchedi)
[10:07:19] mobrovac: in fact the prometheus1004's config for restbase-dev is not happy, trying to fix it in a sec
[10:07:22] uff
[10:07:51] !log mobrovac@tin Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916
[10:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:58] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[10:08:12] uh
[10:08:31] (Abandoned) Alexandros Kosiaris: chmod uwsgi logs to 0644 [puppet] - https://gerrit.wikimedia.org/r/377794 (https://phabricator.wikimedia.org/T175736) (owner: Alexandros Kosiaris)
[10:09:26] (PS1) Elukey: Revert "prometheus::jmx_exporter_instance: use hostname rather than title" [puppet] - https://gerrit.wikimedia.org/r/385166
[10:09:59] (CR) Elukey: [C: 2] Revert "prometheus::jmx_exporter_instance: use hostname rather than title" [puppet] - https://gerrit.wikimedia.org/r/385166 (owner: Elukey)
[10:10:28] (CR) Hashar: Port docker builder (2 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: Giuseppe Lavagetto)
[10:11:38] mobrovac: just reverted, also cassandra-ng targets were affected, not sure why.. going to figure it out and re-post another change
[10:11:52] weird
[10:12:06] (PS3) Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - https://gerrit.wikimedia.org/r/384938
[10:13:17] (PS1) Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - https://gerrit.wikimedia.org/r/385167
[10:13:27] I'll restart from --^
[10:14:06] pretty sure that targets.push("#{instance['parameters']['hostname']} didn't work as expected :(
[10:14:49] because I am probably stupid
[10:14:55] !log upgrade prometheus-hhvm-exporter to 0.4-1
[10:14:57] (PS4) Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - https://gerrit.wikimedia.org/r/384938
[10:14:59] (PS1) Gehel: wdqs: add a endpoint to check service health / liveliness [puppet] - https://gerrit.wikimedia.org/r/385168
[10:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:42] (PS2) SimmeD: New logo for se.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550)
[10:23:09] Operations, Goal, Patch-For-Review, User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3696247 (fgiunchedi)
[10:27:09] Operations, Patch-For-Review, Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3696248 (fgiunchedi) @dzahn not afaik, though as you said udp2log is on its way out so we can live without packetlosslogtailer IMO
[10:33:58] (PS1) Muehlenhoff: Switch use of jenkins for >= stretch to thirdparty/ci [puppet] -
10https://gerrit.wikimedia.org/r/385169 [10:34:27] (03CR) 10jerkins-bot: [V: 04-1] Switch use of jenkins for >= stretch to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/385169 (owner: 10Muehlenhoff) [10:37:33] (03PS2) 10Muehlenhoff: Switch use of jenkins for >= stretch to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/385169 [10:38:02] (03CR) 10jerkins-bot: [V: 04-1] Switch use of jenkins for >= stretch to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/385169 (owner: 10Muehlenhoff) [10:39:28] (03PS3) 10Muehlenhoff: Switch use of jenkins for >= stretch to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/385169 [10:41:37] (03CR) 10Hashar: Create /run/nutcracker on stretch onwards (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [10:48:54] (03CR) 10Hashar: "That change got rebased and no more depended on "Support --bare in git::clone()" https://gerrit.wikimedia.org/r/#/c/383842/" [puppet] - 10https://gerrit.wikimedia.org/r/383843 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [10:49:36] (03CR) 10Hashar: [V: 031 C: 031] "This can be merged any time. That is for the CI Docker hosts and we already cherry picked the patches / confirmed they work fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/383842 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [10:55:25] (03PS9) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [10:55:27] (03PS3) 10Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 [10:57:31] (03CR) 10jerkins-bot: [V: 04-1] Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [10:57:50] (03CR) 10jerkins-bot: [V: 04-1] Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 (owner: 10Giuseppe Lavagetto) [10:59:40] <_joe_> ok there is at least one genuine problem in my tests :) [11:00:39] (03CR) 10Muehlenhoff: Create /run/nutcracker on stretch onwards (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [11:01:48] (03PS3) 10SimmeD: New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) [11:04:49] (03PS4) 10SimmeD: New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) [11:06:43] (03PS3) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) [11:06:49] (03PS10) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [11:07:32] (03PS4) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) 
[11:07:46] (03CR) 10jerkins-bot: [V: 04-1] prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar) [11:08:57] (03CR) 10Marostegui: [C: 031] netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:09:11] (03PS5) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) [11:10:13] (03CR) 10Hashar: "I have added to each of the exporters profile a new parameter:" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar) [11:10:37] !log upgrading mw1276-mw1279 (canary servers) to wikidiff2 1.5.1 [11:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:21] (03CR) 10Hashar: [C: 031] "Thank you Moritz :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [11:19:34] (03PS11) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [11:20:28] (03CR) 10jerkins-bot: [V: 04-1] Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [11:20:40] (03PS1) 10Elukey: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [11:21:11] (03CR) 10jerkins-bot: [V: 04-1] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:24:30] (03PS12) 10Giuseppe Lavagetto: Port docker builder 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [11:40:43] !log T177196 upload prometheus-postgres-exporter_0.2.0+ds-2 to apt.wikimedia.org/stretch-wikimedia/main and copied over to apt.wikimedia.org/jessie-wikimedia/main [11:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:50] T177196: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 [11:41:40] (03PS2) 10Elukey: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [11:42:05] (03PS4) 10Elukey: netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) [11:42:08] (03CR) 10jerkins-bot: [V: 04-1] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:44:31] (03CR) 10Volans: [C: 031] "LGTM" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [11:46:04] (03PS3) 10Elukey: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [11:46:34] (03CR) 10jerkins-bot: [V: 04-1] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:46:38] (03CR) 10Elukey: [C: 032] netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:49:59] (03PS4) 10Elukey: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 
(https://phabricator.wikimedia.org/T177405) [11:50:25] (03CR) 10Volans: "see inline" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:50:47] ahahhaah [11:51:06] it is still in progress Riccardo, but thanks anyway :D [11:51:44] elukey: there is no WIP in the commit message and I got added to the CR ;) [11:52:14] (03PS2) 10Alexandros Kosiaris: Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 [11:52:15] (03PS2) 10Alexandros Kosiaris: package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 [11:52:17] (03PS2) 10Alexandros Kosiaris: package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 [11:52:17] my bad then, will add it next time :) [11:52:18] thanks! [11:52:19] (03PS2) 10Alexandros Kosiaris: package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 [11:52:30] most of them should have been solved but I'll triple check [11:52:44] np, and just to clarify, I was automatically added because of my rules, so not your fault at all ;) [11:55:23] volans: about the multiple roles in site.pp: I know that they are not nice to see but in order to make everything tidy I should move everything to profiles, and I should touch a lot of other roles.. so my idea was to do an initial step and then maybe follow up later on [11:56:36] ok if it's part of a multi-stage refactor ;) [11:56:50] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3696419 (10Gilles) Is it too late for us to run something on that machine before it's fully decommissioned? We've been trying something in labs for T176361...
[11:57:38] I'll add it in the commit notes :) [11:57:59] naming is also to decide, I wanted to have something to start a review with Jaime/Manuel [12:02:52] (03PS5) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [12:03:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [12:06:47] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3696438 (10Addshore) Okay! mediawiki-phan-0.8 as an image name and then we can leave the tag for other versioning... [12:06:53] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3696439 (10hashar) 05Open>03Resolved There were some follow up patches required b... [12:07:17] (03CR) 10Muehlenhoff: [C: 031] package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 (owner: 10Alexandros Kosiaris) [12:07:53] (03CR) 10Alexandros Kosiaris: "haven't reviewed the tests, minor comment inline, rest LGTM" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [12:13:05] PROBLEM - Check systemd state on labsdb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:16:43] (03PS6) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [12:17:58] (03CR) 10Muehlenhoff: package_builder: Add buster as an environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385002 (owner: 10Alexandros Kosiaris) [12:29:44] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2034 and db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385176 [12:30:19] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2034 and db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385176 [12:32:41] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2034 and db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385176 (owner: 10Marostegui) [12:33:52] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2034 and db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385176 (owner: 10Marostegui) [12:35:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2034 and db2036 - T178359 (duration: 00m 50s) [12:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:11] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [12:36:09] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2034 and db2036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385176 (owner: 10Marostegui) [12:36:24] (03CR) 10Jayprakash12345: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [12:46:26] (03PS2) 10Filippo Giunchedi: profile: add check_smart to selectively enable ::smart class [puppet] - 10https://gerrit.wikimedia.org/r/383528 (https://phabricator.wikimedia.org/T86552) [12:47:12] (03CR) 10Filippo Giunchedi: [C: 032] profile: add check_smart to selectively enable ::smart class [puppet] - 10https://gerrit.wikimedia.org/r/383528 
(https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:47:26] (03PS2) 10Filippo Giunchedi: hieradata: rollout check_smart on a subset of codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/383529 (https://phabricator.wikimedia.org/T86552) [12:47:59] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout check_smart on a subset of codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/383529 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:51:46] (03PS1) 10Filippo Giunchedi: smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 [12:52:15] (03CR) 10jerkins-bot: [V: 04-1] smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 (owner: 10Filippo Giunchedi) [12:52:17] (03PS2) 10Filippo Giunchedi: smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 [12:52:40] (03CR) 10jerkins-bot: [V: 04-1] smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 (owner: 10Filippo Giunchedi) [12:54:38] (03PS3) 10Filippo Giunchedi: smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 [12:55:16] aligned arrows FTW [12:55:29] (03CR) 10Filippo Giunchedi: [C: 032] smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 (owner: 10Filippo Giunchedi) [12:55:34] (03PS4) 10Filippo Giunchedi: smart: require smartmontools for syslog logger [puppet] - 10https://gerrit.wikimedia.org/r/385178 [12:55:56] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger] [13:00:00] jouncebot: next [13:00:00] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1300) [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1300). [13:00:06] No GERRIT patches in the queue for this window AFAICS. [13:00:12] easy [13:00:56] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:45] :D [13:02:51] teh best SWAT evaz [13:08:34] mobrovac: second attempt to merge the jmx exporter code change, this time I'll follow a new deploy path [13:09:04] (03CR) 10Elukey: [C: 032] prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - 10https://gerrit.wikimedia.org/r/385167 (owner: 10Elukey) [13:09:09] (03PS2) 10Elukey: prometheus::jmx_exporter_instance: use hostname rather than title [puppet] - 10https://gerrit.wikimedia.org/r/385167 [13:09:27] kk El [13:09:29] elukey: [13:09:35] tab fail [13:12:25] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:15] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 74499 bytes in 0.321 second response time [13:20:06] !log upgrading mw2120-mw2147 to wikidiff2 1.5.1 [13:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:51] (03CR) 10Lokal Profil: "Should the size not be 135x135px?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [13:23:50] (03CR) 10SimmeD: "I took the images you requested at: https://commons.wikimedia.org/wiki/File:Wikimedia_Sverige_logo_-_vertical_square.svg and that is 580×5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [13:28:45] mobrovac: this time it went much better, restbase_ng is ok, but a couple of test hosts are not configured to be polled.. there is surely a puppet tweak that I need to do, let me know if it is a big issue [13:29:01] hosts are restbase-dev1004, xenon-a, praseodymium-a [13:29:49] ahhh maybe puppet didn't run on them [13:30:05] okok fixing in a sec [13:33:19] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3696585 (10Paladox) Your welcome :) [13:34:06] mobrovac: all good now [13:34:17] let me know if you see any weirdness in your metrics [13:34:53] \o/ [13:36:38] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3696591 (10Cervisiarius) Thanks so much, Rob. I followed the instructions, and it worked. You may delete the older key now. [13:46:33] elukey: great, thnx! 
[13:54:12] (03PS1) 10Filippo Giunchedi: smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) [13:54:35] (03CR) 10jerkins-bot: [V: 04-1] smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:58:01] (03PS2) 10Filippo Giunchedi: smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) [13:58:29] (03CR) 10jerkins-bot: [V: 04-1] smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:00:09] (03PS3) 10Filippo Giunchedi: smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) [14:01:32] (03PS4) 10Filippo Giunchedi: smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) [14:01:36] (03PS1) 10Muehlenhoff: Add new component thirdparty/confluent [puppet] - 10https://gerrit.wikimedia.org/r/385189 [14:02:25] (03CR) 10Filippo Giunchedi: [C: 032] smart: install hourly cron [puppet] - 10https://gerrit.wikimedia.org/r/385187 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:03:16] RECOVERY - Check systemd state on labsdb1004 is OK: OK - running: The system is fully operational [14:03:50] (03PS7) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [14:04:12] (03PS2) 10Gehel: wdqs: add a endpoint to check service health / liveliness [puppet] - 10https://gerrit.wikimedia.org/r/385168 [14:05:24] (03CR) 10Gehel: [C: 032] wdqs: add a endpoint to check service health / liveliness [puppet] - 10https://gerrit.wikimedia.org/r/385168 (owner: 10Gehel) [14:09:48] (03PS8) 10Elukey: [WIP] Introduce 
mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [14:11:27] (03PS1) 10Framawiki: Create Appendix NS on Burmese Wiktionary (mywikt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385190 (https://phabricator.wikimedia.org/T178545) [14:16:45] (03PS9) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [14:18:31] 10Operations, 10Puppet: Add require_package() variant with repository component to wmflib - https://phabricator.wikimedia.org/T178575#3696648 (10MoritzMuehlenhoff) [14:20:52] (03PS10) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [14:24:56] (03CR) 10Elukey: "First PCC: https://puppet-compiler.wmflabs.org/compiler02/8386/" [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [14:31:20] (03PS3) 10Andrew Bogott: compiler-update-facts: restore optional use of PUPPET_MASTER env [puppet] - 10https://gerrit.wikimedia.org/r/383857 (https://phabricator.wikimedia.org/T97081) [14:33:00] !log mforns@tin Started deploy [analytics/refinery@0c9ba04]: (no justification provided) [14:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:29] !log mforns@tin Finished deploy [analytics/refinery@0c9ba04]: (no justification provided) (duration: 03m 29s) [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:53] (03PS1) 10ArielGlenn: add option to recompress multistream index files that end in bz2.otherext [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/385194 [14:43:55] (03PS1) 10ArielGlenn: bump version to 0.0.7 [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/385195 [14:50:04] (03CR) 10Zoranzoki21: [C: 
031] Create Appendix NS on Burmese Wiktionary (mywikt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385190 (https://phabricator.wikimedia.org/T178545) (owner: 10Framawiki) [14:54:24] (03PS1) 10Muehlenhoff: Add thirdparty/confluent on stretch-based kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/385196 [14:54:58] (03CR) 10jerkins-bot: [V: 04-1] Add thirdparty/confluent on stretch-based kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/385196 (owner: 10Muehlenhoff) [14:57:28] HI. I found problem on srwiki like this https://phabricator.wikimedia.org/T178438 [14:58:13] How to resolve it? [15:02:21] !log mforns@tin Started deploy [analytics/refinery@0c9ba04]: (no justification provided) [15:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:28] !log mforns@tin Finished deploy [analytics/refinery@0c9ba04]: (no justification provided) (duration: 00m 07s) [15:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:17] (03PS2) 10Muehlenhoff: Add thirdparty/confluent on stretch-based kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/385196 [15:04:23] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3657517 (10elukey) a:05Cmjohnson>03elukey [15:07:25] (03CR) 10ArielGlenn: [V: 032 C: 032] add option to recompress multistream index files that end in bz2.otherext [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/385194 (owner: 10ArielGlenn) [15:08:06] 10Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3696790 (10jcrespo) [15:08:08] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3696789 (10jcrespo) [15:09:51] (03CR) 10ArielGlenn: [V: 032 C: 032] bump version to 0.0.7 
[dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/385195 (owner: 10ArielGlenn) [15:11:10] (03Draft2) 10Zoranzoki21: Corrected name for gadget popup on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385198 [15:11:49] HI, please deploy this patch: https://gerrit.wikimedia.org/r/#/c/385198/ [15:12:07] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385198 (owner: 10Zoranzoki21) [15:12:43] jouncebot: next [15:12:43] In 0 hour(s) and 47 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1600) [15:13:45] Zoranzoki21: ^ on that link above, see the "Evening SWAT" section? add your patch and IRC nickname to that and it should be deployed later today [15:14:03] This have to be deployed now [15:14:33] !log Stop replication in sync on db1103 and db1072 - https://phabricator.wikimedia.org/T164488 [15:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] Can you deploy it now, or I have to wait SWAT? [15:15:14] Zoranzoki21: deploys only happen at specified times, unless it would truly be an emergency [15:15:49] It is small emergency.. Gadget no work [15:15:49] Zoranzoki21: you'll have to wait for SWAT unless it's like "the site is broken"-level important [15:15:55] ok [15:16:33] Zoranzoki21: i am not an mw-deployer myself, so i don't wanna make the call, try asking nicely in #wikimedia-releng [15:16:43] just usually things are on calendar [15:16:51] (03CR) 10SimmeD: "Just tested the image on a wiki, it will not fit at all. 
It needs to be resized" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [15:17:15] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3692868 (10thcipriani) This also depends on the storag... [15:17:31] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3696811 (10thcipriani) p:05Triage>03Normal [15:20:16] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/8388/" [puppet] - 10https://gerrit.wikimedia.org/r/385196 (owner: 10Muehlenhoff) [15:21:44] (03PS1) 10ArielGlenn: fix usage message printing and exit codes in separate 7z batch script [dumps] - 10https://gerrit.wikimedia.org/r/385200 [15:23:47] (03PS2) 10Herron: puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/384762 (https://phabricator.wikimedia.org/T177843) [15:26:13] (03PS1) 10RobH: add jgleson to ldap users for inclusion in wmf group [puppet] - 10https://gerrit.wikimedia.org/r/385201 (https://phabricator.wikimedia.org/T178557) [15:26:29] (03CR) 10RobH: [C: 032] add jgleson to ldap users for inclusion in wmf group [puppet] - 10https://gerrit.wikimedia.org/r/385201 (https://phabricator.wikimedia.org/T178557) (owner: 10RobH) [15:27:05] (03CR) 10Lokal Profil: "> I took the images you requested at: https://commons.wikimedia.org/wiki/File:Wikimedia_Sverige_logo_-_vertical_square.svg" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [15:28:40] (03CR) 10Filippo Giunchedi: [C: 031] 
compiler-update-facts: restore optional use of PUPPET_MASTER env [puppet] - 10https://gerrit.wikimedia.org/r/383857 (https://phabricator.wikimedia.org/T97081) (owner: 10Andrew Bogott) [15:29:24] (03PS4) 10Andrew Bogott: compiler-update-facts: restore optional use of PUPPET_MASTER env [puppet] - 10https://gerrit.wikimedia.org/r/383857 (https://phabricator.wikimedia.org/T97081) [15:29:53] (03PS5) 10Lokal Profil: New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [15:30:50] (03CR) 10Andrew Bogott: [C: 032] compiler-update-facts: restore optional use of PUPPET_MASTER env [puppet] - 10https://gerrit.wikimedia.org/r/383857 (https://phabricator.wikimedia.org/T97081) (owner: 10Andrew Bogott) [15:34:14] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [dumps] - 10https://gerrit.wikimedia.org/r/385200 (owner: 10ArielGlenn) [15:36:04] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port redis statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T148637#3696835 (10fgiunchedi) 05Resolved>03Open [15:36:06] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3696836 (10fgiunchedi) [15:36:30] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port redis statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T148637#2728497 (10fgiunchedi) [15:36:32] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3696837 (10fgiunchedi) [15:36:40] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#2728497 (10fgiunchedi) [15:37:26] (03CR) 10Hoo man: "re Aaron" (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [15:38:49] (03CR) 10Hashar: "I tried to make print_wmf_style_violations to accept a custom format but eventually I give up :(" [puppet] - 10https://gerrit.wikimedia.org/r/382716 (owner: 10Hashar) [15:40:34] (03CR) 10SimmeD: "There we go. Now we have the correct image, and it's 135x135 so it fits. Should be ready to be merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [15:41:28] (03CR) 10Zoranzoki21: [C: 031] New logo for se.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385165 (https://phabricator.wikimedia.org/T178550) (owner: 10SimmeD) [15:42:19] (03CR) 10ArielGlenn: [C: 032] fix usage message printing and exit codes in separate 7z batch script [dumps] - 10https://gerrit.wikimedia.org/r/385200 (owner: 10ArielGlenn) [15:45:38] AaronSchulz: Re https://gerrit.wikimedia.org/r/384951 [15:45:42] (03PS1) 10ArielGlenn: don't rsync dump files that are incomplete, even to peers [puppet] - 10https://gerrit.wikimedia.org/r/385203 [15:45:59] hello, I have a quick question. This patch says that "Cannot Merge", to rebase and re-submit. Is it safe to use the rebase button or should I do it locally ? (https://gerrit.wikimedia.org/r/#/c/383858/2) [15:46:08] If you want, I can change it so that it uses a replica or errors out if the group is not known [15:46:10] I've never used the rebase button on gerrit [15:46:17] _joe_: I wanted to change puppet-lint format when running 'rake global:wmf_style' but eventually I have to give up. I could use some hint : ( https://gerrit.wikimedia.org/r/382716 [15:46:19] safe to try, if it fails it will just refuse [15:46:40] apergos: are you talking to me? [15:46:42] yes. [15:46:46] cool.. 
thanks [15:47:12] <_joe_> hashar: uhm what you wrote should work [15:47:26] it will fail if there's a conflict with one of the files you touched, and something in the changes coming after that your change needs to be rebased on top of, otherwise it will "just work" [15:47:39] _joe_: yes it does. But it duplicates code :D [15:47:44] <_joe_> ahah ok [15:47:53] <_joe_> yeah, lemme think about this [15:47:54] _joe_: if that is good enough though, we can merge and forget :] [15:47:58] <_joe_> I'm sure there is a solution [15:48:00] (03CR) 10Elukey: [C: 031] Add new component thirdparty/confluent [puppet] - 10https://gerrit.wikimedia.org/r/385189 (owner: 10Muehlenhoff) [15:48:13] <_joe_> no let's not make that rakefile more monstruous than it already is :P [15:48:29] the idea is to have a Jenkins job that invokes global:wmfstyle_guide and parse the result. But the Jenkins plugin expects a specific format :( [15:48:33] hehe [15:48:56] I think in the old rakefile I was doing some changes based on whether ENV['JENKINS_URL'] is set or not [15:49:37] (03PS2) 10Zoranzoki21: Add new component thirdparty/confluent [puppet] - 10https://gerrit.wikimedia.org/r/385189 (owner: 10Muehlenhoff) [15:49:45] (03CR) 10Zoranzoki21: [C: 031] Add new component thirdparty/confluent [puppet] - 10https://gerrit.wikimedia.org/r/385189 (owner: 10Muehlenhoff) [15:51:21] (03CR) 10ArielGlenn: [C: 032] don't rsync dump files that are incomplete, even to peers [puppet] - 10https://gerrit.wikimedia.org/r/385203 (owner: 10ArielGlenn) [15:51:29] (03PS3) 10Dmaza: Enable $wgAbuseFilterProfile on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383858 (https://phabricator.wikimedia.org/T177641) [15:51:42] (03CR) 10Jcrespo: "> gives us a master connection" [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [15:53:51] (03PS13) 10Zoranzoki21: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 
(https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [15:54:00] (03CR) 10Zoranzoki21: [C: 031] Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [15:54:29] (03CR) 10Zoranzoki21: [C: 031] Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 (owner: 10Paladox) [15:54:42] (03PS5) 10Zoranzoki21: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) (owner: 10Paladox) [15:54:51] (03CR) 10Zoranzoki21: [C: 031] Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) (owner: 10Paladox) [15:55:09] (03PS8) 10Zoranzoki21: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [15:55:34] (03PS3) 10Zoranzoki21: cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) (owner: 10RobH) [15:55:39] (03CR) 10Zoranzoki21: [C: 031] cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) (owner: 10RobH) [15:55:48] (03PS2) 10Zoranzoki21: ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) (owner: 10Addshore) [15:55:53] (03CR) 10Zoranzoki21: [C: 031] ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) (owner: 10Addshore) [16:00:04] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1600). 
Please do the needful. [16:00:04] hoo: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:02:18] there seems to be unanswered comments to https://gerrit.wikimedia.org/r/#/c/384951 no? [16:03:30] (03CR) 10Aaron Schulz: Allow specifying --group to sql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [16:03:44] elukey: Yeah [16:03:45] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3696906 (10jcrespo) [16:04:44] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3609511 (10jcrespo) CC @Marostegui Maybe we can setup an unpuppetized copy of x1 from dbstore2002 on dbstore1002? [16:12:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [16:13:34] (03CR) 10Alexandros Kosiaris: package_builder: Add buster as an environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385002 (owner: 10Alexandros Kosiaris) [16:13:36] 10Operations, 10Traffic: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3696962 (10BBlack) We can do revdns and basic puppet address space commits here or in T156027 as appropriate I think (maybe most of the puppet-level stuff over there). One thing it would be nice t... 
[16:13:40] (03PS3) 10Alexandros Kosiaris: Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 [16:13:42] (03PS3) 10Alexandros Kosiaris: package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 [16:13:44] (03PS3) 10Alexandros Kosiaris: package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 [16:13:46] (03PS3) 10Alexandros Kosiaris: package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 [16:13:48] (03PS2) 10Hoo man: Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 [16:16:23] One ask: who deploying patches today in swat [16:16:29] which is current [16:18:19] Zoranzoki21: ill take care of the access request, if thats what you are asking about? [16:18:22] its not part of swat [16:18:34] i just saw the emails someone rebased by patch ;] [16:18:43] I ask for https://gerrit.wikimedia.org/r/#/c/385198/ [16:18:48] ahh, nm! [16:18:49] Noone added +2 [16:22:42] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3696991 (10Johan) Very well. "You may be able to install" is actually fairly complicated to parse (three verbs, not obvious to non-f... [16:23:08] Hmmmm [16:23:30] I can not to believe.. [16:25:12] (03CR) 10RobH: [C: 032] "No objections are noted on T178406, so I'm merging this live." [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) (owner: 10RobH) [16:27:26] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3697034 (10RobH) 05stalled>03Resolved No objections have been noted after 3 business days, so I've merged this request live. If you need any assistance w... 
[16:28:21] (03CR) 10Zoranzoki21: "recheck :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385198 (owner: 10Zoranzoki21) [16:31:43] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3697047 (10herron) [16:33:11] AaronSchulz: elukey: Do you see the sql patch going forward now or is there work left to do? [16:33:15] I tested it locally on terbium [16:33:58] (03CR) 10Herron: [C: 032] puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/384762 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [16:34:04] (03PS3) 10Herron: puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/384762 (https://phabricator.wikimedia.org/T177843) [16:35:11] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10RobH) [16:37:12] (03PS1) 10Herron: Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx" [puppet] - 10https://gerrit.wikimedia.org/r/385205 [16:37:52] (03CR) 10Herron: [C: 032] Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx" [puppet] - 10https://gerrit.wikimedia.org/r/385205 (owner: 10Herron) [16:37:55] PROBLEM - Check systemd state on nitrogen is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:38:35] (03PS1) 10BBlack: cp4029-32: add to cache::text (but not ipsec yet) [puppet] - 10https://gerrit.wikimedia.org/r/385206 (https://phabricator.wikimedia.org/T178423) [16:38:37] (03PS1) 10BBlack: cp4029-32: add to cache::text::nodes [puppet] - 10https://gerrit.wikimedia.org/r/385207 (https://phabricator.wikimedia.org/T178423) [16:38:55] RECOVERY - Check systemd state on nitrogen is OK: OK - running: The system is fully operational [16:39:32] (03CR) 10BBlack: [C: 032] cp4029-32: add to cache::text (but not ipsec yet) [puppet] - 10https://gerrit.wikimedia.org/r/385206 (https://phabricator.wikimedia.org/T178423) (owner: 10BBlack) [16:39:38] 10Operations, 10ops-ulsfo, 10Traffic: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592#3697086 (10RobH) [16:41:16] 10Operations, 10Traffic: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3697109 (10faidon) Yup, that's fine, as is creating the zones in the DNS and puppet repository (but not do the reverse delegation). 
[16:42:28] (03PS1) 10Hashar: contint: move blubber from docker to pipeline profile [puppet] - 10https://gerrit.wikimedia.org/r/385208 [16:44:36] (03CR) 10Hashar: [V: 031] "https://puppet-compiler.wmflabs.org/compiler02/8389/" [puppet] - 10https://gerrit.wikimedia.org/r/385208 (owner: 10Hashar) [16:47:34] (03PS1) 10Herron: Revert "Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx"" [puppet] - 10https://gerrit.wikimedia.org/r/385209 [16:48:11] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx"" [puppet] - 10https://gerrit.wikimedia.org/r/385209 (owner: 10Herron) [16:49:29] (03PS1) 10Hashar: contint: include python3 on CI masters [puppet] - 10https://gerrit.wikimedia.org/r/385210 [16:50:16] !log contint1001: installed python3 (pending https://gerrit.wikimedia.org/r/385210 ) [16:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:15] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3697130 (10hashar) [16:54:05] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/e6ae8b80f4ba373fd89095d97c351cdb36fafbec11592ca825a68d2a37786786/merged is not accessible: Permission denied [16:56:23] ^ there's a ticket for that already [16:56:41] check_disk should probably drop the -A [16:56:47] (03PS2) 10Herron: puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/385209 (https://phabricator.wikimedia.org/T177843) [16:56:50] and only use it for the special case it was added for [16:57:09] so that it doesnt check ALL mounts including that docker overlay [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:12] (03CR) 10Chad: [C: 031] "We don't really need the disk space savings and the performance impact is pretty minimal. It's mostly for my own sanity so we can make the" [software/gerrit] - 10https://gerrit.wikimedia.org/r/385117 (owner: 10Chad) [17:00:14] (03CR) 10Aaron Schulz: [C: 031] Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [17:02:03] (03PS2) 10Hashar: contint: include python3 on CI masters [puppet] - 10https://gerrit.wikimedia.org/r/385210 (https://phabricator.wikimedia.org/T178594) [17:05:01] Why this is not deployed? https://gerrit.wikimedia.org/r/#/c/385198/ [17:05:49] Because it hasn't been merged yet, obviously [17:07:51] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385214 (https://phabricator.wikimedia.org/T128546) [17:08:29] Zoranzoki21: it's in one hour [17:08:43] (no ORES deployments today) [17:09:02] Zoranzoki21: the current deploy slot is for services, the one that has your patch starts after it [17:09:12] [19:05] Because it hasn't been merged yet, obviously.. nice reply [17:09:59] I try, I try :) [17:11:28] I know to is not deployed because is not merged [17:12:05] RECOVERY - Disk space on contint1001 is OK: DISK OK [17:12:07] ooops [17:12:25] I thinked to I added in puppet swat.. I now see to it is added for morning swat [17:12:44] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3697234 (10debt) Taking off the #discovery-search tag, as there isn't much we can do, but we'll continue to monitor using the #discovery tag. 
[17:15:05] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/713ae5d770fa652eea12a398a27d6f7fbe926b4da65182b4f02269d4e44f26b6/merged is not accessible: Permission denied [17:17:19] it's actually neither puppet nor morning [17:17:24] it's in 45 minutes from now [17:17:37] i guess it's too late now though .. sigh [17:20:06] (03PS1) 10Chad: group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385215 [17:21:31] (03CR) 10Paladox: [C: 031] "> We don't really need the disk space savings and the performance" [software/gerrit] - 10https://gerrit.wikimedia.org/r/385117 (owner: 10Chad) [17:22:05] RECOVERY - Disk space on contint1001 is OK: DISK OK [17:25:06] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [17:26:06] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [17:29:05] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/b4ca3300c880c818b7096f889e4e14d7ea5c29355bbd767166739299304cf2fb/merged is not accessible: Permission denied [17:31:05] RECOVERY - Disk space on contint1001 is OK: DISK OK [17:43:08] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3697355 (10herron) [17:47:39] (03Abandoned) 10Zoranzoki21: Corrected name for gadget popup on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385198 (owner: 10Zoranzoki21) [17:50:59] (03PS1) 10RobH: remove bob west's old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/385216 (https://phabricator.wikimedia.org/T177889) [17:51:44] (03CR) 10RobH: [C: 032] remove bob west's old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/385216 (https://phabricator.wikimedia.org/T177889) (owner: 10RobH) [17:56:18] 10Operations, 10Ops-Access-Requests, 
10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3697379 (10RobH) 05Open>03Resolved a:05Cervisiarius>03None Older key removed, resolving task! [17:57:06] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/0a44b085d526ad80ed40ad34793b3200ae67d5b256ddb502377c455f48e0a784/merged is not accessible: Permission denied [17:58:26] (03PS2) 10Dzahn: udp2log: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382913 (https://phabricator.wikimedia.org/T177225) [17:58:39] (03CR) 10Dzahn: [C: 032] "https://phabricator.wikimedia.org/T145659#3683792" [puppet] - 10https://gerrit.wikimedia.org/r/382913 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1800). [18:00:04] RoanKattouw, Zoranzoki21, DMaza, and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:36] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3697388 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4029.ulsfo.wmnet', 'cp4030.ulsfo.wmn... 
[18:01:10] here [18:01:12] HEre [18:01:29] Here [18:04:35] I can SWAT [18:06:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383858 (https://phabricator.wikimedia.org/T177641) (owner: 10Dmaza) [18:10:07] (03PS7) 10Paladox: Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) [18:10:51] (03PS8) 10Paladox: Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) [18:11:05] (03Merged) 10jenkins-bot: Enable $wgAbuseFilterProfile on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383858 (https://phabricator.wikimedia.org/T177641) (owner: 10Dmaza) [18:11:16] (03CR) 10jenkins-bot: Enable $wgAbuseFilterProfile on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383858 (https://phabricator.wikimedia.org/T177641) (owner: 10Dmaza) [18:11:18] (03PS1) 10Dzahn: xenon: decom ganglia from mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/385218 (https://phabricator.wikimedia.org/T177225) [18:11:33] (03CR) 10Chad: [C: 031] Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [18:12:02] DMaza: Enable $wgAbuseFilterProfile on ptwiki is live on mwdebug1002, check please [18:12:12] checking [18:14:49] (03CR) 10Paladox: "puppet passes" [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [18:17:27] RoanKattouw: Echo change for wmf.4 is live on mwdebug1002, check please [18:18:23] thcipriani: Works great [18:18:33] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3651165 (10faidon) @Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though. 
[18:18:35] going live with the echo change [18:20:37] !log thcipriani@tin Synchronized php-1.31.0-wmf.4/extensions/Echo/modules/styles/mw.echo.ui.PageNotificationsOptionWidget.less: SWAT: [[gerrit:385110|mw.echo.ui.PageNotificationsOptionWidget: Fix CSS after changes in OOjs UI]] T178439 (duration: 00m 51s) [18:20:44] ^ RoanKattouw live now [18:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:45] T178439: [1.31.0-wmf.4 regression] Special:Notification page misplaces icons - https://phabricator.wikimedia.org/T178439 [18:20:56] 10Operations, 10monitoring, 10Patch-For-Review: Add PDU redundancy server/router/switch checks in Icinga - https://phabricator.wikimedia.org/T109903#3697494 (10faidon) 05Open>03Resolved For switches/routers we have alerts on Juniper's system/chassis alarms, which we know trips when they lose PDU redundan... [18:21:55] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Multiple systems in esams OE10 showing PSU failures - https://phabricator.wikimedia.org/T177228#3651185 (10faidon) IIRC, @mark said that the rack in question doesn't have a secondary PDU. New PDUs for esams are in the budget this year, so I guess this is plan... [18:24:55] (03CR) 10Dzahn: [C: 032] Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [18:25:07] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3697507 (10BBlack) 05Open>03Resolved a:03BBlack Nothing really to do here, except remember it if new power issues arise with the new hosts... [18:25:36] thanks [18:25:54] thcipriani: looks good here [18:26:05] DMaza: ok, going live :) [18:26:12] thank you [18:26:39] i will try testing that ^^ with gerrit 2.14 (as i doint have 2.13 installed) though the plugin should mostly act the same. 
[18:26:40] paladox: Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/lfs.config]/ensure:... done [18:26:43] no_justification ^^ [18:26:44] :) [18:27:01] thanks mutante [18:27:34] (03PS2) 10Dzahn: xenon: decom ganglia from mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/385218 (https://phabricator.wikimedia.org/T177225) [18:27:53] * paladox rebuilds gerrit 2.14. [18:28:31] !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:383858|Enable $wgAbuseFilterProfile on ptwiki]] T177641 (duration: 00m 50s) [18:28:37] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3697518 (10Cmjohnson) They all have the same problem. I swapped PSU's for both an1036 and 1037 yesterday but still show the failure. The new psu's are failing after a new one is... [18:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:39] ^ DMaza live everywhere [18:28:39] T177641: Enable AbuseFilter per-filter profiling on Portuguese Wikipedia & monitor if there is a performance impact - https://phabricator.wikimedia.org/T177641 [18:28:55] thcipriani: thank you [18:29:02] yw :) [18:29:27] thanks for the patch/checking functionality [18:29:38] mutante: That lfs config is safe because we don't use it yet :) [18:29:42] Lets us prep [18:29:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385214 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:30:10] no_justification: yea, i figured, saw your comments on the ticket about the chromium/large files [18:30:36] Yeah. 
Ideally I want to use some kind of shared storage like Swift instead of the local FS, but this will do for now and unblocks a couple of teams [18:30:51] Although, we probably want to rsync that folder to the slaves, it won't replicate by default [18:31:05] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385214 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:31:13] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385214 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:31:24] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3697535 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4031.ulsfo.wmnet', 'cp4030.ulsfo.wmnet', 'cp4032.ulsfo.wmnet', 'cp4029.ulsfo.wmnet'] `... [18:31:30] no_justification: the whole /srv/gerrit/plugins/ ? 
[18:31:37] The lfs dir [18:31:47] So /srv/gerrit/plugins/lfs [18:31:49] ok [18:31:56] But plugins/ generally is probably ok :) [18:32:15] jan_drewniak: portals update is live on mwdebug1002, check please [18:33:03] thcipriani: yup, looks good :) [18:33:09] k, going live [18:34:33] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [18:35:31] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:385214|Bumping portals to master]] T128546 (duration: 00m 50s) [18:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:39] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [18:36:23] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:385214|Bumping portals to master]] T128546 (duration: 00m 51s) [18:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:31] ^ jan_drewniak all done [18:36:58] thcipriani: yup, thanks! 
[18:37:05] yw :) [18:38:25] (03CR) 10Dzahn: [C: 032] xenon: decom ganglia from mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/385218 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:38:27] (03PS20) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [18:38:29] (03PS18) 10Paladox: Gerrit: Upgrading gerrit to 2.14.6-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [18:44:24] RECOVERY - Disk space on contint1001 is OK: DISK OK [18:49:10] (03PS2) 10BBlack: cp4029-32: add to cache::text::nodes [puppet] - 10https://gerrit.wikimedia.org/r/385207 (https://phabricator.wikimedia.org/T178423) [18:49:16] (03CR) 10BBlack: [V: 032 C: 032] cp4029-32: add to cache::text::nodes [puppet] - 10https://gerrit.wikimedia.org/r/385207 (https://phabricator.wikimedia.org/T178423) (owner: 10BBlack) [18:49:23] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/4fed7de4567b78fc7bb502bece45d1794120189a2ab399a875ad5dfab5e40eb8/merged is not accessible: Permission denied [18:51:17] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10hashar) Thanks @Faidon for the link! Jenkins is never out of surprise. We do not rely on that auto discovery feature and I will get it disabled in the daemon. 
[18:53:34] ACKNOWLEDGEMENT - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/4fed7de4567b78fc7bb502bece45d1794120189a2ab399a875ad5dfab5e40eb8/merged is not accessible: Permission denied daniel_zahn https://phabricator.wikimedia.org/T178454 [18:54:16] 10Operations, 10Continuous-Integration-Infrastructure, 10netops, 10Jenkins, 10Release-Engineering-Team (Kanban): Disable Jenkins autodiscovery system - https://phabricator.wikimedia.org/T178608#3697616 (10hashar) [18:55:22] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3692868 (10Dzahn) I disabled Icinga notifications for... [18:56:50] 10Operations, 10ops-ulsfo: WMF7218 missing serial in racktables - https://phabricator.wikimedia.org/T178609#3697637 (10RobH) [18:57:53] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4031_v4, cp4031_v6 [18:58:37] (03PS1) 10ArielGlenn: bump version to 0.0.7 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/385221 [18:58:53] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 52 ESP OK [19:00:04] no_justification: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:02:15] (03CR) 10ArielGlenn: [C: 032] bump version to 0.0.7 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/385221 (owner: 10ArielGlenn) [19:06:06] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3697696 (10ayounsi) @Gehel See Faidon's comment on T167842#3353703. Is there any reasons to have JMX agent autodiscovery enabled? 
[19:06:30] (03CR) 10Chad: [C: 032] group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385215 (owner: 10Chad) [19:07:38] (03Merged) 10jenkins-bot: group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385215 (owner: 10Chad) [19:07:57] (03CR) 10jenkins-bot: group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385215 (owner: 10Chad) [19:09:37] (03CR) 10Jcrespo: [C: 031] Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [19:12:50] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.4 [19:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3124: Connection refused [19:15:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp4030 is CRITICAL: connect to address 10.128.0.130 and port 3124: Connection refused [19:15:22] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:15:22] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:15:22] PROBLEM - Check systemd state on cp4031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:15:23] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3124: Connection refused [19:15:23] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp4032 is CRITICAL: connect to address 10.128.0.132 and port 3124: Connection refused [19:15:31] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:17:11] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3125: Connection refused [19:17:11] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp4030 is CRITICAL: connect to address 10.128.0.130 and port 3125: Connection refused [19:17:11] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3125: Connection refused [19:17:11] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp4032 is CRITICAL: connect to address 10.128.0.132 and port 3125: Connection refused [19:18:08] ^ ignore those, sorry [19:18:21] installation of depooled new hosts, problems have gone past the standard downtime :P [19:18:31] (03PS1) 10Volans: wmf-auto-reimage: use default timeout for uptime check [puppet] - 10https://gerrit.wikimedia.org/r/385228 [19:18:51] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3126: Connection refused [19:18:52] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp4030 is CRITICAL: connect to address 10.128.0.130 and port 3126: Connection refused [19:18:52] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3126: Connection refused [19:18:52] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp4032 is CRITICAL: connect to address 10.128.0.132 and port 3126: Connection refused [19:19:23] is it really going to repeat that x8? 
:P [19:19:46] (03PS1) 10Dzahn: ganglia: also remove systemd unit files on decom [puppet] - 10https://gerrit.wikimedia.org/r/385229 (https://phabricator.wikimedia.org/T177225) [19:20:21] downtimed them [19:20:47] (03CR) 10BBlack: [C: 031] wmf-auto-reimage: use default timeout for uptime check [puppet] - 10https://gerrit.wikimedia.org/r/385228 (owner: 10Volans) [19:20:51] 10Operations, 10ops-eqdfw, 10DC-Ops: add missing asset tag and correct location in rack for cr1-eqdfw - https://phabricator.wikimedia.org/T178613#3697745 (10RobH) [19:22:38] (03CR) 10Volans: [C: 032] wmf-auto-reimage: use default timeout for uptime check [puppet] - 10https://gerrit.wikimedia.org/r/385228 (owner: 10Volans) [19:23:30] 10Operations, 10Continuous-Integration-Infrastructure, 10netops, 10Jenkins, 10Release-Engineering-Team (Kanban): Disable Jenkins autodiscovery system - https://phabricator.wikimedia.org/T178608#3697773 (10hashar) a:03hashar [19:24:35] (03PS1) 10Hashar: jenkins: disable auto-discovery [puppet] - 10https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) [19:26:54] (03PS2) 10Dzahn: ganglia: also remove systemd unit files on decom [puppet] - 10https://gerrit.wikimedia.org/r/385229 (https://phabricator.wikimedia.org/T177225) [19:27:07] (03CR) 10Dzahn: [C: 032] ganglia: also remove systemd unit files on decom [puppet] - 10https://gerrit.wikimedia.org/r/385229 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:27:45] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler02/8390/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) (owner: 10Hashar) [19:29:56] (03PS1) 10BBlack: varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 [19:30:24] (03CR) 10jerkins-bot: [V: 04-1] varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 (owner: 10BBlack) [19:31:00] (03PS2) 10BBlack: varnish: integrate sec into 
main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 [19:31:22] (03CR) 10Herron: [C: 032] puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/385209 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [19:31:27] (03PS3) 10Herron: puppetdb: temporarily allow puppetcompiler1001 to reach puppetdb nginx [puppet] - 10https://gerrit.wikimedia.org/r/385209 (https://phabricator.wikimedia.org/T177843) [19:31:29] (03CR) 10jerkins-bot: [V: 04-1] varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 (owner: 10BBlack) [19:32:15] (03PS3) 10BBlack: varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 [19:32:55] (03PS4) 10BBlack: varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 [19:32:59] (03CR) 10BBlack: [V: 032 C: 032] varnish: integrate sec into main unit file [puppet] - 10https://gerrit.wikimedia.org/r/385231 (owner: 10BBlack) [19:33:10] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3697822 (10Gehel) I see no reason to have jolokia even accessible on the network, it should be local only. I'll have a look into our config (to be honest I don't know much about jolokia, but that's a good occasion to dig i... 
[19:36:44] (03PS1) 10BBlack: varnish: bugfix previous bugfix [puppet] - 10https://gerrit.wikimedia.org/r/385232 [19:37:09] (03CR) 10BBlack: [C: 032] varnish: bugfix previous bugfix [puppet] - 10https://gerrit.wikimedia.org/r/385232 (owner: 10BBlack) [19:40:42] (03PS1) 10BBlack: varnish: remove temporary workaround [puppet] - 10https://gerrit.wikimedia.org/r/385233 [19:48:19] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3697889 (10ayounsi) 05Open>03Resolved [19:48:22] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:48:41] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:48:41] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp4030 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:48:41] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:49:01] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:49:01] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp4030 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time [19:49:01] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time [19:49:01] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp4032 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time [19:49:21] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp4030 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time [19:49:21] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 
0.157 second response time [19:49:21] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp4032 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.157 second response time [19:49:41] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp4032 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.163 second response time [19:49:59] (03PS3) 10Dzahn: ganglia: also remove systemd unit files on decom [puppet] - 10https://gerrit.wikimedia.org/r/385229 (https://phabricator.wikimedia.org/T177225) [19:53:34] !log reboot cp4029-32 (initial reboot post-puppetization) [19:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:51] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational [19:56:51] RECOVERY - Check systemd state on cp4031 is OK: OK - running: The system is fully operational [19:56:51] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational [19:56:52] RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational [19:58:51] (03CR) 10BBlack: [C: 032] varnish: remove temporary workaround [puppet] - 10https://gerrit.wikimedia.org/r/385233 (owner: 10BBlack) [19:58:55] (03PS2) 10BBlack: varnish: remove temporary workaround [puppet] - 10https://gerrit.wikimedia.org/r/385233 [19:58:57] (03CR) 10BBlack: [V: 032 C: 032] varnish: remove temporary workaround [puppet] - 10https://gerrit.wikimedia.org/r/385233 (owner: 10BBlack) [19:59:43] !log mforns@tin Started deploy [analytics/refinery@0c9ba04]: deploying refinery to fix banner activity false alarms [19:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:14] !log mforns@tin Finished deploy [analytics/refinery@0c9ba04]: deploying refinery to fix banner activity false alarms (duration: 00m 14s) [20:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:23] (03CR) 10Odder: "Many thanks for the quick merge, MaxSem; I can see 
the logo looks fine on high-density screens now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [20:07:30] 10Operations, 10netops, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3697928 (10ayounsi) As of today, 180 BGP sessions use the old AS# and 216 use the new one. Timeline for decommissioning the old AS# (d... [20:11:36] 10Operations, 10netops, 10Patch-For-Review: Tracking task for network syslog messages - https://phabricator.wikimedia.org/T174397#3697931 (10ayounsi) 05Open>03Resolved Will update that task if needed in the future. [20:18:11] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [20:19:10] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3697944 (10Paladox) I have tested the plugin. The configuration above should work ^^ :). then we have to initialise the repo with git lfs install. Then do htt... [20:44:39] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [20:46:49] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1950 bytes in 0.097 second response time [20:47:33] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3698005 (10Dzahn) @Paladox I found this https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&orgId=1&var-server=cobalt&var-network=eth0 I went to dashboards, saw "Server Board"... [20:47:53] paladox: https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&orgId=1&var-server=cobalt&var-network=eth0 [20:48:21] paladox: does that resolve https://phabricator.wikimedia.org/T178401 ? 
[20:48:28] ah thanks yeh it does [20:48:34] :) [20:48:55] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3698009 (10Dzahn) [20:48:57] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3698006 (10Dzahn) 05Open>03Resolved a:03Dzahn 16:48 < mutante> paladox: does that resolve https://phabricator.wikimedia.org/T178401 ? 16:48 < paladox> ah thanks yeh it does [20:48:58] :) [20:49:59] you can select all of them, just "hafnium" was pre-selected for me [20:50:24] :) [21:02:08] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:09] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:19] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:28] PROBLEM - puppet last run on roentgenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:29] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:38] PROBLEM - puppet last run on labcontrol1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:38] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:39] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:19] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:04:29] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:38] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:38] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:38] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:38] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:38] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:39] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:48] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:48] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:48] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:48] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:51] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:58] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on mw1317 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:59] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:00] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:00] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:01] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:09] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:09] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:09] PROBLEM - puppet last run on conf1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:18] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:05:19] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:19] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:19] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:28] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:28] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:28] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:28] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:28] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:29] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:29] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:30] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:30] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:31] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:05:31] PROBLEM - puppet last run on wtp1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:32] PROBLEM - puppet last run on wtp1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:32] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:33] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:33] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on ores1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:02] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:08] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:08] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:08] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:08] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:08] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:09] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:14] ouch [21:06:18] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:18] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:06:19] PROBLEM - puppet last run on mw1318 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:28] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:29] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:29] PROBLEM - puppet last run on labvirt1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:29] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:29] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:29] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:38] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:38] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:38] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:46] is it ok to quiet icinga-wm for a bit? [21:06:48] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:48] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:48] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:06:48] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:48] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:49] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:49] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:58] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:58] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:58] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:58] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:59] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:59] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:06:59] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:00] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:00] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:07:08] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:08] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:08] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:08] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:08] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:09] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:09] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:10] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:10] PROBLEM - puppet last run on ms1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:11] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:11] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:28] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:28] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:07:29] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:29] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:29] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:38] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:38] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:39] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:39] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:58] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:09] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:19] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:34] mutante: around to take a look why puppet is failing? [21:14:06] or any other wmf op? [21:21:20] Sagan: I'm looking [21:22:01] andrewbogott: thx. 
just ping me when I can put the echo of icinga on again (or if you prefer, you can turn it off yourself, then I'd remove the quiet, works as well) [21:22:45] !log restarting puppetdb on nitrogen [21:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:50] !log restarting nginx on nitrogen [21:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:46] Sagan: should be fixed, although the recovery alerts will be noisy [21:26:04] andrewbogott: so wait for about 10, 15 minutes? [21:26:10] 15 should do it [21:26:45] ok [21:28:17] 10Operations, 10Availability (Multiple-active-datacenters), 10MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017), 10Patch-For-Review, and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3698108 (10CCicalese_WMF) [21:29:08] 10Operations, 10MediaWiki-Platform-Team, 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3698116 (10CCicalese_WMF) [21:29:25] 10Operations, 10MediaWiki-Platform-Team, 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3698119 (10CCicalese_WMF) [21:34:54] 10Operations, 10Puppet: puppetdb failures - https://phabricator.wikimedia.org/T178625#3698143 (10Andrew) [21:39:59] let's have a look if they are gone [21:40:36] still a lot [21:47:57] 232, down from 650+ [21:48:04] still so much? [21:48:59] what's with the stack overflows in remex-html [21:49:21] Fatal error: Stack overflow in /srv/mediawiki/php-1.31.0-wmf.4/vendor/wikimedia/remex-html/RemexHtml/Serializer/Serializer.php on line 274 [21:49:45] happened 37 times in 15 minutes [21:50:34] these will clear they are temporary. 
it must be related to the bulk compilation I'm running to diff between puppetmasters
[21:51:14] herron: if you're doing… things… is it possible that https://phabricator.wikimedia.org/T178625 was due to an overload you caused and not spontaneous?
[21:51:32] yes certainly related
[21:52:03] Can you add a note to that effect on the ticket?
[21:52:19] you bet adding now
[21:52:41] sorry for the interrupts
[21:53:40] no worries
[21:54:05] :)
[21:54:33] andrewbogott: still so much on critical, or do you think I can unquiet without a spamwave? :)
[21:55:13] 71 now
[21:55:16] so, just about, yeah
[21:56:18] ok, I will give it 5 minutes, then unquiet it
[21:56:31] since 71 recoverys make some pages of logs ;)
[21:56:44] I wonder that it takes longer than 20 minutes
[21:56:56] shouldn't puppet get executed on every host within 20 minutes?
[21:57:27] Can someone ping me when everything is back to normal? I have so changes to push to routers and don't want to create more mess (or miss alerts)
[21:58:03] !log running extra change dispatchers on terbium for wikidatawiki for a short while
[21:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:57] Operations, Puppet: puppetdb failures - https://phabricator.wikimedia.org/T178625#3698143 (herron) This correlates with running catalog diffs in bulk from puppetcompiler1001 and must be related. I ran a diff across all nodes serially overnight last night waiting 20 seconds in between nodes with issue....
[22:04:51] Operations, Puppet: puppetdb failures - https://phabricator.wikimedia.org/T178625#3698211 (herron) By the way the bulk diff is running from a root screen on puppetcompiler1001.eqiad.wmnet. Should these symptoms re-occur this can be checked and if running stopped with ctrl-c.
[22:06:29] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[22:07:29] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[22:09:28] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:09:38] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:09:39] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:09:59] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:09:59] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:10:24] Operations: Improve puppet alerting - https://phabricator.wikimedia.org/T178628#3698220 (herron)
[22:10:39] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:11:08] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:11:49] RECOVERY - puppet last run on aqs1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:12:09] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:21:13] !log stop running extra change dispatchers on terbium for wikidatawiki
[22:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:28] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[22:32:48] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[22:33:25] Operations,
netops, Patch-For-Review, Performance-Team (Radar), Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3698379 (BBlack) +1 LGTM!
[22:33:48] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171019T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:04:36] (PS1) BBlack: Add log output for persist disc. during idle state [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385302
[23:04:37] (PS1) BBlack: add -v flag for verbose logging [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385303
[23:04:40] (PS1) BBlack: Release 0.1.1 [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385304
[23:06:40] (CR) BBlack: [V: 2 C: 2] Add log output for persist disc.
during idle state [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385302 (owner: BBlack)
[23:06:45] (CR) BBlack: [V: 2 C: 2] add -v flag for verbose logging [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385303 (owner: BBlack)
[23:06:54] (CR) BBlack: [V: 2 C: 2] Release 0.1.1 [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/385304 (owner: BBlack)
[23:09:04] (PS1) BBlack: Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/385305
[23:09:06] (PS1) BBlack: vhtcpd (0.1.1-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/385306
[23:09:21] (CR) BBlack: [V: 2 C: 2] Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/385305 (owner: BBlack)
[23:09:26] (CR) BBlack: [V: 2 C: 2] vhtcpd (0.1.1-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/385306 (owner: BBlack)
[23:09:38] RECOVERY - BGP status on cr1-eqdfw is OK: Use of uninitialized value asn in concatenation (.) or string at /usr/lib/nagios/plugins/check_bgp line 306.
[23:14:04] heh, that can't be right
[23:18:54] (PS1) BBlack: check_bgp: report an unknown ASN as critical [puppet] - https://gerrit.wikimedia.org/r/385308