[01:04:08] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[01:04:38] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508461472 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4227623 keys, up 4 minutes 30 seconds - replication_delay is 1508461472
[01:05:39] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4227514 keys, up 5 minutes 32 seconds - replication_delay is 0
[01:13:28] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.115 second response time
[01:40:39] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.198 second response time
[02:17:42] !log uploaded vhtcpd-0.1.1-1 to reprepro
[02:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:38] !log upgrading vhtcpd to 0.1.1-1 on: cp1008, cp3030, cp3034, cp4029-32
[02:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 780.52 seconds
[03:38:25] (CR) Krinkle: "Was hoping to verify on Beta, but looks like it's not had a puppet run on Varnish there in over 10 days?
https://tools.wmflabs.org/sal/log" [puppet] - https://gerrit.wikimedia.org/r/381274 (owner: Krinkle)
[04:06:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 99.42 seconds
[04:08:09] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[04:33:38] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[05:37:22] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/385316
[05:37:29] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/385316
[05:39:04] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/385316 (owner: Marostegui)
[05:40:28] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/385316 (owner: Marostegui)
[05:40:32] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/385316 (owner: Marostegui)
[05:41:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 - T174509 (duration: 00m 47s)
[05:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:42] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:41:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[05:42:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[05:52:28] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[05:52:34] !log Stop
replication in sync on db1103 and db1072 - T164488
[05:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:40] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[05:53:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[05:53:11] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3698687 (Joe)
[05:59:28] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:01:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/385318
[06:01:35] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/385318
[06:02:39] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3698689 (Joe) I think there are a few things at play here: - How do we distribute chromium to the servers in the cluster eff...
[06:03:28] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[06:03:51] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/385318 (owner: Marostegui)
[06:05:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/385318 (owner: Marostegui)
[06:06:11] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/385318 (owner: Marostegui)
[06:06:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 46s)
[06:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:26] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:07:40] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3698693 (MoritzMuehlenhoff) Slight race condition here :-) I had just followed up on https://phabricator.wikimedia.org/T1781...
[06:11:02] (PS1) Marostegui: db-eqiad.php: Depool db1078 to fix data drifts [mediawiki-config] - https://gerrit.wikimedia.org/r/385319 (https://phabricator.wikimedia.org/T164488)
[06:13:08] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 to fix data drifts [mediawiki-config] - https://gerrit.wikimedia.org/r/385319 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:14:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 to fix data drifts [mediawiki-config] - https://gerrit.wikimedia.org/r/385319 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:14:19] Operations, Release Pipeline, Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3698697 (Joe) a:Joe
[06:15:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s)
[06:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:35] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:16:11] (CR) jenkins-bot: db-eqiad.php: Depool db1078 to fix data drifts [mediawiki-config] - https://gerrit.wikimedia.org/r/385319 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:17:42] !log Stop replication in sync on db1103 and db1078 - T164488
[06:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078 to fix data drifts" [mediawiki-config] - https://gerrit.wikimedia.org/r/385320
[06:24:13] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078 to fix data drifts" [mediawiki-config] - https://gerrit.wikimedia.org/r/385320 (owner: Marostegui)
[06:25:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078 to fix data drifts" [mediawiki-config] - https://gerrit.wikimedia.org/r/385320 (owner:
Marostegui)
[06:26:04] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078 to fix data drifts" [mediawiki-config] - https://gerrit.wikimedia.org/r/385320 (owner: Marostegui)
[06:26:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)
[06:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:33] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:28:33] (PS1) Giuseppe Lavagetto: docker-registry: allow pushing images from other hosts [puppet] - https://gerrit.wikimedia.org/r/385321 (https://phabricator.wikimedia.org/T177055)
[06:32:57] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3698712 (Marostegui) >>! In T175970#3696906, @jcrespo wrote: > CC @Marostegui Maybe we can setup an unpuppetized copy of x1 from dbstore2002 on dbstore1002? Yeah, dbstore1002 now has plenty of space (1.9T avai...
[06:34:02] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3698713 (jcrespo) Should we do it on the same instance, is a mysqloader fast enough for x1, do you think?
[06:34:14] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3698714 (jcrespo) a:jcrespo
[06:36:28] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3698715 (Marostegui) >>! In T175970#3698713, @jcrespo wrote: > Should we do it on the same instance, is a mysqloader fast enough for x1, do you think? I would do it on the same instance, otherwise we will need...
[06:37:38] (CR) Giuseppe Lavagetto: [C: 2] docker-registry: allow pushing images from other hosts [puppet] - https://gerrit.wikimedia.org/r/385321 (https://phabricator.wikimedia.org/T177055) (owner: Giuseppe Lavagetto)
[06:43:14] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3698733 (MoritzMuehlenhoff) >>! In T178570#3698689, @Joe wrote: > I think there are a few things at play here: > - How do we...
[06:51:01] It would be nice if anyone with root could check on terbium whether we had php crashing there yesterday or the day before (segfault, …)?
[06:57:39] hoo: having a look
[06:59:05] thanks
[07:01:37] hoo: no, can't find a trace of that
[07:02:02] oh well… thanks for checking, though :)
[07:18:27] <_joe_> !log uploading updated versions of docker base images (wikimedia-jessie, wikimedia-stretch) to docker-registry.wikimedia.org
[07:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:24] !log stopping dbstore1002 and dbstore2002 x1 replication in sync
[07:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:30] !log stopping db2033, too
[07:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:58] (PS4) Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384992
[07:41:09] (CR) Giuseppe Lavagetto: [C: 2] Add the repository to the name of all generated containers [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384992 (owner: Giuseppe Lavagetto)
[07:45:58] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 55 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:48:25] _joe_: good morning!
The Docker registry still 403 when trying to push from contint1001. Maybe nginx needs a reload? :)
[07:49:19] PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.61 seconds
[07:49:24] or maybe our credentials do not let us push to the wikimedia/ namespace
[07:49:55] mmm. I am sure I downtimed that
[07:50:58] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 2 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:58:11] (PS3) Muehlenhoff: Add new component thirdparty/confluent [puppet] - https://gerrit.wikimedia.org/r/385189
[07:59:11] (CR) Muehlenhoff: [C: 2] Add new component thirdparty/confluent [puppet] - https://gerrit.wikimedia.org/r/385189 (owner: Muehlenhoff)
[08:05:39] !log mwscript createAndPromote.php --wiki=techconductwiki --sysop --bureaucrat 'Ladsgroup' --force
[08:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:55] <_joe_> hashar: not really, it should work now
[08:09:11] <_joe_> if it does not, lemme check the logs
[08:09:16] <_joe_> it works from boron
[08:09:27] hmm maybe the credential is wrong so
[08:09:27] <_joe_> and it didn't before my change
[08:09:29] <_joe_> so
[08:09:31] <_joe_> no
[08:09:45] <_joe_> the credentials are ok on contint1001, I checked
[08:13:13] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3698826 (Ladsgroup) >>! In T99531#3692384, @Dzahn wrote: > You might hate me for this question but have you considered using an org domain...
[08:45:09] (PS1) Elukey: profile::aqs: specify the cassandra_local_dc parameter via hiera [puppet] - https://gerrit.wikimedia.org/r/385332
[08:45:33] Operations, Puppet: Improve puppet alerting - https://phabricator.wikimedia.org/T178628#3698863 (Peachey88)
[08:47:19] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8391/aqs1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/385332 (owner: Elukey)
[08:48:45] (CR) Joal: [C: 1] "+1 for me !" [puppet] - https://gerrit.wikimedia.org/r/385332 (owner: Elukey)
[08:50:41] (CR) Elukey: [C: 2] profile::aqs: specify the cassandra_local_dc parameter via hiera [puppet] - https://gerrit.wikimedia.org/r/385332 (owner: Elukey)
[08:51:16] (PS3) Muehlenhoff: Add thirdparty/confluent on stretch-based kafka brokers [puppet] - https://gerrit.wikimedia.org/r/385196
[09:09:38] (PS1) Jcrespo: mariadb: Add all x1 tables, including ones on %wik% schemas on dbstores [puppet] - https://gerrit.wikimedia.org/r/385334 (https://phabricator.wikimedia.org/T175970)
[09:16:39] (CR) Marostegui: [C: 1] mariadb: Add all x1 tables, including ones on %wik% schemas on dbstores [puppet] - https://gerrit.wikimedia.org/r/385334 (https://phabricator.wikimedia.org/T175970) (owner: Jcrespo)
[09:17:02] !log upgrading hhvm-wikidiff2 to 1.5.1 on mw2200-mw2223
[09:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:07] (CR) Jcrespo: [C: 2] mariadb: Add all x1 tables, including ones on %wik% schemas on dbstores [puppet] - https://gerrit.wikimedia.org/r/385334 (https://phabricator.wikimedia.org/T175970) (owner: Jcrespo)
[09:19:09] RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:19:39] (CR) Hashar: "That patch boils down to adding to the Jenkins java command line:" [puppet] - https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) (owner: Hashar)
[09:29:02] Operations, DBA, Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3699019 (jcrespo) To investigate if we need to convert some of these away from uncompressed InnoDB on dbstore1001: ``` +--------------------------------...
[09:30:59] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.17 seconds
[09:33:53] Operations, DBA, Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3699022 (jcrespo) On screen at dbstore1001 CC @marostegui : root@dbstore1001[cebwiki]> ALTER TABLE templatelinks ENGINE=TokuDB; probably commonswiki.r...
[09:34:50] (CR) Thiemo Mättig (WMDE): [C: 1] "I know some people argue such changes are more distracting than helpful, because they don't change anything on the users end. I, personall" [mediawiki-config] - https://gerrit.wikimedia.org/r/383999 (owner: Hoo man)
[09:37:41] Operations, DBA, Patch-For-Review: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3699057 (jcrespo) I am reimporting x1 on dbstore1002 (including the echo tables), hopefully, that will work and will not break anything else :-/
[09:39:03] lag on dbstore1002:s7, that is unexpected
[09:39:10] Operations, netops, Patch-For-Review: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3699063 (Gehel) Jolokia is only listening on localhost, and we are not using discovery. The patch above will disable discovery.
[09:39:49] I think is the existing load
[09:39:54] plus the importing load
[09:40:27] research queries + eventlogging + maybe purges? + import + regular replication
[09:44:35] I think the import is about to finish, so that could help
[09:44:39] (CR) Thiemo Mättig (WMDE): [C: 1] "Looks good to me. Just a few questions and suggestions."
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/384951 (owner: Hoo man)
[09:51:07] (CR) Jcrespo: "They are not distracting, when I do a database check, having them in order helps me not have false positives on diffs if I am doing a diff" [mediawiki-config] - https://gerrit.wikimedia.org/r/383999 (owner: Hoo man)
[09:55:57] (PS1) Joal: Update aqs config template file [puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312)
[09:56:19] (PS1) Hoo man: Enable description usage tracking on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385340 (https://phabricator.wikimedia.org/T178515)
[09:56:25] (CR) jerkins-bot: [V: -1] Update aqs config template file [puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[09:58:08] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 221.65 seconds
[09:58:45] (PS2) Elukey: aqs: update config template file [puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[10:01:04] (PS1) Giuseppe Lavagetto: profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606)
[10:01:15] <_joe_> akosiaris: ^^
[10:01:34] (CR) jerkins-bot: [V: -1] profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:01:42] <_joe_> wat
[10:02:34] <_joe_> interestingly, there is something that happens in CI and not running the same rakefile on my computer
[10:03:31] <_joe_> and it's also a bug
[10:03:47] <_joe_> anyways, fixing
[10:04:29] (CR) Elukey: [C: -2] "Currently testing this code in Labs, a deployment needs to happen in prod before merging."
[puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[10:05:52] (CR) Alexandros Kosiaris: [C: -1] profile::docker::registry: allow using an external certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:06:01] (PS2) Giuseppe Lavagetto: profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606)
[10:06:56] <_joe_> akosiaris: already fixed :P
[10:07:11] <_joe_> but before this can go, we need DNS anyways
[10:13:41] (PS1) Giuseppe Lavagetto: Add entry for docker-registry.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/385343 (https://phabricator.wikimedia.org/T178606)
[10:13:50] (CR) Giuseppe Lavagetto: profile::docker::registry: allow using an external certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:14:07] (CR) Alexandros Kosiaris: [C: 1] profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:14:32] (PS1) Alexandros Kosiaris: Upload my new public SSH key based on my new yubikey [puppet] - https://gerrit.wikimedia.org/r/385344 (https://phabricator.wikimedia.org/T178361)
[10:14:44] !log upgrading hhvm-wikidiff2 to 1.5.1 on mw2150/mw2151/mw2244/mw2245
[10:14:49] <_joe_> akosiaris: are you ok with the dns change?
[10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:53] _joe_: so a question. I haven't seen at all the docker registry code... it stores no state anywhere ?
[10:15:58] aside from swift that is
[10:16:05] <_joe_> only on its datastore, it seems
[10:16:12] <_joe_> I just looked into that
[10:16:38] so.. here's the interesting gotcha
[10:16:43] <_joe_> I've seen reports of people using the s3 backend and then running multiple registries on top of it
[10:16:44] with swift being eventually consistent
[10:17:00] <_joe_> akosiaris: atm the containers are hosted in codfw only, btw
[10:17:04] when a new image is pushed into eqiad
[10:17:13] oh really ?
[10:17:15] <_joe_> it's going to codfw anyways
[10:17:16] <_joe_> yeah
[10:17:18] <_joe_> :D
[10:17:28] damn I am missing parts of the picture
[10:17:50] so ok then.. I was about to point out we would be having different versions of containers per DC
[10:18:06] <_joe_> well only during the push, yeah
[10:18:15] which btw means we need to ban things like "latest", "stable", "testing" etc as tags
[10:18:20] <_joe_> but since we will not using LATEST
[10:18:21] <_joe_> ever
[10:18:31] <_joe_> yeah we need to ban that anyways
[10:18:39] <_joe_> for kubernetes-deployed containers
[10:18:49] yeah.. probably only epochs or version numbers
[10:18:52] <_joe_> everyone realized that was a pattern to destruction some time ago
[10:18:58] depending on how BOFH we want to become
[10:19:38] <_joe_> akosiaris: for base images I though of this: debian-like versions, with "nightly" builds with the scheme +YYYYMMDD
[10:19:56] yeah I am not worried about the base images
[10:20:20] (CR) Alexandros Kosiaris: [C: 1] Add entry for docker-registry.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/385343 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:20:41] I 'll update the credentials in puppet to point out the new DNS
[10:20:52] to use*
[10:21:00] <_joe_> ?
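The tagging policy discussed above (no floating tags such as "latest", "stable" or "testing"; only epochs or version numbers, with base images using Debian-like versions plus a +YYYYMMDD suffix for nightly builds) could be checked with something like the following minimal Python sketch. The function name, the set of banned tags and the regular expression are assumptions made for illustration; nothing here is taken from the actual registry or docker-pkg tooling.

```python
import re
from datetime import datetime

# Assumption: the floating tags named in the discussion are the banned set.
FLOATING_TAGS = {"latest", "stable", "testing"}

# Assumption: an allowed tag is a numeric epoch or a Debian-like version
# (e.g. 0.1.1-1), optionally followed by a +YYYYMMDD nightly suffix.
TAG_PATTERN = re.compile(r"^\d+(\.\d+)*(-\d+)?(\+(\d{8}))?$")


def is_allowed_tag(tag: str) -> bool:
    """Return True if the tag pins an exact version rather than floating."""
    if tag.lower() in FLOATING_TAGS:
        return False
    match = TAG_PATTERN.match(tag)
    if match is None:
        return False
    date_part = match.group(4)
    if date_part is not None:
        # Reject suffixes that look like a date but are not a real one.
        try:
            datetime.strptime(date_part, "%Y%m%d")
        except ValueError:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("latest", "0.1.1-1", "1.0+20171020", "1.0+20171340"):
        print(candidate, is_allowed_tag(candidate))
```

A check of this kind would sit wherever images get pushed or deployed; whether it belongs in the build tooling or in the deployment pipeline is left open in the conversation.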
[10:21:07] (PS1) ArielGlenn: add in-progress files to excludes for rsync of dumps between peers [puppet] - https://gerrit.wikimedia.org/r/385345
[10:21:16] they use darmstadtium
[10:21:22] <_joe_> oh you mean for the builders
[10:21:27] <_joe_> yeah
[10:21:30] for contint1001, yes
[10:21:39] <_joe_> also boron/copper
[10:21:51] (CR) Giuseppe Lavagetto: [C: 2] Add entry for docker-registry.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/385343 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:22:27] (CR) ArielGlenn: [C: 2] add in-progress files to excludes for rsync of dumps between peers [puppet] - https://gerrit.wikimedia.org/r/385345 (owner: ArielGlenn)
[10:26:28] (PS2) Alexandros Kosiaris: Upload my new public SSH key based on my new yubikey [puppet] - https://gerrit.wikimedia.org/r/385344 (https://phabricator.wikimedia.org/T178361)
[10:26:32] (CR) Alexandros Kosiaris: [V: 2 C: 2] Upload my new public SSH key based on my new yubikey [puppet] - https://gerrit.wikimedia.org/r/385344 (https://phabricator.wikimedia.org/T178361) (owner: Alexandros Kosiaris)
[10:31:44] Operations, DBA, Patch-For-Review: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3699146 (jcrespo) Open>Resolved @Etonkovidova, @Catrope sorry for the delay. We how this access to echo tables on dbstore1002 will be relatively reliable. I added grants to all x1...
[10:33:00] (PS1) Giuseppe Lavagetto: Add secret for docker-registry,discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/385347
[10:33:54] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add secret for docker-registry,discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/385347 (owner: Giuseppe Lavagetto)
[10:37:42] (CR) Alexandros Kosiaris: [C: 2] profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:37:46] (PS3) Alexandros Kosiaris: profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[10:37:58] (CR) Alexandros Kosiaris: [V: 2 C: 2] profile::docker::registry: allow using an external certificate [puppet] - https://gerrit.wikimedia.org/r/385342 (https://phabricator.wikimedia.org/T178606) (owner: Giuseppe Lavagetto)
[11:13:51] (PS1) Muehlenhoff: Add thirdparty/k8s component [puppet] - https://gerrit.wikimedia.org/r/385351
[11:58:29] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3551232 (elukey) Updated list (showjob1.txt contains group1, showjob.txt group2) ``` elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob1.tx...
[12:11:29] (Draft2) Jayprakash12345: Enable local Upload in jvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/385359
[12:12:09] (PS3) Jayprakash12345: Enable local Upload in jvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/385359 (https://phabricator.wikimedia.org/T178660)
[12:19:51] (CR) Muehlenhoff: [C: 2] Add thirdparty/k8s component [puppet] - https://gerrit.wikimedia.org/r/385351 (owner: Muehlenhoff)
[12:29:50] (CR) Filippo Giunchedi: "LGTM in principle, I wonder if we could use package_builder::pbuilder_base instead to also install a sources.list.d file (in the host) wit" [puppet] - https://gerrit.wikimedia.org/r/382160 (owner: Muehlenhoff)
[12:32:28] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3699513 (phuedx) Shower thought: It seems like the use of [the puppeteer library](https://www.npmjs.com/package/puppeteer)...
[12:33:30] (CR) Filippo Giunchedi: "> I see that " osm sync lag from /srv/osmosis/state.txt" is checked" [puppet] - https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[12:35:51] (PS4) Hashar: Rakefile: tweak wmf_style output for Jenkins [puppet] - https://gerrit.wikimedia.org/r/382716
[12:36:52] (CR) Hashar: "Finally I found a way to avoid repeating myself at the expense of switching based on $JENKINS_URL being set." [puppet] - https://gerrit.wikimedia.org/r/382716 (owner: Hashar)
[12:37:38] Operations, ops-eqiad, hardware-requests, Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3699515 (fgiunchedi) AFAICT the machine is online with `spare::system` role applied. Once a role giving performance team access is applied it should be u...
[12:41:37] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3699517 (bmansurov) > How do we ensure security upgrades happen in a timely manner for this component? The team maintaining...
[12:45:26] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3699518 (bmansurov) > OTOH there's nothing to stop us from launching a Chromium process ourselves and using command line swi...
[12:51:04] (CR) Alexandros Kosiaris: [C: -1] "Various small comments. As far as Filippo's suggestion goes there's at least 2 exceptions that make it a tad more difficult to reuse pbuil" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/382160 (owner: Muehlenhoff)
[13:00:37] (PS1) Gehel: wdqs: garbage collection tuning [puppet] - https://gerrit.wikimedia.org/r/385364 (https://phabricator.wikimedia.org/T175919)
[13:03:29] (PS1) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuation files [puppet] - https://gerrit.wikimedia.org/r/385365
[13:06:01] (PS1) Faidon Liambotis: Add reverse DNS zones for APNIC space [dns] - https://gerrit.wikimedia.org/r/385366 (https://phabricator.wikimedia.org/T156256)
[13:06:37] (CR) Faidon Liambotis: [C: 2] Add reverse DNS zones for APNIC space [dns] - https://gerrit.wikimedia.org/r/385366 (https://phabricator.wikimedia.org/T156256) (owner: Faidon Liambotis)
[13:10:31] (PS1) ArielGlenn: clean up partially written files from previous 7z runs of same wiki and date [dumps] - https://gerrit.wikimedia.org/r/385368
[13:10:54] Operations, MediaWiki-General-or-Unknown, TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (daniel) **Last Call**: as per the RFC discussion on IRC on October
18[1], this enters the Last Call period. If no pertinent concerns remain unaddressed by...
[13:12:33] Operations, MediaWiki-General-or-Unknown, TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3699580 (Paladox) +1 to 5.6. CI side i belive we can go with 5.6+ as hashar removed trusty :).
[13:14:50] _joe_: yesterday I wanted to change the puppet-lint format output. I might have found a more elegant way without repeating myself ! https://gerrit.wikimedia.org/r/#/c/382716/ :)
[13:15:26] _joe_: and eventually once that is merged, we will have a jenkins job that reports issues left per modules
[13:15:49] (CR) Filippo Giunchedi: ">" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/382160 (owner: Muehlenhoff)
[13:16:21] Operations, Traffic, Patch-For-Review: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3699583 (faidon) OK, so APNIC fixed the "57 duplicate objects" situation, so I proceeded with the rest and specifically: - Updated our objects for the new office address - U...
[13:17:18] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3699585 (MoritzMuehlenhoff) >>! In T178570#3699513, @phuedx wrote: > OTOH there's nothing to stop us from launching a Chromi...
[13:20:35] Operations, netops, Patch-For-Review, Performance-Team (Radar), Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3699587 (faidon) Sounds fine to me. Before we resolve this task, let's not forget that we'll need to cleanup our RIPE objects by remo...
[13:23:23] (CR) Muehlenhoff: Provide deb-src entries for older distros on package builders (4 comments) [puppet] - https://gerrit.wikimedia.org/r/382160 (owner: Muehlenhoff)
[13:23:45] (PS4) Muehlenhoff: Provide deb-src entries for older distros on package builders [puppet] - https://gerrit.wikimedia.org/r/382160
[13:27:33] <_joe_> hashar: nice, it seems it should work as-is
[13:28:03] _joe_: it seems to work locally. I am not happy about the ENV['JENKINS_URL'] hack but we use that trick in various other repos
[13:28:13] and at least that keep a sane output for humans
[13:38:21] _joe_: if you are willing to merge it, I can then trigger the jenkins job and we would have a "nice" dashboard :D
[13:40:06] <_joe_> hashar: will do in a few
[13:42:04] (CR) Elukey: "pcc looks good, https://puppet-compiler.wmflabs.org/compiler02/8394/ I left a couple of comments but nothing really important. Not sure if" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/385196 (owner: Muehlenhoff)
[13:44:23] <_joe_> hashar: I am ok with that change, but I was thinking
[13:44:43] <_joe_> we can use the linter statistics if you just need counts
[13:45:05] the plugin tracks counts per modules as well :)
[13:46:57] Jenkins would parse the output, keep track of errors per files/modules
[13:47:07] and on each build show the progress compared to the previous one
[13:49:18] <_joe_> ok
[13:49:37] <_joe_> seems good enough :P
[13:49:37] I cant find the build report anymore though :(
[13:50:02] (CR) Giuseppe Lavagetto: [C: 2] Rakefile: tweak wmf_style output for Jenkins [puppet] - https://gerrit.wikimedia.org/r/382716 (owner: Hashar)
[13:54:29] the format trick works :D
[13:54:37] but of course the plugin cant parse it bah
[13:57:24] (CR) Alexandros Kosiaris: [C: 1] Provide deb-src entries for older distros on package builders [puppet] - https://gerrit.wikimedia.org/r/382160 (owner: Muehlenhoff)
[13:57:39] (PS1) Odder: Working Class Movement
Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) [14:01:08] (03CR) 10Muehlenhoff: Add thirdparty/confluent on stretch-based kafka brokers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385196 (owner: 10Muehlenhoff) [14:04:37] (03PS1) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:05:07] (03CR) 10jerkins-bot: [V: 04-1] cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:06:08] (03PS1) 10Hashar: Rakefile: wmf_style format now support KIND [puppet] - 10https://gerrit.wikimedia.org/r/385379 [14:06:22] _joe_: https://gerrit.wikimedia.org/r/385379 Rakefile: wmf_style format now support KIND (sorry) [14:07:07] (03PS2) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:07:38] (03CR) 10jerkins-bot: [V: 04-1] cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:08:36] (03PS3) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:12:09] (03PS1) 10Jgreen: remove unused jgreen@neo ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/385380 [14:15:29] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3699664 (10Joe) >>! In T178570#3699585, @MoritzMuehlenhoff wrote: >>>! In T178570#3699513, @phuedx wrote: >> OTOH there's noth... 
[14:15:52] (03CR) 10Jgreen: [C: 032] remove unused jgreen@neo ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/385380 (owner: 10Jgreen) [14:16:05] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3699666 (10herron) [14:16:08] 10Operations, 10Puppet, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3672181 (10herron) [14:17:24] (03PS1) 10Mobrovac: CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) [14:17:31] (03CR) 10jerkins-bot: [V: 04-1] CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [14:18:59] (03PS4) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:19:42] (03PS2) 10Mobrovac: CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) [14:19:49] (03CR) 10jerkins-bot: [V: 04-1] CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [14:20:35] (03PS3) 10Mobrovac: CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) [14:20:42] (03CR) 10jerkins-bot: [V: 04-1] CP-JobQueue: Use the Special:RunSingleJob page to execute jobs [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [14:22:32] (03CR) 10Mobrovac: [C: 04-1] "This is not ready to go. 
We first need to ensure the EventBus code has been deployed, and then we need to switch the LVS to point to the m" [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [14:24:03] (03CR) 10Zoranzoki21: [C: 031] Enable local Upload in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385359 (https://phabricator.wikimedia.org/T178660) (owner: 10Jayprakash12345) [14:25:07] (03PS1) 10Alexandros Kosiaris: prometheus: Add a service label for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/385385 [14:27:09] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3699687 (10herron) [14:27:11] 10Operations, 10Puppet, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3699686 (10herron) 05Open>03Resolved [14:27:34] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [14:29:04] 10Operations, 10monitoring: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3699692 (10fgiunchedi) [14:29:16] (03PS5) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:30:09] (03PS2) 10Giuseppe Lavagetto: Rakefile: wmf_style format now support KIND [puppet] - 10https://gerrit.wikimedia.org/r/385379 (owner: 10Hashar) [14:32:44] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: wmf_style format now support KIND [puppet] - 10https://gerrit.wikimedia.org/r/385379 (owner: 10Hashar) [14:35:28] (03CR) 10Ema: "pcc output looks good https://puppet-compiler.wmflabs.org/compiler02/8397/" [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:37:14] 
10Operations, 10Puppet: Improve puppet alerting - https://phabricator.wikimedia.org/T178628#3698220 (10fgiunchedi) Great idea @herron ! We do have puppet failure metrics in Prometheus now too so we could for example (braindump territory) have individual host puppet alerts depend (and being inhibited by) a site... [14:39:09] !log updating power firmware analytics1037 [14:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] (03CR) 10Filippo Giunchedi: [C: 031] Provide deb-src entries for older distros on package builders [puppet] - 10https://gerrit.wikimedia.org/r/382160 (owner: 10Muehlenhoff) [14:44:27] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3699736 (10Cmjohnson) 05Open>03Resolved @herron @faidon I updated the f/w on both servers and the issue has been resolved. [14:48:00] (03PS5) 10Muehlenhoff: Provide deb-src entries for older distros on package builders [puppet] - 10https://gerrit.wikimedia.org/r/382160 [14:48:34] (03CR) 10Muehlenhoff: [C: 032] Provide deb-src entries for older distros on package builders [puppet] - 10https://gerrit.wikimedia.org/r/382160 (owner: 10Muehlenhoff) [14:48:38] (03CR) 10Filippo Giunchedi: "Note that class_config is already tagging the resulting hosts with $cluster variable (from puppet), the idea being that cluster is the can" [puppet] - 10https://gerrit.wikimedia.org/r/385385 (owner: 10Alexandros Kosiaris) [14:49:13] (03CR) 10Ema: "Minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385365 (owner: 10Muehlenhoff) [14:56:04] (03CR) 10Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuation files (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385365 (owner: 10Muehlenhoff) [14:56:14] (03PS2) 10Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuation files [puppet] - 10https://gerrit.wikimedia.org/r/385365 [14:56:48] PROBLEM - 
Nginx local proxy to apache on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [14:56:49] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [14:57:01] moritzm: missing an "r" there [14:57:48] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.046 second response time [14:57:49] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 77150 bytes in 0.094 second response time [15:02:15] (03CR) 10Filippo Giunchedi: Add systemd::tmpfile to configure /etc/tmpfiles.d configuation files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385365 (owner: 10Muehlenhoff) [15:05:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3699779 (10Cmjohnson) Still no sign of failure from the h/w log....it took awhile last time [15:09:17] good point [15:09:51] (03PS3) 10Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - 10https://gerrit.wikimedia.org/r/385365 [15:13:15] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3699799 (10herron) Oddly the icinga alerts still haven't cleared https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1036&service=IPMI+Sensor+Status htt... 
[15:14:08] PROBLEM - Apache HTTP on labweb1002 is CRITICAL: connect to address 10.64.48.170 and port 80: Connection refused [15:14:18] PROBLEM - nutcracker process on labweb1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [15:14:20] PROBLEM - HHVM rendering on labweb1001 is CRITICAL: connect to address 10.64.16.200 and port 80: Connection refused [15:14:29] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[apache2] [15:14:38] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:14:48] PROBLEM - nutcracker port on labweb1002 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:14:48] PROBLEM - HHVM rendering on labweb1002 is CRITICAL: connect to address 10.64.48.170 and port 80: Connection refused [15:14:49] PROBLEM - nutcracker process on labweb1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [15:14:49] PROBLEM - Apache HTTP on labweb1001 is CRITICAL: connect to address 10.64.16.200 and port 80: Connection refused [15:14:59] PROBLEM - nutcracker port on labweb1001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [15:14:59] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3699830 (10Cmjohnson) yeah, I see that they are failed again. I don't know why...i tried swapping another spare from the decom ms-be servers and it lights up but still show... [15:31:31] <_joe_> andrewbogott: ^^ should I worry? 
[15:31:56] _joe_: not about labweb, that's probably just a downtime expiring [15:31:58] <_joe_> oh a stretch appserver [15:32:01] (03CR) 10BBlack: [C: 031] cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [15:32:03] (03PS4) 10Alexandros Kosiaris: Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 [15:32:04] <_joe_> how fancy :P [15:32:05] (03PS4) 10Alexandros Kosiaris: package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 [15:32:07] (03PS4) 10Alexandros Kosiaris: package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 [15:32:09] (03PS4) 10Alexandros Kosiaris: package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 [15:32:33] (03CR) 10Ayounsi: [C: 031] jenkins: disable auto-discovery [puppet] - 10https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) (owner: 10Hashar) [15:32:45] (03CR) 10Alexandros Kosiaris: package_builder: Add buster as an environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385002 (owner: 10Alexandros Kosiaris) [15:32:54] (03CR) 10Alexandros Kosiaris: [C: 032] Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 (owner: 10Alexandros Kosiaris) [15:32:55] _joe_: do you think i can delete all the ganglia stuff from appservers, apache and hhvm? hopefully what is in grafana is at least just as good? 
[15:33:01] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 (owner: 10Alexandros Kosiaris) [15:33:12] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 (owner: 10Alexandros Kosiaris) [15:33:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3699835 (10Andrew) I installed 20 VMs, and ran stress-ng on each of them, like this: andrew@labpuppetmaster1001:~$ sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu... [15:33:30] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 (owner: 10Alexandros Kosiaris) [15:33:36] <_joe_> mutante: check if it is, but yeah, in general I think so [15:33:40] <_joe_> mutante: also ask elenah [15:33:42] <_joe_> err [15:33:46] cmjohnson1: that was me running a stress test on labvirt1015. It didn't hold up very well :( [15:33:46] (03CR) 10Alexandros Kosiaris: [C: 032] "Comments addressed. Thanks moritz. I am merging" [puppet] - 10https://gerrit.wikimedia.org/r/385002 (owner: 10Alexandros Kosiaris) [15:33:47] <_joe_> I meant elukey [15:33:54] _joe_: alright :) [15:34:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3699838 (10Andrew) >>! In T171473#3699830, @Cmjohnson wrote: > yeah, I see that they are failed again. I don't know why...i tried swapping another spare from the decom ms-b... [15:35:02] andrewbogott: yep looks like they sent us a bad CPU...the error moved with it [15:35:13] I will put in a ticket for a new one now [15:35:21] hm, ok [15:35:36] Is there some point where we can just stamp LEMON! on the case and send the whole damn thing back for replacement? 
[15:35:44] the parts they send are refurb...so it doesn't happen often but it does happen [15:35:44] Because I'm well past that point, emotionally [15:35:58] no, it doesn't work that way while it's under warranty [15:36:09] I will request a new system board as well just in case [15:36:35] ok, thanks [15:36:44] I'm sure you're just as tired of this as I am :) [15:38:57] no, not yet but close [15:45:29] RECOVERY - IPMI Sensor Status on analytics1036 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:46:43] herron ^ swapped that PSU...looks like it's up [15:47:38] nice! [15:48:09] RECOVERY - IPMI Sensor Status on db1054 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:51:05] marostegui ^ replaced the psu and updated f/w [15:51:19] herron: trying analytics1037 one more time [15:51:50] fingers crossed [15:53:20] (03CR) 10Dzahn: "+2 from elukey. link to dashboard as already pasted by Filippo https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?" [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:54:04] (03CR) 10Elukey: Add thirdparty/confluent on stretch-based kafka brokers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385196 (owner: 10Muehlenhoff) [15:54:16] (03CR) 10Ayounsi: [C: 031] "Aa pulling is done via SNMP, it's possible that part of the reply got dropped (or never sent), etc... 
Hard to tel without more debug outpu" [puppet] - 10https://gerrit.wikimedia.org/r/385308 (owner: 10BBlack) [15:54:56] (03CR) 10Dzahn: apache: remove ganglia monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [15:58:59] (03PS1) 10Herron: puppet: depool codfw puppetmaster for upgrade [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) [15:59:11] (03CR) 10jerkins-bot: [V: 04-1] puppet: depool codfw puppetmaster for upgrade [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [15:59:39] (03PS1) 10Gehel: maps: move kartotherian and tilerator sources to puppet [puppet] - 10https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) [15:59:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3699912 (10Cmjohnson) @andrew..wrote that in the wrong ticket [15:59:42] (03PS1) 10Gehel: maps: use puppet generated configs for sources. [puppet] - 10https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215) [16:00:14] (03CR) 10jerkins-bot: [V: 04-1] maps: move kartotherian and tilerator sources to puppet [puppet] - 10https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) (owner: 10Gehel) [16:01:24] (03PS2) 10Dzahn: apache: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) [16:01:49] (03PS2) 10Gehel: maps: move kartotherian and tilerator sources to puppet [puppet] - 10https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) [16:01:51] (03PS2) 10Gehel: maps: use puppet generated configs for sources. 
[puppet] - 10https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215) [16:05:20] !log removing "policy-options policy-statement LVS_import term secondary" on cr routers [16:05:25] !log krinkle@tin Synchronized php-1.31.0-wmf.4/extensions/WikimediaMaintenance/getJobQueueLengths.php: I8737795e6 (duration: 00m 47s) [16:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3699920 (10Cmjohnson) replaced both PSUs in analytics1037; the PSU in an1036 cleared the error and is functioning normally as far as I can tell. [16:14:13] cmjohnson1: nice, last time db1054 failed a bit after the replacement, let's hope this time it doesn't! [16:14:15] chasemp: ok with this? https://gerrit.wikimedia.org/r/#/c/384895/ it's exactly like the one you merged, just that it's the DISK check vs the PROC check. I should have done both right away [16:14:45] (03PS3) 10Gehel: maps: move kartotherian and tilerator sources to puppet [puppet] - 10https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) [16:14:46] (03PS3) 10Gehel: maps: use puppet generated configs for sources. [puppet] - 10https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215) [16:16:18] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3699944 (10Dzahn) [16:16:20] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3699942 (10Dzahn) 05Open>03stalled setting ticket to stalled so it doesn't get fully decom'ed/wiped yet. 
[16:21:38] (03PS1) 10Dzahn: osmium: tmp restore access for perf-roots admins [puppet] - 10https://gerrit.wikimedia.org/r/385399 (https://phabricator.wikimedia.org/T175093) [16:23:32] (03PS2) 10Dzahn: osmium: tmp restore access for perf-roots admins [puppet] - 10https://gerrit.wikimedia.org/r/385399 (https://phabricator.wikimedia.org/T175093) [16:24:11] (03CR) 10Dzahn: [C: 032] osmium: tmp restore access for perf-roots admins [puppet] - 10https://gerrit.wikimedia.org/r/385399 (https://phabricator.wikimedia.org/T175093) (owner: 10Dzahn) [16:25:57] gilles: Krinkle: you have access to osmium again now ^ [16:28:37] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3699977 (10Dzahn) @Gilles Your access has been (temp) restored. You should be able to login again (and @Krinkle as well). Note that i... [16:29:19] (03PS4) 10Gehel: maps: move kartotherian and tilerator sources to puppet [puppet] - 10https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) [16:29:21] (03PS4) 10Gehel: maps: use puppet generated configs for sources. 
[puppet] - 10https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215) [16:32:50] (03PS1) 10BBlack: ulsfo revdns: cleanup commentary [dns] - 10https://gerrit.wikimedia.org/r/385401 [16:32:52] (03PS1) 10BBlack: eqsin revdns: strawman subnet plan [dns] - 10https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256) [16:33:36] (03CR) 10BBlack: [C: 032] ulsfo revdns: cleanup commentary [dns] - 10https://gerrit.wikimedia.org/r/385401 (owner: 10BBlack) [16:37:00] (03PS2) 10BBlack: eqsin revdns: strawman subnet plan [dns] - 10https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256) [16:37:51] !log aaron@tin Synchronized php-1.31.0-wmf.4/extensions/OATHAuth: c16a94e17b9981 (duration: 00m 47s) [16:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:28] !log aaron@tin Synchronized php-1.31.0-wmf.4/extensions/TwoColConflict: 5c3224980fe (duration: 00m 46s) [16:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:43] (03PS2) 10BBlack: check_bgp: report an unknown ASN as critical [puppet] - 10https://gerrit.wikimedia.org/r/385308 [16:39:47] heh [16:40:23] (03CR) 10BBlack: [C: 032] check_bgp: report an unknown ASN as critical [puppet] - 10https://gerrit.wikimedia.org/r/385308 (owner: 10BBlack) [16:42:02] greg-g: didn't even notice the merge wasn't for master, so may as well ;) [16:42:57] Krinkle: https://gerrit.wikimedia.org/r/385248 was interesting. A type hint saying "array" for a string by mistake ($x = 's') made phan completely bomb out [16:43:23] (e.g. string param with a default) [16:44:13] AaronSchulz: Ha, that's odd. No error report. 
[16:44:18] Although the CLI output does show [16:44:18] https://integration.wikimedia.org/ci/job/mediawiki-core-php70-phan-docker/168/console [16:44:35] Not sure what made it not produce an XML file for Jenkins [16:49:45] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3700037 (10herron) One down! [16:52:16] (03PS2) 10Herron: puppet: depool codfw puppetmaster for upgrade [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) [16:53:03] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3700038 (10herron) [16:55:43] 10Operations, 10Pybal, 10Traffic, 10netops, 10Patch-For-Review: Deploy pybal with BGP MED support (for primary/backup) in production - https://phabricator.wikimedia.org/T165584#3700045 (10ayounsi) 05Open>03Resolved a:03ema Done! [16:55:46] Krinkle: so it's not phan but our own script? [16:57:30] AaronSchulz: It is phan, but we have a wrapper entry point in mediawiki-core/tests/phan/bin/phan that sets up some parameters and configs, including the stdout format and 'issues/latest' subtree that should be created. [16:57:36] thanks AaronSchulz [16:57:38] But it seems it didn't find any files for this run. [16:57:45] Maybe because it's a warning and not an error [17:00:36] anything in autoloader.log in the last 30 minutes? 
[17:01:51] no, but the hits take a while to come historically [17:01:57] it will take some hours to tell [17:05:31] (03PS3) 10Dzahn: apache: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) [17:05:31] nonsense, prognosticating from small sample sizes is what we do in the internet economy [17:07:28] (03PS4) 10Dzahn: apache: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) [17:10:35] (03PS2) 10Dzahn: hhvm: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382915 (https://phabricator.wikimedia.org/T177225) [17:14:38] (03CR) 10Ayounsi: [C: 031] "Small typo, otherwise LGTM." (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256) (owner: 10BBlack) [17:24:33] (03CR) 10Dzahn: "grafana-admin also has other HHVM related metrics that could be added to dashboards but are not necessarily on the current dashboard" [puppet] - 10https://gerrit.wikimedia.org/r/382915 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [17:27:55] (03CR) 10Dzahn: [C: 032] apache: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382909 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [17:28:51] (03PS3) 10Dzahn: hhvm: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382915 (https://phabricator.wikimedia.org/T177225) [17:29:44] (03CR) 10Dzahn: [C: 032] hhvm: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382915 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [17:33:15] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/usr/lib/ganglia/python_modules/hhvm_mem.py],File[/usr/lib/ganglia/python_modules/hhvm_health.py],File[/usr/lib/ganglia/python_modules/apache_status.py] [17:33:49] should be transient, already confirmed on other mw servers [17:34:25] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/lib/ganglia/python_modules/hhvm_mem.py],File[/etc/ganglia/conf.d/hhvm_health.pyconf],File[/usr/lib/ganglia/python_modules/apache_status.py] [17:34:26] double checks and yes, it's fine after running it again [17:35:51] is ready to temp stop the bot if there are more, but it's fine [17:36:34] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/lib/ganglia/python_modules/hhvm_mem.py],File[/usr/lib/ganglia/python_modules/hhvm_health.py],File[/usr/lib/ganglia/python_modules/apache_status.py] [17:38:15] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:38:57] !log removing Apache and HHVM Ganglia stats from all appservers, part of retiring Ganglia (T177225) - some transient puppet issues while purging package and remnants - use grafana dashboard at https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats [17:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:05] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [17:39:25] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:40:47] (03CR) 10BBlack: eqsin revdns: strawman subnet plan (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256) (owner: 10BBlack) [17:41:34] RECOVERY - puppet last run on mw2185 is 
OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:42:29] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3700199 (10Chicocvenancio) [17:42:38] (03PS3) 10BBlack: eqsin revdns: strawman subnet plan [dns] - 10https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256) [17:46:25] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3696395 (10BBlack) Most likely the error is just inconsistent over the time domain at the backend (swift -> (MW || Thumbor)). If Varnis... [17:51:09] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3700231 (10zhuyifei1999) I don't think thumbor should be involved here. The script fetch original versions of the files, not the thumbna... 
[17:54:16] (03PS1) 10Dzahn: ganglia::decom: also purge package libganglia1 [puppet] - 10https://gerrit.wikimedia.org/r/385412 (https://phabricator.wikimedia.org/T177225) [17:56:50] (03PS1) 10BBlack: htcppurger: move puppetization to base profile + hieradata [puppet] - 10https://gerrit.wikimedia.org/r/385413 [17:56:52] (03PS1) 10BBlack: htcppurger: vhtcpd-0.1.x specifics [puppet] - 10https://gerrit.wikimedia.org/r/385414 [17:56:54] (03PS1) 10BBlack: htcppurger: per-dc/cluster delay data [puppet] - 10https://gerrit.wikimedia.org/r/385415 [17:57:15] (03CR) 10jerkins-bot: [V: 04-1] htcppurger: move puppetization to base profile + hieradata [puppet] - 10https://gerrit.wikimedia.org/r/385413 (owner: 10BBlack) [17:58:58] (03PS2) 10BBlack: htcppurger: move puppetization to base profile + hieradata [puppet] - 10https://gerrit.wikimedia.org/r/385413 [17:59:00] (03PS2) 10BBlack: htcppurger: vhtcpd-0.1.x specifics [puppet] - 10https://gerrit.wikimedia.org/r/385414 [17:59:02] (03PS2) 10BBlack: htcppurger: per-dc/cluster delay data [puppet] - 10https://gerrit.wikimedia.org/r/385415 [18:06:20] (03PS2) 10Dzahn: mysql/icinga/labtest: no pages if on labtest, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) [18:07:42] (03CR) 10Dzahn: [C: 032] mysql/icinga/labtest: no pages if on labtest, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [18:09:01] (03PS2) 10Dzahn: ganglia::decom: also purge package libganglia1 [puppet] - 10https://gerrit.wikimedia.org/r/385412 (https://phabricator.wikimedia.org/T177225) [18:11:12] (03CR) 10Dzahn: [C: 032] ganglia::decom: also purge package libganglia1 [puppet] - 10https://gerrit.wikimedia.org/r/385412 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:11:17] (03CR) 10BBlack: [C: 031] "PCC says functional no-op, just notes the +classparam diffs" [puppet] - 10https://gerrit.wikimedia.org/r/385413 (owner: 10BBlack) 
[18:11:32] (03CR) 10BBlack: [C: 031] "PCC just shows the expected drop of "-l 1024" everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/385414 (owner: 10BBlack) [18:13:54] (03CR) 10Dzahn: [C: 031] "wmf-style: total violations delta 0 :)" [puppet] - 10https://gerrit.wikimedia.org/r/385413 (owner: 10BBlack) [18:13:56] (03CR) 10BBlack: [C: 031] "PCC shows expected argument changes for +delay: https://puppet-compiler.wmflabs.org/compiler02/8402/" [puppet] - 10https://gerrit.wikimedia.org/r/385415 (owner: 10BBlack) [18:15:48] (03CR) 10Dzahn: [C: 04-1] "wait until pgsql part has been migrated ( or?)" [puppet] - 10https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:16:50] (03CR) 10Dzahn: [C: 04-1] "wait until pgsql metrics have been migrated?" [puppet] - 10https://gerrit.wikimedia.org/r/382906 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:16:59] (03PS3) 10BBlack: htcppurger: move puppetization to base profile + hieradata [puppet] - 10https://gerrit.wikimedia.org/r/385413 [18:17:29] (03CR) 10Dzahn: "Herron, do you (still? ever?) use Ganglia to look at any exim things?" [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:17:32] (03CR) 10BBlack: [C: 032] htcppurger: move puppetization to base profile + hieradata [puppet] - 10https://gerrit.wikimedia.org/r/385413 (owner: 10BBlack) [18:19:33] (03PS4) 10Dzahn: Support --bare in git::clone() [puppet] - 10https://gerrit.wikimedia.org/r/383842 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [18:26:10] (03CR) 10Herron: "Nope, no objection here!" 
[puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:28:34] !log upgrading vhtcpd to 0.1.1-1 on the cp* fleet [18:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:30] (03PS3) 10BBlack: htcppurger: vhtcpd-0.1.x specifics [puppet] - 10https://gerrit.wikimedia.org/r/385414 [18:38:32] (03PS3) 10BBlack: htcppurger: per-dc/cluster delay data [puppet] - 10https://gerrit.wikimedia.org/r/385415 (https://phabricator.wikimedia.org/T133821) [18:39:56] (03CR) 10BBlack: [C: 032] htcppurger: vhtcpd-0.1.x specifics [puppet] - 10https://gerrit.wikimedia.org/r/385414 (owner: 10BBlack) [18:40:03] (03CR) 10BBlack: [C: 032] htcppurger: per-dc/cluster delay data [puppet] - 10https://gerrit.wikimedia.org/r/385415 (https://phabricator.wikimedia.org/T133821) (owner: 10BBlack) [18:41:24] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active [18:52:25] !log vhtcpd upgrade + queue delay puppetization deploy ( https://gerrit.wikimedia.org/r/385415 ) done on cp* fleet - T133821 [18:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:35] T133821: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 [19:01:00] mutante: hello support --bare for git::clone can be merged ( https://gerrit.wikimedia.org/r/383842 ) :D [19:01:57] hasharAway: well, i wasnt so sure before compiling it on bromine [19:02:03] but i did now [19:02:52] mutante: should be a noop unless one pass $bare [19:04:08] (03PS1) 10BBlack: Set q_mem + q_max_mem stats correctly at startup [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385428 [19:04:23] (03CR) 10BBlack: [V: 032 C: 032] Set q_mem + q_max_mem stats correctly at startup [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385428 (owner: 10BBlack) [19:06:46] hasharAway: except the new parameter gets added, almost [19:07:03] (03CR) 10Dzahn: [C: 032] Support --bare in 
git::clone() [puppet] - 10https://gerrit.wikimedia.org/r/383842 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [19:07:06] herron: Is it safe to assume that a 4.8 puppet client won't be able to talk to a 3.x master? [19:07:18] But that the reverse (an old client talking to a new master) is fine? [19:07:20] mutante: danke :) [19:07:40] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/8403/" [puppet] - 10https://gerrit.wikimedia.org/r/383842 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [19:07:51] (03PS5) 10Dzahn: Support --bare in git::clone() [puppet] - 10https://gerrit.wikimedia.org/r/383842 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [19:08:32] yeah, generally the master version should be greater [19:08:37] or equal [19:09:31] herron: ok, that makes me think that I definitely can't upgrade the wmcs puppetmasters until after prod is upgraded (since labpuppetmaster1001 et. al. are clients of the prod puppetmaster) [19:09:46] unless it's possible to have different versions for client/master on the same host? Which… seems unlikely [19:10:39] hasharAway: no problem, and no-op confirmed, enjoy the weekend [19:10:54] makes sense. fwiw I'm planning to depool and upgrade the codfw master next week [19:11:03] maybe there's an opportunity to do some testing there? [19:11:58] mutante: I have another one for jenkins if you dont mind. It adds a few start up parameters to jenkins : https://gerrit.wikimedia.org/r/#/c/385230/ :) [19:12:03] mutante: I will baby sit it on the hosts [19:12:36] mutante: the context is to stop Jenkins from listening to multicast (we dont need the feature) [19:12:51] hasharAway: ah the multicast thing:) and we dont use it, saw the parent ticket. ok [19:13:00] untested [19:13:02] herron: yeah, once you have a puppetmaster working I can point some of my test boxes at it. 
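A rough sketch of the rule of thumb from the exchange above (the master's version should be greater than or equal to the client's). The helper names and the dotted-version parsing are illustrative assumptions, not part of Puppet, and the heuristic is only what was said in the conversation, not an official compatibility matrix.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.8.2' into (4, 8, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def master_can_serve(client_version: str, master_version: str) -> bool:
    """Rule of thumb from the chat: master version >= client version.

    Hypothetical helper for illustration only.
    """
    return parse_version(master_version) >= parse_version(client_version)


if __name__ == "__main__":
    # A 4.8 client against a 3.x master: not expected to work.
    print(master_can_serve("4.8.2", "3.8.5"))
    # An older client against a newer master: expected to be fine.
    print(master_can_serve("3.8.5", "4.8.2"))
```

Under this heuristic the wmcs puppetmasters, being clients of the prod puppetmaster, would have to wait until prod is upgraded, which matches the conclusion reached in the chat.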
[19:13:08] but I am trusting the jenkins doc :] [19:13:17] let me link that, actually it isnt [19:13:20] sounds good I'll ping you next week [19:13:38] nvm, it is [19:13:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3700540 (10Andrew) [19:14:13] (03PS2) 10Dzahn: jenkins: disable auto-discovery [puppet] - 10https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) (owner: 10Hashar) [19:15:10] (03CR) 10Dzahn: [C: 032] jenkins: disable auto-discovery [puppet] - 10https://gerrit.wikimedia.org/r/385230 (https://phabricator.wikimedia.org/T178608) (owner: 10Hashar) [19:15:45] hasharAway: let the baby-sitting begin [19:15:49] bblack: is there a way to force upload.wm.o varnish to miss? [19:16:07] x-wikimedia-debug does not seem to do [19:16:19] yeah X-Wikimedia-Debug is specific to text traffic->MW [19:16:49] but, easier than forcing a miss is just to query swift internally for the same URL [19:17:33] * zhuyifei1999_ don't have access [19:17:38] right [19:17:48] but we don't really have a "force a miss" option on upload, intentionally heh [19:17:56] sigh [19:18:06] is there even a test URL we can try? 
I don't see one in the ticket [19:18:43] (you could also purge the URL, but that would only get you one miss, if the next req is a success) [19:19:18] !log restarting Jenkins - T178608 [19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:26] T178608: Disable Jenkins autodiscovery system - https://phabricator.wikimedia.org/T178608 [19:20:01] bblack: http://paws-public.wmflabs.org/paws-public/User:zhuyifei1999/urls [19:20:46] usually on paws at least one of them returns 500 with pywikibot, but I haven't observed a single case with curl [19:21:17] (03PS2) 10Zoranzoki21: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: 10Odder) [19:21:23] (03CR) 10Zoranzoki21: [C: 031] Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: 10Odder) [19:21:59] (03PS3) 10Zoranzoki21: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: 10Odder) [19:22:34] (03CR) 10Zoranzoki21: [C: 031] Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: 10Odder) [19:23:01] zhuyifei1999_: I just ran through them all direct-to-swift internally, and they all returned 200 OK [19:23:09] huh [19:23:13] I can iterate some more and see if I can get a failure [19:23:50] (03CR) 10Zoranzoki21: [C: 031] "Now is all ok.. 
I removed one expired rule to file be clean" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: 10Odder) [19:24:32] I wonder if this could be somhow related to %-encoding [19:24:52] I doubt that would cause a 500 though, more likely to cause a 404 [19:25:02] 10Operations, 10netops, 10Patch-For-Review: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3700551 (10hashar) [19:25:04] 10Operations, 10Continuous-Integration-Infrastructure, 10netops, 10Jenkins, and 2 others: Disable Jenkins autodiscovery system - https://phabricator.wikimedia.org/T178608#3700549 (10hashar) 05Open>03Resolved `netstat -gn` seems to indicate there is no more any multicast. [19:26:19] zhuyifei1999_: can you repro your 500 response easily? (and what's the X-Cache line on the current repro you can do?) I could watch manually to observe the 500 [19:26:31] yeah [19:26:49] there's like 1 in 2 chance each time I iterate through the list [19:26:53] (03CR) 10Dzahn: [C: 032] "no-op on contint1001 where this is hosted http://puppet-compiler.wmflabs.org/8404/" [puppet] - 10https://gerrit.wikimedia.org/r/382140 (owner: 10Hashar) [19:27:02] 10Operations, 10netops, 10Patch-For-Review: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10hashar) >>! In T167842#3352923, @BBlack wrote: > ... these were the surprises: > ``` > ===== NODE GROUP =====... [19:27:14] (03PS4) 10Dzahn: contint::website to a profile [puppet] - 10https://gerrit.wikimedia.org/r/382140 (owner: 10Hashar) [19:27:14] mutante: the jenkins patch worked just fine. !!! [19:27:18] so yeah, just let me know what the X-Cache line is (on any success/fail) from your attempts, so I know which frontend to monitor [19:27:34] hasharAway: great! 
meanwhile i merged another one [19:27:36] and then we can repro again while I watch verbose logging of the 500 [19:28:15] last one was cp1074 miss, cp1074 pass on https://upload.wikimedia.org/wikipedia/commons/a/ad/Constitui%C3%A7%C3%A3o_da_Rep%C3%BAblica_dos_Estados_Unidos_do_Brasil_de_1937_p._02.jpg [19:28:22] ok [19:28:52] the one before that was cp1049 miss, cp1074 pass https://upload.wikimedia.org/wikipedia/commons/2/2a/Constitui%C3%A7%C3%A3o_da_Rep%C3%BAblica_dos_Estados_Unidos_do_Brasil_de_1937_p._16.jpg [19:28:56] did you just trigger one? [19:29:02] not yet [19:29:03] mutante: wmf-style: total violations delta -2 !! [19:29:04] a sec [19:29:05] I just saw a live one on cp1074 frontend, for /wikipedia/commons/thumb/1/10/Map_of_Alabama_highlighting_St_Clair_County.svg/63px-Map_of_Alabama_highlighting_St_Clair_County.svg.png [19:29:13] - RespStatus 500 [19:29:13] - RespReason Internal Server Error [19:29:13] - RespHeader engine: wikimedia_thumbor.engine.svg [19:29:17] hasharAway: lol, i had that exact string in clipboard ::)) [19:29:39] zhuyifei1999_: got the req [19:29:40] - ReqURL /wikipedia/commons/1/16/Constitui%C3%A7%C3%A3o_da_Rep%C3%BAblica_dos_Estados_Unidos_do_Brasil_de_1937_p._34.jpg [19:29:47] yeah [19:30:00] - RespStatus 500 [19:30:00] - RespReason Internal Error [19:30:12] - RespHeader X-Trans-Id: tx5f7ada82081944279b1bb-0059ea4e96 [19:30:19] cp1064 miss, cp1074 pass [19:30:20] mutante: with Giuseppe we now have a Jenkins job that report the wmfstyle_guide errors https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-wmf-style-guide/133/warnings47Result/ :D [19:30:22] I think swift injects that X-Trans-Id [19:30:43] mutante: runs per commit more or less. I will craft a mail to the ops list next week if I am happy with the result [19:31:26] let me dig a little... 
[19:31:39] thanks [19:31:51] hasharAway: getting slowed down by bad wifi [19:32:19] hasharAway: yes, i know that:0 i have already used it quite a bit and removed a bunch of violations [19:32:27] mutante: basically the job runs rake global:wmf_styleguide , then process the puppet-lint output and build some dashboard per file / modules :) [19:32:47] very nice [19:33:03] and networking people will like that the multicast thing is gone [19:33:25] (03CR) 10Dzahn: "wmf-style: total violations delta -2" [puppet] - 10https://gerrit.wikimedia.org/r/382140 (owner: 10Hashar) [19:33:36] finally i get to run puppet [19:33:39] :] [19:33:50] total no-op :) [19:34:13] hasharAway: we need the highscore list :) [19:34:24] he said there is a price, hehe [19:34:58] hmm [19:35:13] run the delta task on each patch one by one [19:35:19] and record the commiter / delta? :] [19:35:32] HIGH SCORE YOU FIXED 6 WARNINGS PUPPET [19:35:43] COMBO x 8 !!!!! [19:36:00] yes:) gamification ftw [19:36:54] it needs the total at the end of the year [19:37:04] then it is hard to assert the real score [19:37:15] a single fix can prove to be very hard / lot of work [19:37:40] anyway bathroom time! [19:37:59] it's ok, we just count total number of removed violations and then there is an incentive to start early and get the low-hanging fruit :p [19:38:38] at the end it gets harder like mining on a blockchain.. j/k [19:40:20] I will try my best to polish up the contint/beta etc related stuff [19:40:20] gets lunch [19:40:24] happy meal! 
[19:40:40] hasharAway: :) ok, cool,thx [19:41:17] the apache module is a problem, btw [19:41:41] in some cases i can remove a whole bunch of violations but also add new ones, so the delta is small or none [19:42:15] for all the things using apache module, like misc micro sites [19:42:28] it can only be fixed by converting the whole module to defines, afaict [19:42:37] ok, be back in a while [19:43:15] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [19:56:08] so.. how are the api servers these days in terms of load ? [20:00:02] as in.. i'm considering switching navigation popups to start making use the api instead of action=raw requests, and suddenly realize that with a gadget with that many users, it might actually put quite some extra load on the api servers... [20:00:07] (03CR) 10MarcoAurelio: [C: 04-1] "Per Task. No EDP/IUP apparently." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385359 (https://phabricator.wikimedia.org/T178660) (owner: 10Jayprakash12345) [20:02:26] thedj: it is unlikely you will get a response at this hour. It is probably just a slice of the total api traffic :) [20:03:15] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [20:03:59] thedj: https://grafana.wikimedia.org/dashboard/db/api-requests offers some insights [20:04:55] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3700598 (10BBlack) So, I did some varnishlog tracing on the frontend @zhuyifei1999 was hitting with a reproduction of this. I caught on... 
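The X-Cache values quoted earlier in this trace ("cp1049 miss, cp1074 pass" and similar) were used to decide which frontend to watch. A minimal sketch of pulling that host out of the header follows. The function names are hypothetical, and the idea that the last entry names the frontend is an assumption drawn only from the samples in this log.

```python
from typing import List, Optional, Tuple


def parse_x_cache(header: str) -> List[Tuple[str, str]]:
    """Split an X-Cache value such as 'cp1049 miss, cp1074 pass' into
    (host, status) pairs, in the order they appear in the header."""
    entries = []
    for part in header.split(","):
        fields = part.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def frontend_host(header: str) -> Optional[str]:
    """Assumption from the samples in this log: the last entry is the
    frontend cache that answered the client. Returns None if empty."""
    entries = parse_x_cache(header)
    return entries[-1][0] if entries else None


if __name__ == "__main__":
    print(frontend_host("cp1064 miss, cp1074 pass"))
```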
[20:05:16] zhuyifei1999_: https://phabricator.wikimedia.org/T178567#3700598 [20:06:09] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic, 10media-storage: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3700601 (10BBlack) [20:09:10] hasharAway: thx. will keep an eye out. [20:17:40] (03PS1) 10BBlack: cache_upload: wipe client Authorization headers on ingress [puppet] - 10https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567) [20:24:06] Api-User-Agent: Navigation popups/1.0 (en.wikipedia.org) [20:24:29] there, about time it started sending that header. [20:34:04] AaronSchulz: still clean? [20:35:14] * thedj hugs ori [20:36:13] bblack: sorry was away, will look [20:36:24] back at you, thedj! :) [20:39:13] ori: yep [20:39:45] * AaronSchulz was temped to make a joke about showering [20:40:11] hehe [20:40:45] well, thedj, api app server monitoring is ordinarily here: https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1 [20:40:51] but it is freshly broken [20:41:05] still, you can look at the history [20:42:18] seems like a couple thousands of extra requests won't kill it... [20:42:39] now to check with analytics [20:43:42] (03PS1) 10Hashar: beta: migrate autoupdater to a profile [puppet] - 10https://gerrit.wikimedia.org/r/385461 [20:47:47] hi ori [20:48:05] hi Nemo_bis :) [20:52:19] thedj: just do a gradual rollout. var ROLLOUT_PERCENT = 10; if ( mw.config.get( 'wgUserId' ) % 100 < ROLLOUT_PERCENT ) useApi() [20:52:43] and increment periodically if everything looks ok [20:53:00] ori: it's an idea... [20:56:31] ori: i'm slowly deciphering navigation popups into something that is readable [20:57:28] first proper maintenance this gadget is getting in 7 years time. if not 10... 
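The gradual rollout suggested above buckets users by user ID modulo 100. Here is a language-neutral sketch of that bucketing in Python (the gadget itself would be JavaScript). The names `in_rollout` and `rollout_percent` are illustrative assumptions. One detail worth checking in the real gadget: in JavaScript `null % 100` evaluates to 0, so if `wgUserId` is null for logged-out users (as I understand it to be), the one-liner would place all of them in the rollout. The sketch excludes a missing ID explicitly.

```python
from typing import Optional


def in_rollout(user_id: Optional[int], rollout_percent: int) -> bool:
    """Return True if this user falls in the first `rollout_percent`
    buckets out of 100, using user_id % 100 as the bucket number."""
    if user_id is None:
        # No user ID (for example a logged-out user): keep the old code path.
        return False
    return user_id % 100 < rollout_percent


if __name__ == "__main__":
    # Count how many of the IDs 1..1000 land in a 10 percent rollout.
    enabled = sum(in_rollout(uid, 10) for uid in range(1, 1001))
    print(enabled)
```

Because the bucket depends only on the user ID, a given user stays in or out of the rollout consistently, and raising the percentage only ever adds users.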
[20:57:58] (03PS6) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [20:59:04] (03PS11) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [20:59:07] (03PS16) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/368547 [21:00:56] (03PS4) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [21:02:40] (03PS1) 10Hashar: Migrate contint::firewall to a profile [puppet] - 10https://gerrit.wikimedia.org/r/385472 [21:05:09] (03CR) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [21:05:16] (03PS14) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [21:05:57] thedj: navigation popups, isn't it a MediaWiki extension nowadays? [21:06:14] oh https://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups [21:06:15] weird [21:06:49] that's page previews aka popups aks hovercards... 
(cause we suck at naming) [21:07:15] (03CR) 10Hashar: "Looks like it might well be a noop :]" [puppet] - 10https://gerrit.wikimedia.org/r/385472 (owner: 10Hashar) [21:20:55] (03PS1) 10Hashar: Add profile::labs::lvm::srv [puppet] - 10https://gerrit.wikimedia.org/r/385475 [21:23:56] (03PS1) 10Hashar: contint: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385476 [21:24:48] (03PS2) 10Hashar: Add profile::labs::lvm::srv [puppet] - 10https://gerrit.wikimedia.org/r/385475 [21:25:03] (03PS2) 10Hashar: contint: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385476 [21:27:38] (03PS1) 10Hashar: extdist: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385477 [21:30:54] (03CR) 10Hashar: "I think that one is solely on labs." [puppet] - 10https://gerrit.wikimedia.org/r/385477 (owner: 10Hashar) [21:31:39] (03PS1) 10Hashar: graphite: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385478 [21:33:37] (03CR) 10Hashar: "Included from role::labs::graphite but it seems unused according to https://tools.wmflabs.org/openstack-browser/puppetclass/role::labs::gr" [puppet] - 10https://gerrit.wikimedia.org/r/385478 (owner: 10Hashar) [21:36:42] (03PS1) 10Hashar: quarry: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385479 [21:38:57] (03CR) 10Hashar: "That would impact the quary labs project and the instances:" [puppet] - 10https://gerrit.wikimedia.org/r/385479 (owner: 10Hashar) [21:39:53] (03CR) 10Legoktm: "Yes, this is just used on labs :)" [puppet] - 10https://gerrit.wikimedia.org/r/385477 (owner: 10Hashar) [21:40:48] (03PS1) 10Hashar: prometheus: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385480 [21:42:40] (03CR) 10Smalyshev: [C: 031] wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938 
(owner: 10Gehel) [21:42:52] (03CR) 10Andrew Bogott: [C: 032] Add profile::labs::lvm::srv [puppet] - 10https://gerrit.wikimedia.org/r/385475 (owner: 10Hashar) [21:43:58] (03CR) 10Hashar: "For some role -> profile refactoring, I need it to be a profile :]" [puppet] - 10https://gerrit.wikimedia.org/r/385475 (owner: 10Hashar) [21:44:14] andrewbogott: awesome :] [21:44:57] andrewbogott: you merged it before I had a chance to justify the role wrapper ( it is everywhere https://tools.wmflabs.org/openstack-browser/puppetclass/role::labs::lvm::srv ) [21:45:21] yeah, we need to keep the role, probably indefinitely [21:45:35] that or search/replace throughout the puppet catalog for all instances :( [21:47:36] hasharAway: I don't understand about "require would ensure all resources it contains (eg Mount['/srv']) are released before the rest" [21:47:49] andrewbogott: I am not sure really :( [21:47:53] but in theory if you do: [21:47:57] class foo { [21:47:59] require bar [21:48:06] include stuff [21:48:07] } [21:48:17] then puppet will apply bar before [21:48:21] then process the class [21:48:34] ah, I see [21:48:37] so anything inside bar get realized before the content of foo [21:48:41] but I am not so sure about that [21:48:47] so you think the require Mount['/srv'] is redundant? [21:48:53] probably better to get a review from one of the experts [21:48:56] yeah [21:49:11] I think. 
But I am not sure [21:50:14] I'm pretty sure that anytime you think that things in order in a puppet class will get executed in order you're going to be disappointed [21:50:51] Hm, nope, you're right [21:50:54] https://docs.puppet.com/puppet/2.7/lang_relationships.html#the-require-function [21:52:18] (03CR) 10Andrew Bogott: [C: 032] quarry: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385479 (owner: 10Hashar) [21:53:15] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [21:56:32] andrewbogott: the order is semi random indeed. Apparently it is based on the md5sum of the resources titles/names [21:56:35] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:56:42] yeah [21:56:57] but requires (in this context) means it's traversed before the class [21:56:58] so we're good [21:57:24] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 579 bytes in 0.627 second response time [21:57:37] yeah hopefully :] [21:57:45] to double check one can provision a fresh instance with the class [21:57:49] but I feel lazy :( [21:58:22] on the good news /mnt is barely used anymore. role::labs::lvm::mnt is only on three instances :) [22:02:46] andrewbogott: thank you!! 
It is bed time for me :] [22:08:05] * andrewbogott waves [22:12:43] (03PS1) 10Dzahn: Revert "hhvm: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385482 [22:13:12] (03CR) 10jerkins-bot: [V: 04-1] Revert "hhvm: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385482 (owner: 10Dzahn) [22:13:42] (03PS2) 10Dzahn: Revert "hhvm: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385482 [22:14:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "hhvm: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385482 (owner: 10Dzahn) [22:15:40] (03CR) 10Dzahn: [V: 032 C: 032] Revert "hhvm: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385482 (owner: 10Dzahn) [22:16:34] has to do it like that unfortunately [22:17:38] (03PS1) 10Dzahn: Revert "apache: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385484 [22:18:00] (03CR) 10jerkins-bot: [V: 04-1] Revert "apache: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385484 (owner: 10Dzahn) [22:19:00] (03CR) 10Dzahn: [V: 032 C: 032] "sorry jenkins, i have to revert and re-add the historically existing style violation" [puppet] - 10https://gerrit.wikimedia.org/r/385484 (owner: 10Dzahn) [22:19:15] (03PS2) 10Dzahn: Revert "apache: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385484 [22:20:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "apache: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385484 (owner: 10Dzahn) [22:21:19] (03CR) 10Dzahn: [V: 032 C: 032] Revert "apache: remove ganglia monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/385484 (owner: 10Dzahn) [22:23:44] sigh, it seemed such a nice achievement but the grafana stats disappeared after this, i didnt expect them to depend on ganglia anymore at all [22:30:59] jouncebot seemed to have quit ^^ [22:31:50] jouncebot: resurrect [22:33:02] @tools-bastion-03:~$ become jouncebot [22:33:02] You 
are not a member of the group tools.jouncebot. [22:33:06] tried.. [22:33:12] I am. Hang on. [22:33:15] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [22:33:16] cool, thanks [22:34:42] thanks [22:34:53] jouncebot: refresh [22:34:57] I refreshed my knowledge about deployments. [22:37:24] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [22:37:31] I need to talk to someone about a reported security breach. I don't have enough technnical knowledge to deem it credibe or not, but someone has taken the time to write to us at OTRS about it [22:37:49] (perhaps wrong channel...) [22:38:29] Josve05a: please forward/email security@wikimedia.org [22:38:41] Thanks! [22:39:14] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 579 bytes in 7.711 second response time [22:39:19] greg-g: You would't know of a handy list of these emails? Such as to answers, legal, secuity etc.? [22:39:27] thinks.... resurrect-o-bot that monitors jouncebot and restarts it when needed. then let jouncebot monitor and restart resurrect-o-bot... [22:40:11] AI maintaining AI. What could go wrong? :P [22:40:21] Josve05a: legal@ and security@ both work [22:40:49] i mean, they are aliases, depends on the purpose. feel free to PM me to if you want [22:40:56] i can lookup other mail addresses [22:41:13] Niharika: hehe:) [22:41:13] mutante: I know they both work, but it would be nice to have a list of all possible departments of WMF stored on a help pagge on OTRS-wiki or so [22:41:42] Josve05a: I read "A Monster Calls" and it was amazing! Thank you. :) [22:42:17] Josve05a: https://wikimediafoundation.org/wiki/Contact_us [22:42:45] it should probably have more, yea.. 
hmm [22:43:00] there is a section for email addresses though we could extend [22:43:24] but there is legal, donate, info ..etc [22:44:04] and https://wikimediafoundation.org/wiki/Staff_and_contractors [22:44:10] I'll try and copy these and create a table-cheat sheet n OTRS wiki later :) [22:44:44] Niharika: Given that I've never heard of that book, I'm not sure why you'r thanking me - but good that you liked it! :) [22:45:15] Josve05a: It was at the book-sharing at Wikimania and it has your name on the front page. :P [22:46:13] oooh [22:46:26] right,I did buy and read that book at the airpor [22:46:30] airport* [22:48:08] Niharika: bring it to the next Wikimania, and let someone else read it :p [22:48:26] Of course. :) [22:51:57] greg-g: I emailed that adress, but now got an auto-reply from Katie Horn saying that she is away, and who to contact regarding fundraising... [22:52:02] Not sure what to do [22:52:58] Josve05a: it's a mail alias, it goes to a lot of people, some of those people are sometimes on vacation [22:53:01] Josve05a: the email has arrived at people's inboxes [22:53:20] ok, thanks [22:53:33] I recieved it [22:54:22] most likely nothing, but it is not something I could handle myself [22:54:39] thanks for asking [22:55:50] ^ hmm, that didn't last long. [22:56:42] Niharika: jouncebot left again already, i wonder if toollabs has more general issues [22:56:58] there was also this short alert and recovery about the toollabs webpage [22:57:56] Hmm, possible. Log has this- "2017-10-20T22:33:16Z ib3.mixins WARNING : type: error, source: None, target: Closing Link: wikimedia/bot/jouncebot (Ping timeout: 258 seconds), arguments: [], tags: []" [22:58:08] Not very informative. [22:59:25] jouncebot: Stay with us, buddy. [22:59:50] :) thx [23:00:44] and yea, not informative, i agree, could as well be "2017-10-20T22:33:16Z - WARN btw, i left" [23:01:20] well, ping timeout . hrmm [23:02:16] mutante: Want me to add you as a bot account member? 
[23:02:39] I'm totally always here but just in case I'm not and jouncebot misbehaves. [23:02:47] Niharika: ok, yes [23:03:53] mutante: Your LDAP username is the same, right? [23:04:42] Niharika: oh, actually no, i'm Dzahn [23:04:57] Done. [23:05:12] confirmed working. thx [23:06:20] "./jouncebot.sh restart" and "./jouncebot.sh tail" are pretty much all you want. [23:09:14] ok, cool, i found https://wikitech.wikimedia.org/wiki/Tool:Jouncebot#Running_the_bot [23:12:35] (03PS5) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [23:14:44] (03CR) 10Paladox: "This will fail on labs due to it using ipv6 here and this will break ssh." [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [23:14:48] (03CR) 10Paladox: [C: 04-1] gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [23:15:07] (03CR) 10Paladox: [C: 04-1] gerrit: dont let sshd listen on all interfaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [23:15:37] paladox: one day .. labs needs to have IPv6.. sigh.. always that part makes it an issue [23:15:46] but thx for reminding me, yea [23:15:54] mutante also it needs to be accessed by the domain. [23:16:04] using an ip will break ssh for the domain. [23:16:29] as it's using gerrit's internal ssh and not the servers ssh. [23:16:48] paladox: i think there is a misunderstand there, i'm not changing that [23:17:09] oh [23:17:11] listenAddress = <%= @ipv4 %>:29418 <%= @ipv6 %>:29418 [23:17:12] before: ssh is told to listen on any IP [23:17:23] after: it should just use the 2 correct ones [23:17:25] total it has 4 [23:17:30] yeh, but * can also get the domain if it's listening on the port. [23:19:09] paladox: how does that line currently look for you in labs config? [23:19:35] did we try this before? [23:20:00] hmm, haven't try it. 
[23:23:24] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [23:23:50] paladox: what do you really mean by "the domain" here [23:23:59] we have the server IP and the service IP [23:24:00] gerrit.wikimedia.org. [23:25:09] so $ipv4 [23:25:09] role/eqiad/gerrit/server.yaml:gerrit::service::ipv4: '208.80.154.85' [23:25:17] gerrit.wikimedia.org has address 208.80.154.85 [23:25:29] (03CR) 10Chad: "Yeah, you're right about ipv4/6. Ideally labs would support v6 as well, but barring that we should be a little smarter in this template an" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [23:28:04] paladox: looks like we need "if $realm" check but we dont like these [23:28:08] no_justification mutante, testing the change broke it. [23:28:18] it changes the domain to an ip in the download command box [23:28:22] and fails cloning too [23:28:30] ssh: connect to host 208.80.155.149 port 29418: Connection refused [23:28:30] fatal: Could not read from remote repository. [23:28:37] paladox: what does your "listenAddress" line look like? [23:28:47] listenAddress says it can listen based on the hostname too [23:28:55] listenAddress = 208.80.155.149:29418 [23:28:56] Maybe we should just do that rather than specify two IPs? 
[23:29:06] aha [23:29:18] that changes things for my understanding :) [23:29:29] 'hostname':'port' (for example review.example.com:29418) [23:29:32] Works apparently [23:29:32] i was thinking we are only talking about network interfaces [23:29:36] when it comes to Listen [23:29:50] and we just have to skip the v6 if on labs [23:29:55] You'd think [23:30:03] that is what paladox meant too then [23:30:08] yeh [23:30:09] by "listen on the domain" right [23:30:12] gotcha [23:30:25] That should simplify it a bit too [23:30:27] well yea, let's try that then [23:30:28] Let's try that [23:30:29] :) [23:31:05] 10Operations, 10Traffic, 10netops: Japanese hotel resolving to esams and going the wrong way round - https://phabricator.wikimedia.org/T178726#3700842 (10Reedy) [23:31:13] thant works [23:31:16] thant = that [23:31:28] 10Operations, 10Traffic, 10netops: Japanese hotel resolving to esams and going the long way round - https://phabricator.wikimedia.org/T178726#3700855 (10Reedy) [23:31:29] listenAddress = gerrit2.git.wmflabs.org:29418 [23:31:51] nice [23:32:50] we dont use $host in the template yet, but yea [23:33:23] That also solves the problem of "don't listen on cobalt.wm.o" [23:33:58] that was the purpose of the patch actually [23:34:03] to stop that [23:34:07] Yep [23:34:31] Using the domain solves that without adding a weird issue for labs w/ ipv6 and also keeps from busting other links (but that's insane, that's what canonicalUrl is for you'd think) [23:35:08] adds $host to gerrit::jetty [23:36:11] (03PS6) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [23:36:15] yes, agree to all of that [23:36:45] eh, one more PS [23:37:11] (03PS7) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [23:37:34] there, like this right [23:37:51] yep [23:39:57] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/8407/cobalt.wikimedia.org/" [puppet] - 
https://gerrit.wikimedia.org/r/354074 (owner: Dzahn) [23:40:48] well, what should happen on gerrit2001, btw [23:40:56] should it listen on gerrit-slave or not at all [23:41:07] (if it would work ) [23:41:32] we said we wanted to add --enable-ssh and --enable-http ? [23:41:34] or so [23:42:06] like that it would try to listen on gerrit.wikimedia.org also on slave and that won't work [23:42:51] yep [23:44:34] (PS1) Chad: Gerrit: move LDAP spaces around for hostnames [puppet] - https://gerrit.wikimedia.org/r/385491 [23:44:52] (CR) Chad: [V: 2 C: 2] Adding deleteproject @ stable-2.13 [software/gerrit] - https://gerrit.wikimedia.org/r/385117 (owner: Chad) [23:45:21] (CR) Paladox: [C: 1] Gerrit: move LDAP spaces around for hostnames [puppet] - https://gerrit.wikimedia.org/r/385491 (owner: Chad) [23:45:27] SSH on the slave isn't really necessary I s'pose [23:46:08] (PS8) Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074 [23:46:13] (CR) Paladox: [C: 1] gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn) [23:46:29] (CR) jerkins-bot: [V: -1] gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn) [23:46:41] !log demon@tin Started deploy [gerrit/gerrit@21b7332]: pushing deleteproject plugin [23:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:51] !log demon@tin Finished deploy [gerrit/gerrit@21b7332]: pushing deleteproject plugin (duration: 00m 09s) [23:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:14] (CR) Paladox: [C: 1] gerrit: dont let sshd listen on all interfaces (2 comments) [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn) [23:48:28] :) [23:49:15] (PS9) Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] -
https://gerrit.wikimedia.org/r/354074 [23:49:39] (CR) jerkins-bot: [V: -1] gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn) [23:51:54] yea, i'm wrong :) [23:52:07] what's the right variable with the slave name :p [23:52:10] looks [23:52:14] slave_host [23:52:38] $slave_hosts [23:52:46] thanks, gerrit::jetty needs that too then [23:52:51] oh yea [23:52:55] yep [23:53:16] ok, at this point it's not that easy anymore [23:53:26] because we would have to know which of the possibly multiple slaves it is [23:53:33] yeh [23:54:01] how to disable the entire ssh section [23:54:06] just skip it all? [23:54:34] if not $slave .. [23:57:21] :'port' may be omitted to use the default of 29418. [23:57:36] To disable the internal SSHD, set listenAddress to off. [23:57:38] aha! [23:57:42] gonna use that [23:59:05] (PS10) Dzahn: gerrit-ssh: don't listen on all interfaces, disable on slaves [puppet] - https://gerrit.wikimedia.org/r/354074
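
The outcome of the discussion above (listen on the service hostname on the master, set listenAddress to off on slaves) can be sketched roughly as follows. This is an illustration only, not the actual change, which lives in the Puppet template for gerrit::jetty. The function name `render_sshd_section` and its parameters are hypothetical. The `host:port` form, the 29418 default and the `off` value come from the documentation lines quoted in the log.

```python
# Hypothetical helper, not part of Gerrit or the puppet repo: it only
# illustrates the decision reached in the conversation.
DEFAULT_SSH_PORT = 29418  # default port quoted in the log


def render_sshd_section(host, is_slave=False, port=DEFAULT_SSH_PORT):
    """Return the [sshd] section of gerrit.config as a string.

    On a slave the internal SSHD is disabled with 'off'; otherwise it
    listens only on the given hostname instead of on all interfaces.
    """
    value = "off" if is_slave else f"{host}:{port}"
    return f"[sshd]\n\tlistenAddress = {value}\n"


if __name__ == "__main__":
    # Labs test instance mentioned in the log
    print(render_sshd_section("gerrit2.git.wmflabs.org"))
    # A slave: the hostname is ignored and SSHD is switched off
    print(render_sshd_section("gerrit.wikimedia.org", is_slave=True))
```

Passing the hostname rather than two IPs avoids the "if $realm" check for labs without IPv6, and the `off` branch avoids having to work out which of several slaves a host is.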